Signals and Communication Technology

Frank Nielsen Editor

Geometric
Theory
of Information
For further volumes:
http://www.springer.com/series/4748
Editor
Frank Nielsen
Sony Computer Science Laboratories Inc
Shinagawa-Ku, Tokyo
Japan

and
Laboratoire d’Informatique (LIX)
Ecole Polytechnique
Palaiseau Cedex
France

ISSN 1860-4862 ISSN 1860-4870 (electronic)


ISBN 978-3-319-05316-5 ISBN 978-3-319-05317-2 (eBook)
DOI 10.1007/978-3-319-05317-2
Springer Cham Heidelberg New York Dordrecht London

Library of Congress Control Number: 2014939042

© Springer International Publishing Switzerland 2014


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or
information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed. Exempted from this legal reservation are brief
excerpts in connection with reviews or scholarly analysis or material supplied specifically for the
purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the
work. Duplication of this publication or parts thereof is permitted only under the provisions of
the Copyright Law of the Publisher’s location, in its current version, and permission for use must
always be obtained from Springer. Permissions for use may be obtained through RightsLink at the
Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt
from the relevant protective laws and regulations and therefore free for general use.
While the advice and information in this book are believed to be true and accurate at the date of
publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for
any errors or omissions that may be made. The publisher makes no warranty, express or implied, with
respect to the material contained herein.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)


To Audrey Léna and Julien Léo,
To my whole family, heartily
Preface

The first conference on the Geometric Science of Information (GSI; the website at http://www.gsi2013.org/ includes slides and photos) took place in downtown Paris (France) in August 2013. The call for papers received an enthusiastic worldwide response, resulting in about 100 accepted submissions of an average length of eight pages, organized into the following broad topics:
• Computational Information Geometry,
• Hessian/Symplectic Information Geometry,
• Optimization on Matrix Manifolds,
• Probability on Manifolds,
• Optimal Transport Geometry,
• Divergence Geometry and Ancillarity,
• Machine/Manifold/Topology Learning,
• Tensor-Valued Mathematical Morphology,
• Differential Geometry in Signal Processing,
• Geometry of Audio Processing,
• Geometry for Inverse Problems,
• Shape Spaces: Geometry and Statistic,
• Geometry of Shape Variability,
• Relational Metric,
• Discrete Metric Spaces,
• Etc.
The GSI proceedings have been published as a (thick!) Springer LNCS volume (number 8085, XX, 879 pages, 157 illustrations). Since we could not accommodate long papers in GSI, we decided to solicit renowned researchers to contribute full-length chapters on the latest advances in information geometry and some of its applications (like computational anatomy, image morphology, statistics, and textures, to cite a few examples). Those selected applications emphasize algorithmic aspects when programming the Methods of Information Geometry [1].
The 13 chapters of this book have been organized as follows:


• Divergence Functions and Geometric Structures They Induce on a Manifold by Jun Zhang,
• Geometry on Positive Definite Matrices Deformed by V-Potentials and Its
Submanifold Structure by Atsumi Ohara and Shinto Eguchi,
• Hessian Structures and Divergence Functions on Deformed Exponential
Families by Hiroshi Matsuzoe and Masayuki Henmi,
• Harmonic Maps Relative to α-Connections by Keiko Uohashi,
• A Riemannian Geometry in the q-Exponential Banach Manifold Induced by
q-Divergences by Héctor R. Quiceno, Gabriel I. Loaiza and Juan C. Arango,
• Computational Algebraic Methods in Efficient Estimation by Kei Kobayashi and
Henry P. Wynn,
• Eidetic Reduction of Information Geometry Through Legendre Duality of Koszul
Characteristic Function and Entropy: From Massieu–Duhem Potentials to
Geometric Souriau Temperature and Balian Quantum Fisher Metric by Frédéric
Barbaresco,
• Distances on Spaces of High-Dimensional Linear Stochastic Processes:
A Survey, by Bijan Afsari and René Vidal,
• Discrete Ladders for Parallel Transport in Transformation Groups with an
Affine Connection Structure by Marco Lorenzi and Xavier Pennec,
• A Diffeomorphic Iterative Centroid Method, by Claire Cury, Joan A. Glaunès
and Olivier Colliot,
• Hartigan’s Method for k-MLE: Mixture Modeling with Wishart Distributions and
Its Application to Motion Retrieval by Christophe Saint-Jean and Frank Nielsen,
• Morphological Processing of Univariate Gaussian Distribution-Valued Images
Based on Poincaré Upper-Half Plane Representation by Jesús Angulo and
Santiago Velasco-Forero,
• Dimensionality Reduction for Classification of Stochastic Texture Images by
C. T. J. Dodson and W. W. Sampson.
There is an exciting time ahead for computational information geometry in
studying the fundamental concepts and relationships of Information, Geometry,
and Computation!

Acknowledgments

First of all, I would like to thank the chapter contributors for providing us with the
latest advances in information geometry, its computational methods, and appli-
cations. I express my gratitude to the peer reviewers for their careful feedback that
led to this polished, revised work. Each chapter received from two to eight review
reports, with an average number of about three to five reviews per chapter.
I thank the following reviewers (in alphabetical order of their first name):
Akimichi Takemura, Andrew Wood, Anoop Cherian, Arnaud Dessein, Atsumi Ohara, Bijan Afsari, Frank Critchley, Frank Nielsen, Giovanni Pistone, Hajime Urakawa, Hirohiko Shima, Hiroshi Matsuzoe, Hitoshi Furuhata, Isabelle Bloch, Keiko Uohashi, Lipeng Ning, Manfred Deistler, Masatoshi Funabashi, Mauro Dalla-Mura, Olivier Alata, Richard Nock, Silvère Bonnabel, Stéphanie Allassonnière, Stefan Sommer, Stephen Marsland, Takashi Kurose, Tryphon T. Georgiou, Yoshihiro Ohnita, and Yu Fujimoto.
I would also like to reiterate my warmest thanks to our scientific, organizing,
and financial sponsors: CNRS (GdRs MIA and Maths and Entreprises), Ecole des
Mines de Paris, Supelec, Université Paris-Sud, Institut Mathématique de
Bordeaux, Société de l’Electricité, de l’Électronique et des technologies de
l’information et de la communication (SEE), Société Mathématique de France
(SMF), Sony Computer Science Laboratories, and Thales.
Last but not least, I am personally indebted to Dr. Mario Tokoro and
Dr. Hiroaki Kitano (Sony Computer Science Laboratories, Inc) for their many
encouragements and continuing guidance over the years.

Tokyo, January 2014 Frank Nielsen

Reference

1. Amari, S., Nagaoka, H.: Methods of Information Geometry. AMS Monograph. Oxford University Press, Oxford (2000)
Contents

1 Divergence Functions and Geometric Structures They Induce on a Manifold
Jun Zhang

2 Geometry on Positive Definite Matrices Deformed by V-Potentials and Its Submanifold Structure
Atsumi Ohara and Shinto Eguchi

3 Hessian Structures and Divergence Functions on Deformed Exponential Families
Hiroshi Matsuzoe and Masayuki Henmi

4 Harmonic Maps Relative to α-Connections
Keiko Uohashi

5 A Riemannian Geometry in the q-Exponential Banach Manifold Induced by q-Divergences
Héctor R. Quiceno, Gabriel I. Loaiza and Juan C. Arango

6 Computational Algebraic Methods in Efficient Estimation
Kei Kobayashi and Henry P. Wynn

7 Eidetic Reduction of Information Geometry Through Legendre Duality of Koszul Characteristic Function and Entropy: From Massieu–Duhem Potentials to Geometric Souriau Temperature and Balian Quantum Fisher Metric
Frédéric Barbaresco

8 Distances on Spaces of High-Dimensional Linear Stochastic Processes: A Survey
Bijan Afsari and René Vidal

9 Discrete Ladders for Parallel Transport in Transformation Groups with an Affine Connection Structure
Marco Lorenzi and Xavier Pennec

10 Diffeomorphic Iterative Centroid Methods for Template Estimation on Large Datasets
Claire Cury, Joan Alexis Glaunès and Olivier Colliot

11 Hartigan’s Method for k-MLE: Mixture Modeling with Wishart Distributions and Its Application to Motion Retrieval
Christophe Saint-Jean and Frank Nielsen

12 Morphological Processing of Univariate Gaussian Distribution-Valued Images Based on Poincaré Upper-Half Plane Representation
Jesús Angulo and Santiago Velasco-Forero

13 Dimensionality Reduction for Classification of Stochastic Texture Images
C. T. J. Dodson and W. W. Sampson

Index
Chapter 1
Divergence Functions and Geometric Structures
They Induce on a Manifold

Jun Zhang

Abstract Divergence functions play a central role in information geometry. Given a manifold M, a divergence function D is a smooth, nonnegative function on the product manifold M × M that achieves its global minimum of zero (with semi-positive definite Hessian) at those points that form its diagonal submanifold $\Delta_M \subset M \times M$. In this chapter, we review how such divergence functions induce (i) a statistical structure (i.e., a Riemannian metric with a pair of conjugate affine connections) on M; (ii) a symplectic structure on M × M if they are “proper”; (iii) a Kähler structure on M × M if they further satisfy a certain condition. It is then shown that the class of $D_\Phi$-divergence functions [23], as induced by a strictly convex function Φ on M, satisfies all these requirements and hence makes M × M a Kähler manifold (with Kähler potential given by Φ). This provides a larger context for the α-Hessian structure induced by the $D_\Phi$-divergence on M, which is shown to be equiaffine admitting α-parallel volume forms and biorthogonal coordinates generated by Φ and its convex conjugate $\Phi^*$. As the α-Hessian structure is dually flat for α = ±1, the $D_\Phi$-divergence provides richer geometric structures (compared to Bregman divergence) to the manifold M on which it is defined.

1.1 Introduction

Divergence functions (also called “contrast functions”, “yoke”) are non-symmetric measurements of proximity. They play a central role in statistical inference, machine learning, optimization, and many other fields. The most familiar examples include Kullback-Leibler divergence, Bregman divergence [4], α-divergence [1], f-divergence [6], etc. Divergence functions are also a key construct of information geometry. Just as $L^2$-distance is associated with Euclidean geometry, Bregman divergence and Kullback-Leibler divergence are associated with a pair of flat structures (where flatness means free of torsion and free of curvature) that are “dual” to each other; this is called Hessian geometry [18, 19] and it is the dualistic extension of Euclidean geometry. So just as Riemannian geometry extends Euclidean geometry by allowing non-trivial metric structure, Hessian geometry extends Euclidean geometry by allowing non-trivial affine connections that come in pairs. The pairing of connections is with respect to a Riemannian metric g, which is uniquely specified in the case of Hessian geometry; yet the metric-induced Levi-Civita connection has non-zero curvature in general. The apparent inconvenience is offset by the existence of biorthogonal coordinates in any dually flat (i.e., Hessian) structure and a canonical divergence, along with the tools of convex analysis, which are powerful in many practical applications.

J. Zhang (B)
Department of Psychology and Department of Mathematics, University of Michigan, Ann Arbor, MI 48109, USA
e-mail: [email protected]

F. Nielsen (ed.), Geometric Theory of Information, Signals and Communication Technology, DOI: 10.1007/978-3-319-05317-2_1, © Springer International Publishing Switzerland 2014
In a quite general setting, any divergence function induces a Riemannian metric and a pair of torsion-free connections on the manifold where it is defined [8]. This so-called statistical structure is at the core of information geometry. Recently, other geometric structures induced by divergence functions have been investigated, including conformal structure [17], symplectic structure [3, 28], and complex structure [28].
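As a small numerical illustration of these two points (non-symmetry, and a metric induced by a divergence), the following Python sketch uses the Kullback-Leibler divergence on the one-parameter Bernoulli family. The relation $g = -\partial_x \partial_y D(x, y)|_{y=x}$ used here is the standard recipe for the induced metric and is an assumption at this point of the text; the function names `kl_bernoulli` and `induced_metric` are hypothetical helpers introduced only for this example.

```python
import math


def kl_bernoulli(p, q):
    """Kullback-Leibler divergence D(p || q) between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def induced_metric(divergence, x, h=1e-3):
    """Central-difference estimate of g(x) = -d^2 D(x, y) / (dx dy) at y = x."""
    mixed = (divergence(x + h, x + h) - divergence(x + h, x - h)
             - divergence(x - h, x + h) + divergence(x - h, x - h)) / (4.0 * h * h)
    return -mixed


p, q = 0.2, 0.5
d_pq = kl_bernoulli(p, q)               # D(p || q)
d_qp = kl_bernoulli(q, p)               # D(q || p), differs from D(p || q)
g_numeric = induced_metric(kl_bernoulli, p)
g_fisher = 1.0 / (p * (1.0 - p))        # Fisher information of Bernoulli(p)

print(f"D(p||q) = {d_pq:.5f}, D(q||p) = {d_qp:.5f}")
print(f"induced metric ~ {g_numeric:.4f}, Fisher information = {g_fisher:.4f}")
```

The two divergence values differ, and the finite-difference estimate of the induced metric should agree with the Fisher information of the Bernoulli family up to discretization error.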
The goal of this chapter is to review the relationship between divergence functions and various information geometric structures. In Sect. 1.2, we provide background material on various geometric structures on a manifold. In Sect. 1.3, we show how these structures can be induced from a divergence function. Starting from a general divergence function, which always induces a statistical structure, we define the notion of “properness” for it to be a generating function of a symplectic structure. Imposing a further condition leads to complexification of the product manifold where divergence functions are defined. In Sect. 1.4, we show that a quite broad class of divergence functions, $D_\Phi$-divergence functions [23] as induced by a strictly convex function, satisfies all these requirements and induces a Kähler structure (Riemannian and complex structures simultaneously) on the tangent bundle. Therefore, just as the full-fledged α-Hessian geometry extends the dually-flat Hessian manifold (α = ±1), $D_\Phi$-divergence generalizes Bregman divergence in the “nicest” way possible. Section 1.5 closes with a summary of this approach to information geometric structures through divergence functions.

1.2 Background: Structures on Smooth Manifolds

1.2.1 Differentiable Manifold: Metric and Connection Structures on TM

A differentiable manifold M is a space which locally “looks like” a Euclidean space $\mathbb{R}^n$. By “looks like”, we mean that for any base (reference) point x ∈ M, there exists a bijective mapping (“coordinate functions”) between the neighborhood of x (i.e., a patch of the manifold) and a subset V of $\mathbb{R}^n$. By locally, we mean that various such mappings must be smoothly related to one another (if they are centered at the same reference point) or consistently glued together (if they are centered at different reference points). Globally, they must cover the entire manifold. Below, we assume that a coordinate system is chosen such that each point is indexed by a vector in V, with the origin as the reference point.
A manifold is specified with certain structures. First, there is an inner-product structure associated with tangent spaces of the manifold. This is given by the metric 2-tensor field g which is, when evaluated at each location x, a symmetric bilinear form g(·, ·) of tangent vectors $X, Y \in T_x(M) \simeq \mathbb{R}^n$ such that g(X, X) is always positive for all non-zero vectors X ∈ V. In local “holonomic” coordinates¹ with bases $\partial_i \equiv \partial/\partial x^i$, $i = 1, \ldots, n$ (i.e., X, Y are expressed as $X = \sum_i X^i \partial_i$, $Y = \sum_i Y^i \partial_i$), the components of g are denoted as

$$g_{ij}(x) = g(\partial_i, \partial_j). \tag{1.1}$$

The metric tensor allows us to define distance on a manifold as the length of the shortest curve (called “geodesic”) connecting two points, and to measure angles and hence define orthogonality of vectors; projections of vectors to a lower dimensional submanifold become possible once a metric is given. The metric tensor also provides a linear isomorphism of the tangent space with the cotangent space at any point on the manifold.
Second, there is a structure implementing the notion of “parallelism” of vector fields and curviness of a manifold. This is given by the affine (linear) connection ∇, mapping two vector fields X and Y to a third one denoted by $\nabla_Y X$: $(X, Y) \mapsto \nabla_Y X$. Intuitively, it represents the “intrinsic” difference of a tangent vector X(x) at point x and another tangent vector X(x′) at a nearby point x′, which is connected to x in the direction given by the tangent vector Y(x). Here “intrinsic” means that vector comparison across two neighboring points of the manifold is through a process called “parallel transport,” whereby vector components are adjusted as the vector moves across points on the base manifold. Under the local coordinate system with bases $\partial_i \equiv \partial/\partial x^i$, components of ∇ can be written out in its “contravariant” form denoted $\Gamma^l_{ij}(x)$:

$$\nabla_{\partial_i} \partial_j = \sum_l \Gamma^l_{ij}\, \partial_l. \tag{1.2}$$

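Under coordinate transform $x \mapsto \tilde{x}$, the new coefficients $\tilde{\Gamma}$ are related to old ones Γ via

$$\tilde{\Gamma}^l_{mn}(\tilde{x}) = \sum_k \left( \sum_{i,j} \frac{\partial x^i}{\partial \tilde{x}^m} \frac{\partial x^j}{\partial \tilde{x}^n}\, \Gamma^k_{ij}(x) + \frac{\partial^2 x^k}{\partial \tilde{x}^m \partial \tilde{x}^n} \right) \frac{\partial \tilde{x}^l}{\partial x^k}. \tag{1.3}$$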

¹ A holonomic coordinate system means that the coordinates have been properly “scaled” in unit-length with respect to each other such that the directional derivatives commute: their Lie bracket $[\partial_i, \partial_j] = \partial_i \partial_j - \partial_j \partial_i = 0$, i.e., the mixed partial derivatives are exchangeable in their order of application.

A curve whose tangent vectors are parallel along the curve is said to be
“auto-parallel”.
As a primitive on a manifold, affine connections can be characterized in terms of their (i) torsion and (ii) curvature. The torsion T of a connection Γ, which is a tensor itself, is given by the asymmetric part of the connection $T(\partial_i, \partial_j) = \nabla_{\partial_i} \partial_j - \nabla_{\partial_j} \partial_i = \sum_k T^k_{ij}\, \partial_k$, where $T^k_{ij}$ is its local representation given as

$$T^k_{ij}(x) = \Gamma^k_{ij}(x) - \Gamma^k_{ji}(x).$$

The curviness/flatness of a connection Γ is described by the Riemann curvature tensor R, defined as

$$R(\partial_i, \partial_j)\partial_k = (\nabla_{\partial_i} \nabla_{\partial_j} - \nabla_{\partial_j} \nabla_{\partial_i})\partial_k.$$

Writing $R(\partial_i, \partial_j)\partial_k = \sum_l R^l_{kij}\, \partial_l$ and substituting (1.2), the components of the Riemann curvature tensor are²

$$R^l_{kij}(x) = \frac{\partial \Gamma^l_{jk}(x)}{\partial x^i} - \frac{\partial \Gamma^l_{ik}(x)}{\partial x^j} + \sum_m \Gamma^l_{im}(x)\, \Gamma^m_{jk}(x) - \sum_m \Gamma^l_{jm}(x)\, \Gamma^m_{ik}(x).$$

By definition, $R^l_{kij}$ is anti-symmetric when $i \leftrightarrow j$.

A connection is said to be flat when $R^l_{kij}(x) \equiv 0$ and $T^k_{ij} \equiv 0$. Note that this is a tensorial condition, so that the flatness of a connection ∇ is a coordinate-independent property even though the local expression of the connection (in terms of Γ) is coordinate-dependent. For any flat connection, there exists a local coordinate system under which $\Gamma^k_{ij}(x) \equiv 0$ in a neighborhood; this is the affine coordinate for the given flat connection.
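The coordinate-independence of flatness can be checked numerically. The sketch below evaluates the component formula for $R^l_{kij}$ given above by central finite differences for two sets of connection coefficients that are assumed here in closed form (they are not taken from this chapter): polar coordinates $(r, \theta)$ on the Euclidean plane, where $\Gamma \neq 0$ yet the curvature vanishes, and coordinates $(\theta, \phi)$ on the unit sphere, which is not flat. All function names are hypothetical.

```python
import math


def zeros(n):
    """n x n x n nested list of zeros, indexed as G[l][i][j] = Gamma^l_ij."""
    return [[[0.0] * n for _ in range(n)] for _ in range(n)]


def gamma_polar(x):
    """Connection coefficients for polar coordinates x = (r, theta) on the plane."""
    r = x[0]
    G = zeros(2)
    G[0][1][1] = -r                       # Gamma^r_{theta theta}
    G[1][0][1] = G[1][1][0] = 1.0 / r     # Gamma^theta_{r theta} = Gamma^theta_{theta r}
    return G


def gamma_sphere(x):
    """Connection coefficients for coordinates x = (theta, phi) on the unit sphere."""
    th = x[0]
    G = zeros(2)
    G[0][1][1] = -math.sin(th) * math.cos(th)               # Gamma^theta_{phi phi}
    G[1][0][1] = G[1][1][0] = math.cos(th) / math.sin(th)   # Gamma^phi_{theta phi}
    return G


def curvature(gamma, x, l, k, i, j, h=1e-5):
    """R^l_kij = d_i Gamma^l_jk - d_j Gamma^l_ik + sum_m (Gamma^l_im Gamma^m_jk - Gamma^l_jm Gamma^m_ik)."""
    n = len(x)

    def d(axis, a, b, c):
        # central difference of Gamma^a_bc along coordinate `axis`
        xp, xm = list(x), list(x)
        xp[axis] += h
        xm[axis] -= h
        return (gamma(xp)[a][b][c] - gamma(xm)[a][b][c]) / (2.0 * h)

    G = gamma(x)
    quad = sum(G[l][i][m] * G[m][j][k] - G[l][j][m] * G[m][i][k] for m in range(n))
    return d(i, l, j, k) - d(j, l, i, k) + quad


def max_abs_curvature(gamma, x):
    """Largest |R^l_kij| over all index combinations at the point x."""
    n = len(x)
    return max(abs(curvature(gamma, x, l, k, i, j))
               for l in range(n) for k in range(n)
               for i in range(n) for j in range(n))


flat_residual = max_abs_curvature(gamma_polar, [2.0, 0.3])
sphere_component = curvature(gamma_sphere, [1.0, 0.5], 0, 1, 0, 1)
print(flat_residual, sphere_component, math.sin(1.0) ** 2)
```

Both sets of coefficients are symmetric in their lower indices, so the torsion vanishes in both cases; only the curvature distinguishes them. For the sphere, the component $R^\theta_{\phi\theta\phi}$ should come out close to $\sin^2\theta$.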
In the above discussions, metric and connections are treated as separate structures on a manifold. When both are defined on the same manifold, then it is convenient to express affine connection Γ in its “covariant” form

$$\Gamma_{ij,k} = g(\nabla_{\partial_i} \partial_j, \partial_k) = \sum_l g_{lk}\, \Gamma^l_{ij}. \tag{1.4}$$

Though $\Gamma^k_{ij}$ is the more primitive quantity that does not involve metric, $\Gamma_{ij,k}$ represents the projection of Γ onto the manifold spanned by the bases $\partial_k$. The covariant form of Riemann curvature is (c.f. footnote 2)

$$R_{lkij} = \sum_m g_{lm}\, R^m_{kij}.$$

² This component-wise notation of the Riemann curvature tensor follows standard differential geometry textbooks, such as [16]. On the other hand, information geometers, such as [2], adopt the notation $R(\partial_i, \partial_j)\partial_k = \sum_l R^l_{ijk}\, \partial_l$, with $R_{ijkl} = \sum_m R^m_{ijk}\, g_{ml}$.

When the connection is torsion free, $R_{lkij}$ is anti-symmetric when $i \leftrightarrow j$ or when $k \leftrightarrow l$, and symmetric when $(i, j) \leftrightarrow (l, k)$. It is related to the Ricci tensor Ric via $\mathrm{Ric}_{kj} = \sum_{i,l} R_{lkij}\, g^{il}$.

1.2.2 Coupling Between Metric and Connection: Statistical Structure

A fundamental theorem of Riemannian geometry states that given a metric, there is a unique connection (among the class of torsion-free connections) that “preserves” the metric, i.e., the following condition is satisfied

$$\partial_k\, g(\partial_i, \partial_j) = g(\widehat{\nabla}_{\partial_k} \partial_i, \partial_j) + g(\partial_i, \widehat{\nabla}_{\partial_k} \partial_j). \tag{1.5}$$

Such a connection, denoted as $\widehat{\nabla}$, is called the Levi-Civita connection. Its component forms, called Christoffel symbols, are specified by the components of the metric tensor as (“Christoffel symbols of the second kind”)

$$\widehat{\Gamma}^k_{ij} = \sum_l \frac{g^{kl}}{2} \left( \frac{\partial g_{il}}{\partial x^j} + \frac{\partial g_{jl}}{\partial x^i} - \frac{\partial g_{ij}}{\partial x^l} \right)$$

and (“Christoffel symbols of the first kind”)

$$\widehat{\Gamma}_{ij,k} = \frac{1}{2} \left( \frac{\partial g_{ik}}{\partial x^j} + \frac{\partial g_{jk}}{\partial x^i} - \frac{\partial g_{ij}}{\partial x^k} \right).$$

The Levi-Civita connection $\widehat{\Gamma}$ is compatible with the metric g, in the sense that it treats tangent vectors of the shortest curves on a manifold as being parallel (equivalently speaking, it treats geodesics as auto-parallel curves).
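The two Christoffel formulas above translate directly into code. The following Python sketch differentiates a metric numerically and assembles the symbols of the first and second kind; it is restricted to two dimensions so that the matrix inverse can be written by hand. The test metric (the Euclidean plane in polar coordinates, $g = \mathrm{diag}(1, r^2)$) and the function names are assumptions made for illustration only.

```python
def polar_metric(x):
    """Euclidean metric in polar coordinates x = (r, theta): g = diag(1, r^2)."""
    r = x[0]
    return [[1.0, 0.0], [0.0, r * r]]


def inverse_2x2(g):
    """Inverse of a 2 x 2 matrix given as nested lists."""
    det = g[0][0] * g[1][1] - g[0][1] * g[1][0]
    return [[g[1][1] / det, -g[0][1] / det],
            [-g[1][0] / det, g[0][0] / det]]


def christoffel(metric, x, h=1e-5):
    """Return (first, second) with first[i][j][k] = Gamma_ij,k and second[k][i][j] = Gamma^k_ij.

    Metric derivatives are approximated by central differences; 2-D metrics only.
    """
    n = len(x)
    dg = []  # dg[a][i][j] = d g_ij / d x^a
    for a in range(n):
        xp, xm = list(x), list(x)
        xp[a] += h
        xm[a] -= h
        gp, gm = metric(xp), metric(xm)
        dg.append([[(gp[i][j] - gm[i][j]) / (2.0 * h) for j in range(n)]
                   for i in range(n)])
    # First kind: (1/2) (d_j g_ik + d_i g_jk - d_k g_ij)
    first = [[[0.5 * (dg[j][i][k] + dg[i][j][k] - dg[k][i][j]) for k in range(n)]
              for j in range(n)] for i in range(n)]
    # Second kind: raise the last index with the inverse metric
    g_inv = inverse_2x2(metric(x))
    second = [[[sum(g_inv[k][l] * first[i][j][l] for l in range(n)) for j in range(n)]
               for i in range(n)] for k in range(n)]
    return first, second


first, second = christoffel(polar_metric, [2.0, 0.7])
print(second[0][1][1])   # Gamma^r_{theta theta}, expected close to -r
print(second[1][0][1])   # Gamma^theta_{r theta}, expected close to 1/r
```

At r = 2 the sketch should reproduce the textbook values $\widehat{\Gamma}^r_{\theta\theta} = -r$ and $\widehat{\Gamma}^\theta_{r\theta} = 1/r$ up to discretization error.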
It turns out that one can define a kind of “compatibility” relation more general than expressed by (1.5), by introducing the notion of “conjugacy” (denoted by ∗) between two connections. A connection $\nabla^*$ is said to be conjugate (or dual) to ∇ with respect to g if

$$\partial_k\, g(\partial_i, \partial_j) = g(\nabla_{\partial_k} \partial_i, \partial_j) + g(\partial_i, \nabla^*_{\partial_k} \partial_j). \tag{1.6}$$

Clearly, $(\nabla^*)^* = \nabla$. Moreover, $\widehat{\nabla}$, which satisfies (1.5), is special in the sense that it is self-conjugate: $(\widehat{\nabla})^* = \widehat{\nabla}$.

Because metric tensor g provides a one-to-one mapping between points in the tangent space (i.e., vectors) and points in the cotangent space (i.e., co-vectors), (1.6) can also be seen as characterizing how co-vector fields are to be parallel-transported in order to preserve their dual pairing ⟨·, ·⟩ with vector fields.

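Writing out (1.6):

$$\frac{\partial g_{ij}}{\partial x^k} = \Gamma_{ki,j} + \Gamma^*_{kj,i}, \tag{1.7}$$

where, analogous to (1.2) and (1.4),

$$\nabla^*_{\partial_i} \partial_j = \sum_l \Gamma^{*l}_{ij}\, \partial_l$$

so that

$$\Gamma^*_{kj,i} = g(\nabla^*_{\partial_k} \partial_j, \partial_i) = \sum_l g_{il}\, \Gamma^{*l}_{kj}.$$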

There is an alternative way of imposing a “compatibility” condition between a metric g and a connection ∇, by investigating how the metric tensor g behaves under ∇. We introduce a 3-tensor field, called “cubic form”, as the covariant derivative of g: C = ∇g, or in component forms

$$C(\partial_i, \partial_j, \partial_k) = (\nabla_{\partial_k} g)(\partial_i, \partial_j) = \partial_k\, g(\partial_i, \partial_j) - g(\nabla_{\partial_k} \partial_i, \partial_j) - g(\partial_i, \nabla_{\partial_k} \partial_j).$$

Writing out the above:

$$C_{ijk} = \frac{\partial g_{ij}}{\partial x^k} - \Gamma_{ki,j} - \Gamma_{kj,i} \quad (= \Gamma^*_{kj,i} - \Gamma_{kj,i}).$$

From its definition, $C_{ijk} = C_{jik}$, that is, C is symmetric with respect to its first two indices. It can be further shown that:

$$C_{ijk} - C_{ikj} = \sum_l g_{il}\, (T^l_{jk} - T^{*l}_{jk})$$

where T, T* are the torsions of ∇ and ∇*, respectively. Therefore, $C_{ijk} = C_{ikj}$, and hence C is totally symmetric in all (pairwise permutation of) indices, when $T^l_{jk} = T^{*l}_{jk}$. So conceptually, requiring $C_{ijk}$ to be totally symmetric imposes a compatibility condition between g and ∇, making them the so-called “Codazzi pair” (see [20]). The Codazzi pairing generalizes the Levi-Civita coupling whose corresponding cubic form $C_{ijk}$ is easily seen to be identically zero. Lauritzen [10] defined a “statistical manifold” (M, g, ∇) to be a manifold M equipped with g and ∇ such that (i) ∇ is torsion free; (ii) ∇g ≡ C is totally symmetric. Equivalently, a manifold is said to have statistical structure when the conjugate connection ∇* (with respect to g) of a torsion-free connection ∇ is also torsion-free. In this case, ∇*g = −C, and the Levi-Civita connection is $\widehat{\nabla} = (\nabla + \nabla^*)/2$.
Two torsion-free connections Γ and Γ′ are said to be projectively equivalent if there exists a function τ such that:

$$\Gamma'^{k}_{ij} = \Gamma^k_{ij} + \delta^k_i\, (\partial_j \tau) + \delta^k_j\, (\partial_i \tau),$$

where $\delta^k_i$ is the Kronecker delta. When two connections are projectively equivalent, their corresponding auto-parallel curves have identical shape (i.e., considered as unparameterized curves); these so-called “pre-geodesics” differ only by a change of parameterization τ.

Two torsion-free connections Γ and Γ′ are said to be dual-projectively equivalent if there exists a function τ such that:

$$\Gamma'_{ij,k} = \Gamma_{ij,k} - g_{ij}\, (\partial_k \tau).$$

When two connections are dual-projectively equivalent, then their conjugate connections (with respect to g) have identical pre-geodesics (identical shape).
Recall that when two Riemannian metrics g, g′ are conformally equivalent, i.e., there exists a function τ such that

$$g'_{ij} = e^{2\tau} g_{ij},$$

then their respective Levi-Civita connections $\widehat{\Gamma}'$ and $\widehat{\Gamma}$ are related via

$$\widehat{\Gamma}'_{ij,k} = \widehat{\Gamma}_{ij,k} - (\partial_k \tau)\, g_{ij} + (\partial_j \tau)\, g_{ik} + (\partial_i \tau)\, g_{jk}.$$

(This relation is obtained by directly substituting in the expressions of the corresponding Levi-Civita connections.) This motivates the definition of the more general notion of conformal-projective equivalence of two statistical structures (M, g, Γ) and (M, g′, Γ′), through the existence of two functions ψ, φ such that:

$$g'_{ij} = e^{\psi + \phi} g_{ij} \tag{1.8}$$

$$\Gamma'_{ij,k} = \Gamma_{ij,k} - (\partial_k \psi)\, g_{ij} + (\partial_j \phi)\, g_{ik} + (\partial_i \phi)\, g_{jk}. \tag{1.9}$$

When ψ = const (or φ = const), the corresponding connections are projectively (respectively, dual-projectively) equivalent.

1.2.3 Equiaffine Structure and Parallel Volume Form

For a restrictive set of connections, called “equiaffine” connections, the manifold M may admit, in a unique way, a volume form Ω(x) that is “parallel” under the given connection. Here, a volume form is a skew-symmetric multilinear map from n linearly independent vectors to a non-zero scalar at any point x ∈ M, and “parallel” is in the sense that ∇Ω = 0, or $(\partial_i \Omega)(\partial_1, \ldots, \partial_n) = 0$ where

$$(\partial_i \Omega)(\partial_1, \ldots, \partial_n) \equiv \partial_i(\Omega(\partial_1, \ldots, \partial_n)) - \sum_{l=1}^n \Omega(\ldots, \nabla_{\partial_i} \partial_l, \ldots).$$

Applying (1.2), the equiaffine condition becomes

$$\partial_i(\Omega(\partial_1, \ldots, \partial_n)) = \sum_{l=1}^n \Omega\left(\ldots, \sum_{k=1}^n \Gamma^k_{il}\, \partial_k, \ldots\right) = \sum_{l=1}^n \sum_{k=1}^n \Gamma^k_{il}\, \delta^l_k\, \Omega(\partial_1, \ldots, \partial_n) = \Omega(\partial_1, \ldots, \partial_n) \sum_{l=1}^n \Gamma^l_{il}$$

or

$$\sum_l \Gamma^l_{il}(x) = \frac{\partial \log \Omega(x)}{\partial x^i}. \tag{1.10}$$

Whether or not a connection is equiaffine is related to the so-called Ricci tensor Ric, defined as the contraction of the Riemann curvature tensor R

$$\mathrm{Ric}_{ij}(x) = \sum_k R^k_{ikj}(x). \tag{1.11}$$

For a torsion-free connection $\Gamma^k_{ij} = \Gamma^k_{ji}$, we can verify that

$$\mathrm{Ric}_{ij} - \mathrm{Ric}_{ji} = \frac{\partial}{\partial x^i} \left( \sum_l \Gamma^l_{jl}(x) \right) - \frac{\partial}{\partial x^j} \left( \sum_l \Gamma^l_{il}(x) \right) = \sum_k R^k_{kij}. \tag{1.12}$$

One immediately sees that the existence of a function Ω satisfying (1.10) is equivalent to the right side of (1.12) being identically zero.
Making use of (1.10), it is easy to show that the parallel volume form of a Levi-Civita connection $\widehat{\Gamma}$ is given by

$$\widehat{\Omega}(x) = \sqrt{\det[g_{ij}(x)]}.$$

Making use of (1.7), the parallel volume forms Ω, Ω* associated with Γ and Γ* satisfy (apart from a multiplicative constant which must be positive)

$$\Omega(x)\, \Omega^*(x) = (\widehat{\Omega}(x))^2 = \det[g_{ij}(x)]. \tag{1.13}$$
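Condition (1.10) for the Levi-Civita connection can be checked on a concrete case. The sketch below assumes the unit sphere in coordinates $(\theta, \phi)$, with metric $g = \mathrm{diag}(1, \sin^2\theta)$ and the standard closed-form coefficient $\widehat{\Gamma}^\phi_{\theta\phi} = \cot\theta$ (neither is taken from this chapter), and compares the contracted connection with the derivative of $\log \sqrt{\det g}$. The function names are hypothetical.

```python
import math


def trace_gamma_sphere(theta):
    """sum_l Gamma^l_{theta l} on the unit sphere; only Gamma^phi_{theta phi} = cot(theta) contributes."""
    return math.cos(theta) / math.sin(theta)


def log_volume_sphere(theta):
    """log of the Riemannian volume density sqrt(det g), with g = diag(1, sin^2 theta)."""
    return 0.5 * math.log(math.sin(theta) ** 2)


theta, h = 1.0, 1e-6
lhs = trace_gamma_sphere(theta)
# central difference of log sqrt(det g) with respect to theta
rhs = (log_volume_sphere(theta + h) - log_volume_sphere(theta - h)) / (2.0 * h)
print(lhs, rhs)
```

The two printed numbers should agree up to discretization error, which is the $i = \theta$ component of (1.10); the $i = \phi$ component holds trivially, since both sides vanish.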

The equiaffine condition can also be expressed using a quantity related to the cubic form $C_{ijk}$. We may introduce the Tchebychev form (also known as the first Koszul form), expressed in the local coordinates,

$$T_i = \sum_{j,k} C_{ijk}\, g^{jk}. \tag{1.14}$$

A tedious calculation shows that

$$\frac{\partial T_i}{\partial x^j} - \frac{\partial T_j}{\partial x^i} = \frac{\partial}{\partial x^j} \left( \sum_l \Gamma^l_{li} \right) - \frac{\partial}{\partial x^i} \left( \sum_l \Gamma^l_{lj} \right),$$

the righthand side of (1.12). Therefore, an equivalent requirement for equiaffine structure is that the Tchebychev 1-form T is “closed”:

$$\frac{\partial T_i}{\partial x^j} = \frac{\partial T_j}{\partial x^i}. \tag{1.15}$$

This expresses the integrability condition. When Eq. (1.15) is satisfied, there exists a function τ such that $T_i = \partial_i \tau$. Furthermore, it can be shown that

$$\tau = -2 \log(\Omega / \widehat{\Omega}).$$

Proposition 1 ([13, 25]) The necessary and sufficient condition for a torsion-free connection ∇ to be equiaffine is for any of the following to hold:
1. There exists a ∇-parallel volume element Ω: ∇Ω = 0.
2. Ricci tensor of ∇ is symmetric: $\mathrm{Ric}_{ij} = \mathrm{Ric}_{ji}$.
3. Curvature tensor $\sum_k R^k_{kij} = 0$.
4. The Tchebychev 1-form T is closed, dT = 0.
5. There exists a function τ, called Tchebychev potential, such that $T_i = \partial_i \tau$.

It is known that the Ricci tensor of the Levi-Civita connection is always symmetric; this is why the Riemannian volume form $\widehat{\Omega}$ always exists.

1.2.4 α-Structure and α-Hessian Structure

On a statistical manifold, one can define a one-parameter family of affine connections $\Gamma^{(\alpha)}$, called “α-connections” (α ∈ ℝ):

$$\Gamma^{(\alpha)k}_{ij} = \frac{1 + \alpha}{2}\, \Gamma^k_{ij} + \frac{1 - \alpha}{2}\, \Gamma^{*k}_{ij}. \tag{1.16}$$

Obviously, $\Gamma^{(0)} = \widehat{\Gamma}$ is the Levi-Civita connection. Using cubic form, this amounts to $\nabla^{(\alpha)} g = \alpha C$. The α-parallel volume element is given by:

$$\Omega^{(\alpha)} = e^{-\frac{\alpha}{2} \tau}\, \widehat{\Omega}$$

where τ is the Tchebychev potential. The Riemannian volume element $\widehat{\Omega}$ is only parallel with respect to the Levi-Civita connection $\widehat{\nabla}$ of g, that is, $\widehat{\nabla} \widehat{\Omega} = 0$, but not other α-connections (α ≠ 0). Rather, $\nabla^{(\alpha)} \Omega^{(\alpha)} = 0$.
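A one-dimensional sketch makes the interpolation in (1.16) concrete. It assumes a metric $g = \Phi''$ generated by the convex function $\Phi(x) = \log(1 + e^x)$ and a coordinate x that is affine for ∇, so that Γ = 0; both choices are illustrative assumptions, not statements from this section. With Γ = 0, relation (1.7) gives the single conjugate coefficient $\Gamma^*_{11,1} = \partial g / \partial x = \Phi'''$. The function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def metric(x):
    """g = Phi''(x) for Phi(x) = log(1 + exp(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


def d_metric(x):
    """dg/dx = Phi'''(x)."""
    s = sigmoid(x)
    return s * (1.0 - s) * (1.0 - 2.0 * s)


def alpha_connection(x, alpha):
    """Covariant coefficient Gamma^(alpha)_{11,1}: the index-lowered form of (1.16) in one dimension."""
    gamma = 0.0                          # assumption: x is an affine coordinate for nabla
    gamma_star = d_metric(x) - gamma     # from (1.7): dg/dx = Gamma_{11,1} + Gamma*_{11,1}
    return 0.5 * (1.0 + alpha) * gamma + 0.5 * (1.0 - alpha) * gamma_star


def levi_civita(x):
    """Christoffel symbol of the first kind in one dimension: (1/2) dg/dx."""
    return 0.5 * d_metric(x)


def nabla_alpha_g(x, alpha):
    """(nabla^(alpha) g)_{111} = dg/dx - 2 Gamma^(alpha)_{11,1}."""
    return d_metric(x) - 2.0 * alpha_connection(x, alpha)


x0 = 0.3
cubic = nabla_alpha_g(x0, 1.0)           # C_111 = (nabla g)_111 for alpha = 1
print(alpha_connection(x0, 0.0), levi_civita(x0))
print(nabla_alpha_g(x0, 0.5), 0.5 * cubic)
```

Under these assumptions, α = 0 reproduces the Levi-Civita coefficient, and the covariant derivative of g under the α-connection equals α times the cubic form, in line with the statements following (1.16).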
It can be further shown that the curvatures $R_{lkij}$, $R^*_{lkij}$ for the pair of conjugate connections Γ, Γ* satisfy

$$R_{lkij} = R^*_{lkij}.$$

So, Γ is flat if and only if Γ* is flat. In this case, the manifold is said to be “dually flat”. When Γ, Γ* are dually flat, then $\Gamma^{(\alpha)}$ is called “α-transitively flat” [21]. In such case, $\{M, g, \Gamma^{(\alpha)}, \Gamma^{(-\alpha)}\}$ is called an α-Hessian structure [26]. They are all compatible with a metric g that is induced from a strictly convex (potential) function, see next subsection.

For an α-Hessian manifold, the Tchebychev form (1.14) is given by

$$T_i = \frac{\partial \log(\det[g_{kl}])}{\partial x^i}$$

and its derivative (known as the second Koszul form) is

$$\beta_{ij} = \frac{\partial T_i}{\partial x^j} = \frac{\partial^2 \log(\det[g_{kl}])}{\partial x^i \partial x^j}.$$

1.2.5 Biorthogonal Coordinates

A key feature for α-Hessian manifolds is biorthogonal coordinates, as we shall discuss now. They are the “best” coordinates one can have when the Riemannian metric is non-trivial.

Consider coordinate transform $x \mapsto u$,

$$\partial^i \equiv \frac{\partial}{\partial u_i} = \sum_l \frac{\partial x^l}{\partial u_i} \frac{\partial}{\partial x^l} = \sum_l F^{li}\, \partial_l$$

where the Jacobian matrix F is given by

$$F_{ij}(x) = \frac{\partial u_i}{\partial x^j}, \qquad F^{ij}(u) = \frac{\partial x^i}{\partial u_j}, \qquad \sum_l F_{il}\, F^{lj} = \delta_i^j \tag{1.17}$$

where $\delta_i^j$ is Kronecker delta (taking the value of 1 when i = j and 0 otherwise). If the new coordinate system $u = [u_1, \ldots, u_n]$ (with components expressed by subscripts) is such that

$$F_{ij}(x) = g_{ij}(x), \tag{1.18}$$
then the x-coordinate system and the u-coordinate system are said to be “biorthogonal” to each other since, from the definition of metric tensor (1.1),

$$g(\partial_i, \partial^j) = g\Big(\partial_i, \sum_l F^{lj}\, \partial_l\Big) = \sum_l F^{lj}\, g(\partial_i, \partial_l) = \sum_l F^{lj}\, g_{il} = \delta_i^j.$$

In such case, denote

$$g^{ij}(u) = g(\partial^i, \partial^j), \tag{1.19}$$

which equals $F^{ij}$, the Jacobian of the inverse coordinate transform $u \mapsto x$. Also introduce the (contravariant version) of the affine connection Γ under u-coordinate and denote it by an unconventional notation $\Gamma^{rs}_t$ defined by

$$\nabla_{\partial^r} \partial^s = \sum_t \Gamma^{rs}_t\, \partial^t;$$

similarly $\Gamma^{*rs}_t$ is defined via

$$\nabla^*_{\partial^r} \partial^s = \sum_t \Gamma^{*rs}_t\, \partial^t.$$

The covariant version of the affine connections will be denoted by superscripted Γ and Γ*:

$$\Gamma^{ij,k}(u) = g(\nabla_{\partial^i} \partial^j, \partial^k), \qquad \Gamma^{*ij,k}(u) = g(\nabla^*_{\partial^i} \partial^j, \partial^k). \tag{1.20}$$

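The affine connections in u-coordinates (expressed in superscript) and in x-coordinates (expressed in subscript) are related via

$$\Gamma_t^{rs}(u) = \sum_k \left( \sum_{i,j} \frac{\partial x^r}{\partial u_i} \frac{\partial x^s}{\partial u_j} \Gamma_{ij}^k(x) + \frac{\partial^2 x^k}{\partial u_r \partial u_s} \right) \frac{\partial u_k}{\partial x^t} \tag{1.21}$$

and

$$\Gamma^{rs,t}(u) = \sum_{i,j,k} \frac{\partial x^r}{\partial u_i} \frac{\partial x^s}{\partial u_j} \frac{\partial x^t}{\partial u_k} \Gamma_{ij,k}(x) + \frac{\partial^2 x^t}{\partial u_r \partial u_s}. \tag{1.22}$$

Similar relations hold between $\Gamma_t^{*rs}(u)$ and $\Gamma_{ij}^{*k}(x)$, and between $\Gamma^{*rs,t}(u)$ and $\Gamma^*_{ij,k}(x)$.
In analogy to (1.7), we have the following identity:

$$\frac{\partial^2 x^t}{\partial u_s \partial u_r} = \frac{\partial g^{rt}(u)}{\partial u_s} = \Gamma^{rs,t}(u) + \Gamma^{*ts,r}(u).$$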

Therefore, we have

Proposition 2 Under biorthogonal coordinates, a pair of conjugate connections $\Gamma, \Gamma^*$ satisfy

$$\Gamma^{*ts,r}(u) = -\sum_{i,j,k} g^{ir}(u)\, g^{js}(u)\, g^{kt}(u)\, \Gamma_{ij,k}(x) \tag{1.23}$$

and

$$\Gamma_r^{*ts}(u) = -\sum_j g^{js}(u)\, \Gamma_{jr}^{t}(x). \tag{1.24}$$

Let us now express the parallel volume forms $\Omega(x)$, $\Omega(u)$ under biorthogonal coordinates x or u. Contracting the indices t with r in (1.24), and invoking (1.10), we obtain

$$\frac{\partial \log \Omega^*(u)}{\partial u_s} + \sum_j \frac{\partial x^j}{\partial u_s} \frac{\partial \log \Omega(x)}{\partial x^j} = \frac{\partial \log \Omega^*(u)}{\partial u_s} + \frac{\partial \log \Omega(x)}{\partial u_s} = 0.$$

After integration,

$$\Omega^*(u)\, \Omega(x) = \text{const}. \tag{1.25}$$

From (1.13) and (1.25),

$$\Omega(u)\, \Omega^*(x) = \text{const}. \tag{1.26}$$

The relations (1.25) and (1.26) indicate that the volume forms of the pair of conjugate connections, when expressed in biorthogonal coordinates respectively, are inversely proportional to each other.
The $\Gamma^{(\alpha)}$-parallel volume element $\Omega^{(\alpha)}$ can be shown to be given by (in either x or u coordinates)

$$\Omega^{(\alpha)} = \Omega^{\frac{1+\alpha}{2}} (\Omega^*)^{\frac{1-\alpha}{2}}.$$

Clearly,

$$\Omega^{(\alpha)}(x)\, \Omega^{(-\alpha)}(x) = \det[g_{ij}(x)] \longleftrightarrow \Omega^{(\alpha)}(u)\, \Omega^{(-\alpha)}(u) = \det[g^{ij}(u)].$$

1.2.6 Existence of Biorthogonal Coordinates


From its definition (1.18), we can easily show that

Proposition 3 A Riemannian manifold with metric $g_{ij}$ admits biorthogonal coordinates if and only if $\partial g_{ij} / \partial x^k$ is totally symmetric:

$$\frac{\partial g_{ij}(x)}{\partial x^k} = \frac{\partial g_{ik}(x)}{\partial x^j}. \tag{1.27}$$

That (1.27) is satisfied for biorthogonal coordinates is evident by virtue of (1.17) and (1.18). Conversely, given (1.27), there must be n functions $u_i(x)$, $i = 1, 2, \ldots, n$, such that

$$\frac{\partial u_i(x)}{\partial x^j} = g_{ij}(x) = g_{ji}(x) = \frac{\partial u_j(x)}{\partial x^i}.$$

The above identity implies that there exists a function $\Phi$ such that $u_i = \partial_i \Phi$ and, by positive definiteness of $g_{ij}$, $\Phi$ must be a strictly convex function. In this case, the x- and u-variables satisfy (1.37), and the pair of convex functions, $\Phi$ and its conjugate $\widetilde{\Phi}$, are related to $g_{ij}$ and $g^{ij}$ by

$$g_{ij}(x) = \frac{\partial^2 \Phi(x)}{\partial x^i \partial x^j} \longleftrightarrow g^{ij}(u) = \frac{\partial^2 \widetilde{\Phi}(u)}{\partial u_i \partial u_j}.$$

It follows from the above proposition that a necessary and sufficient condition for a Riemannian manifold to admit biorthogonal coordinates is that its Levi-Civita connection is given by

$$\widehat{\Gamma}_{ij,k}(x) \equiv \frac{1}{2} \left( \frac{\partial g_{ik}}{\partial x^j} + \frac{\partial g_{jk}}{\partial x^i} - \frac{\partial g_{ij}}{\partial x^k} \right) = \frac{1}{2} \frac{\partial g_{ij}}{\partial x^k}.$$

From this, the following can be shown:

Proposition 4 A Riemannian manifold $\{M, g\}$ admits a pair of biorthogonal coordinates x and u if and only if there exists a pair of conjugate connections $\gamma$ and $\gamma^*$ such that $\gamma_{ij,k}(x) = 0$, $\gamma^{*rs,t}(u) = 0$.

In other words, biorthogonal coordinates are affine coordinates for the dually-flat pair of connections. In fact, we can now define a pair of torsion-free connections by

$$\gamma_{ij,k}(x) = 0, \qquad \gamma^*_{ij,k}(x) = \frac{\partial g_{ij}}{\partial x^k}$$

and show that they are conjugate with respect to g, that is, they satisfy (1.6). This is to say that we select an affine connection $\gamma$ such that x is its affine coordinate. From (1.22), when $\gamma^*$ is expressed in u-coordinates,

$$\begin{aligned}
\gamma^{*rs,t}(u) &= \sum_{i,j,k} g^{ir}(u)\, g^{js}(u)\, \frac{\partial x^k}{\partial u_t} \frac{\partial g_{ij}(x)}{\partial x^k} + \frac{\partial g^{ts}(u)}{\partial u_r} \\
&= \sum_{i,j} g^{ir}(u) \left( -g_{ij}(x) \frac{\partial g^{js}(u)}{\partial u_t} \right) + \frac{\partial g^{ts}(u)}{\partial u_r} \\
&= -\sum_j \delta_j^r \frac{\partial g^{js}(u)}{\partial u_t} + \frac{\partial g^{ts}(u)}{\partial u_r} = 0.
\end{aligned}$$

This implies that u is an affine coordinate system with respect to $\gamma^*$. Therefore, biorthogonal coordinates are affine coordinates for a pair of dually-flat connections.
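
As a small numerical illustration of (1.17) and (1.18), the following sketch takes the potential $\Phi(x) = \log(1 + e^{x^1} + e^{x^2})$ (our choice for illustration, not one used in the text), sets $u_i = \partial\Phi/\partial x^i$, and checks that the Hessian of $\Phi$ in x and the Jacobian of the inverse map $u \mapsto x$ are matrix inverses of each other. The function names are hypothetical.

```python
import math

def grad_phi(x):
    """u_i = dPhi/dx^i for the illustrative Phi(x) = log(1 + exp(x^1) + exp(x^2))."""
    z = 1.0 + math.exp(x[0]) + math.exp(x[1])
    return [math.exp(x[0]) / z, math.exp(x[1]) / z]

def metric_x(x):
    """g_ij(x) = d^2 Phi / dx^i dx^j = u_i * delta_ij - u_i * u_j."""
    u = grad_phi(x)
    return [[u[i] * (i == j) - u[i] * u[j] for j in range(2)] for i in range(2)]

def metric_u(u):
    """g^ij(u) = dx^i/du_j for the inverse map x^i = log(u_i / (1 - u_1 - u_2))."""
    u0 = 1.0 - u[0] - u[1]
    return [[(i == j) / u[i] + 1.0 / u0 for j in range(2)] for i in range(2)]

def matmul(a, b):
    """Product of two 2 x 2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

x = [0.3, -1.2]
u = grad_phi(x)
# Under biorthogonal coordinates this product should be the identity, as in (1.17).
product = matmul(metric_x(x), metric_u(u))
```

Here `product` agrees with the 2 × 2 identity up to rounding, which is the statement $\sum_l F_{il} F^{lj} = \delta_i^j$ with $F_{ij} = g_{ij}$.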

1.2.7 Symplectic, Complex, and Kähler Structures

Symplectic structure on a manifold refers to the existence of a closed, non-degenerate 2-tensor, i.e., a skew-symmetric bilinear map $\omega: W \times W \to \mathbb{R}$, with $\omega(X, Y) = -\omega(Y, X)$ for all $X, Y \in W \subseteq \mathbb{R}^{2n}$. For $\omega$ to be well-defined, the vector space W is required to be orientable and even-dimensional. In this case, there exists a basis $\{e_1, \ldots, e_n, f_1, \ldots, f_n\}$ of W, $\dim(W) = 2n$, such that $\omega(e_i, e_j) = 0$, $\omega(f_i, f_j) = 0$, $\omega(e_i, f_j) = \delta_{ij}$ for all indices i, j taking values in $1, \ldots, n$.
Symplectic structure is closely related to inner-product structure (the existence of
a positive-definite symmetric bilinear map G: W × W → R) and complex structure
(linear mapping J: W → W such that J 2 = −Id) on an even-dimensional vector
space W . The complex structure J on W is said to be compatible with a symplectic
structure ω if ω(JX, JY ) = ω(X, Y ) (symplectomorphism condition) and ω(X, JY ) >
0 (taming condition) for any X, Y ∈ W . With ω, J given, G(X, Y ) ≡ ω(X, JY ) can
be shown to be symmetric and positive-definite, and hence an inner-product on W .
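
The relation $G(X, Y) \equiv \omega(X, JY)$ can be checked directly in the simplest case $W = \mathbb{R}^2$ (n = 1). The sketch below assumes the standard basis $(e, f)$ with $\omega(e, f) = 1$, $Je = f$, $Jf = -e$; the function names are illustrative only.

```python
def omega(X, Y):
    """Standard symplectic form on R^2: omega(e, e) = omega(f, f) = 0, omega(e, f) = 1."""
    return X[0] * Y[1] - X[1] * Y[0]

def J(X):
    """Complex structure with J e = f and J f = -e, so that J(J(X)) = -X."""
    return (-X[1], X[0])

def G(X, Y):
    """Candidate inner product G(X, Y) = omega(X, J Y)."""
    return omega(X, J(Y))

X, Y = (1.5, -2.0), (0.25, 3.0)

j_squared = J(J(X))                    # equals (-X[0], -X[1])
compatible = omega(J(X), J(Y))         # equals omega(X, Y)
symmetric_pair = (G(X, Y), G(Y, X))    # the two entries coincide
norm_squared = G(X, X)                 # positive for X != 0
```

With these choices $G$ reduces to the Euclidean dot product, so symmetry and positive-definiteness are immediate.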
The cotangent bundle $T^*M$ of any manifold M admits a canonical symplectic form written as

$$\omega = \sum_{i=1}^n dx^i \wedge dp_i,$$

where $(x^1, \ldots, x^n, p_1, \ldots, p_n)$ are coordinates of $T^*M$. That $\omega$ is closed can be shown by the existence of the tautological (or Liouville) 1-form

$$\alpha = \sum_{i=1}^n p_i\, dx^i$$

(which can be checked to be coordinate-independent on $T^*M$) and then verifying $\omega = -d\alpha$. Hence, $\omega$ is also coordinate-independent. Denote $\partial_i = \partial/\partial x^i$, $\tilde{\partial}_i = \partial/\partial p_i$ as the basis of the tangent bundle $TM$; then

$$\omega(\partial_i, \partial_j) = \omega(\tilde{\partial}_i, \tilde{\partial}_j) = 0; \qquad \omega(\partial_i, \tilde{\partial}_j) = -\omega(\tilde{\partial}_j, \partial_i) = \omega_{ij}. \tag{1.28}$$

That is, when viewed as 2 × 2 blocks of n × n matrices, $\omega$ vanishes on the diagonal blocks and has non-zero entries $\omega_{ij}$ and $-\omega_{ij}$ only on the off-diagonal blocks.
The aforementioned linear map J of the tangent space $T_xM \simeq W$ at any point $x \in M$,

$$J: \quad J\partial_i = \tilde{\partial}_i, \qquad J\tilde{\partial}_i = -\partial_i,$$

gives rise to an “almost complex structure” on $T_xM$. For $TM$ to be complex, that is, admitting complex coordinates, an integrability condition needs to be imposed for the J-maps at various base points x of M, and hence at various tangent spaces $T_xM$, to be “compatible” with one another. The condition is that the so-called Nijenhuis tensor N,

$$N(X, Y) = [JX, JY] - J[X, JY] - J[JX, Y] - [X, Y],$$

must vanish for arbitrary tangent vector fields X, Y.


The Riemannian metric tensor G on $TM$ compatible with $\omega$ has the form

$$\begin{aligned}
G_{ij'} &\equiv G(\partial_i, \tilde{\partial}_j) = 0; \\
G_{i'j} &\equiv G(\tilde{\partial}_i, \partial_j) = 0; \\
G_{ij} &\equiv G(\partial_i, \partial_j) = g_{ij}; \\
G_{i'j'} &\equiv G(\tilde{\partial}_i, \tilde{\partial}_j) = g_{ij},
\end{aligned}$$

where $i' = n + i$, $j' = n + j$ and i, j take values in $1, \ldots, n$. When viewed as 2 × 2 blocks of n × n matrices, G vanishes on the off-diagonal blocks and has non-zero entries $g_{ij}$ only on the two diagonal blocks. Such a metric on $TM$ is in the form of
Sasaki metric, which can also result from an appropriate “lift” of the Riemannian
metric on M into T M, via an affine connection on T M which induces a splitting
of T T M, the tangent bundle of T M as the base manifold. We omit the technical
details here, but refer interested readers to Yano and Ishihara [22] and, in the context
of statistical manifold, to Matsuzoe and Inoguchi [12].
It is a basic conclusion in symplectic geometry that for any symplectic form, there exists a compatible almost complex structure J. Along with the Riemannian metric, the three structures $(G, \omega, J)$ are said to form a compatible triple if any two give rise to the third one. When a manifold has a compatible triple $(G, \omega, J)$ in which J is integrable, it is called a Kähler manifold. On a Kähler manifold, using complex coordinates, the metric $\tilde{G}$ associated with the complex line-element

$$ds^2 = \tilde{G}_{ij}\, dz^i\, d\bar{z}^j$$

is given by

$$\tilde{G}_{ij}(z, \bar{z}) = \frac{\partial^2 \Phi}{\partial z^i \partial \bar{z}^j}.$$

Here the real-valued function Φ (of complex variables) is called the “Kähler poten-
tial”.
It is known that the tangent bundle T M of a manifold M with a flat connection
on it admits a complex structure [7]. As [18] pointed out, Hessian manifold can be
seen as the “real” Kähler manifold.

Proposition 5 ([7]) $(M, g, \nabla)$ is a Hessian manifold if and only if $(TM, J, G)$ is a Kähler manifold, where G is the Sasaki lift of g.

1.3 Divergence Functions and Induced Structures

1.3.1 Statistical Structure Induced on TM

A divergence function $D: M \times M \to \mathbb{R}_{\geq 0}$ on a manifold M under a local chart $V \subseteq \mathbb{R}^n$ is defined as a smooth function (differentiable up to third order) which satisfies
(i) $D(x, y) \geq 0$ for all $x, y \in V$, with equality holding if and only if $x = y$;
(ii) $D_i(x, x) = D_{,j}(x, x) = 0$ for all $i, j \in \{1, 2, \ldots, n\}$;
(iii) $-D_{i,j}(x, x)$ is positive definite.
Here $D_i(x, y) = \partial_{x^i} D(x, y)$ and $D_{,i}(x, y) = \partial_{y^i} D(x, y)$ denote partial derivatives with respect to the i-th component of point x and of point y, respectively, $D_{i,j}(x, y) = \partial_{x^i} \partial_{y^j} D(x, y)$ the second-order mixed derivative, etc.
On a manifold, divergence functions act as “pseudo-distance” functions that are
non-negative but need not be symmetric. That dualistic Riemannian manifold struc-
ture (i.e., statistical structure) can be induced from a divergence function was first
demonstrated by S. Eguchi.

Proposition 6 ([8, 9]) A divergence function D induces a Riemannian metric g and a pair of torsion-free conjugate connections $\Gamma, \Gamma^*$ given as

$$\begin{aligned}
g_{ij}(x) &= -D_{i,j}(x, y)\big|_{x=y}; \\
\Gamma_{ij,k}(x) &= -D_{ij,k}(x, y)\big|_{x=y}; \\
\Gamma^*_{ij,k}(x) &= -D_{k,ij}(x, y)\big|_{x=y}.
\end{aligned}$$

It is easily verifiable that $\Gamma_{ij,k}$, $\Gamma^*_{ij,k}$ as given above are torsion-free³ and satisfy the conjugacy condition with respect to the induced metric $g_{ij}$. Hence $\{M, g, \Gamma, \Gamma^*\}$ as induced is a “statistical manifold” [10].
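
The formulas of Proposition 6 can be tried out numerically in one dimension. The sketch below uses central finite differences (our choice of method, not part of the original construction) on the illustrative divergence $D(x, y) = x \log(x/y) - x + y$ on $(0, \infty)$, which is the Bregman divergence of $\Phi(t) = t \log t - t$. The helper `mixed_partial` and the step size are assumptions made for this example.

```python
import math

# Central-difference stencils (weights before dividing by the power of h).
STENCILS = {
    1: {-1: -0.5, 1: 0.5},           # first derivative
    2: {-1: 1.0, 0: -2.0, 1: 1.0},   # second derivative
}

def mixed_partial(f, x, y, nx, ny, h=1e-2):
    """Estimate d^nx/dx^nx d^ny/dy^ny f(x, y) for nx, ny in {1, 2}."""
    total = 0.0
    for i, wi in STENCILS[nx].items():
        for j, wj in STENCILS[ny].items():
            total += wi * wj * f(x + i * h, y + j * h)
    return total / h ** (nx + ny)

def D(x, y):
    """Illustrative divergence: Bregman divergence of Phi(t) = t*log(t) - t."""
    return x * math.log(x / y) - x + y

p = 2.0
g = -mixed_partial(D, p, p, 1, 1)            # -D_{i,j} on the diagonal
gamma = -mixed_partial(D, p, p, 2, 1)        # -D_{ij,k} on the diagonal
gamma_star = -mixed_partial(D, p, p, 1, 2)   # -D_{k,ij} on the diagonal
```

For this choice one expects $g \approx 1/p$, $\Gamma \approx 0$ and $\Gamma^* \approx -1/p^2$, up to discretization error; that is, the metric is the second derivative of $\Phi$ and one of the two connections vanishes in the x-coordinate.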
A natural question is whether/how the statistical structures induced from different
divergence functions are related. The following is known:

Proposition 7 ([14]) Let D be a divergence function and $\psi, \phi$ be two arbitrary functions. If $D'(x, y) = e^{\psi(x) + \phi(y)} D(x, y)$, then $D'(x, y)$ is also a divergence function, and the $(M, g', \Gamma')$ induced from $D'(x, y)$ and the $(M, g, \Gamma)$ induced from $D(x, y)$ are conformally-projectively equivalent. In particular, when $\phi = \text{const}$, then $\Gamma'$ and $\Gamma$ are projectively equivalent; when $\psi = \text{const}$, then $\Gamma'$ and $\Gamma$ are dual-projectively equivalent.

³ Conjugate connections which admit torsion have been recently studied by Calin et al. [5] and Matsuzoe [15].

1.3.2 Symplectic Structure Induced on M × M

A divergence function D is given as a bi-variable function on M (of dimension n). We now view it as a (single-variable) function on $M \times M$ (of dimension 2n) that assumes zero value along the diagonal $\Delta_M \subset M \times M$. In this subsection, we investigate the condition under which a divergence function can serve as a “generating function” of a symplectic structure on $M \times M$. A compatible metric on $M \times M$ will also be derived.
First, we fix a particular y or a particular x in $M \times M$; this results in two n-dimensional submanifolds of $M \times M$ that will be denoted, respectively, $M_x \simeq M$ (with y point fixed) and $M_y \simeq M$ (with x point fixed). Let us write out the canonical symplectic form $\omega_x$ on the cotangent bundle $T^*M_x$ given by

$$\omega_x = dx^i \wedge d\xi_i.$$

Given D, we define a map $L_D$ from $M \times M \to T^*M_x$, $(x, y) \mapsto (x, \xi)$, given by

$$L_D: (x, y) \mapsto (x, D_i(x, y)\, dx^i).$$

(Recall that the comma separates the variable being in the first slot versus the second slot for differentiation.) It is easy to check that in a neighborhood of the diagonal $\Delta_M \subset M \times M$, the map $L_D$ is a diffeomorphism since the Jacobian matrix of the map

$$\begin{pmatrix} \delta_{ij} & D_{ij} \\ 0 & D_{i,j} \end{pmatrix}$$

is non-degenerate in such a neighborhood of the diagonal $\Delta_M$.
We calculate the pullback of this symplectic form (defined on $T^*M_x$) to $M \times M$:

$$\begin{aligned}
L_D^* \omega_x &= L_D^* (dx^i \wedge d\xi_i) = dx^i \wedge dD_i(x, y) \\
&= dx^i \wedge (D_{ij}(x, y)\, dx^j + D_{i,j}\, dy^j) = D_{i,j}(x, y)\, dx^i \wedge dy^j.
\end{aligned}$$

(Here $D_{ij}\, dx^i \wedge dx^j = 0$ since $D_{ij}(x, y) = D_{ji}(x, y)$ always holds.)


Similarly, we consider the canonical symplectic form $\omega_y = dy^i \wedge d\eta_i$ on $T^*M_y$ and define a map $R_D$ from $M \times M \to T^*M_y$, $(x, y) \mapsto (y, \eta)$, given by

$$R_D: (x, y) \mapsto (y, D_{,i}(x, y)\, dy^i).$$

Using $R_D$ to pull back $\omega_y$ to $M \times M$ yields an analogous formula:

$$R_D^* \omega_y = -D_{i,j}(x, y)\, dx^i \wedge dy^j.$$

Therefore, based on the canonical symplectic forms on $T^*M_x$ and $T^*M_y$, we obtain the same symplectic form on $M \times M$ (up to sign):

$$\omega_D(x, y) = -D_{i,j}(x, y)\, dx^i \wedge dy^j. \tag{1.29}$$

Proposition 8 A divergence function D induces a symplectic form $\omega_D$ (1.29) on $M \times M$ which is the pullback of the canonical symplectic forms $\omega_x$ and $\omega_y$ by the maps $L_D$ and $R_D$:

$$L_D^* \omega_x = D_{i,j}(x, y)\, dx^i \wedge dy^j = -R_D^* \omega_y. \tag{1.30}$$

It was Barndorff-Nielsen and Jupp [3] who first proposed (1.29) as an induced symplectic form on $M \times M$, apart from a minus sign; the divergence function D was called a “yoke”. As an example, the Bregman divergence $B_\Phi$ (given by (1.33) below) induces the symplectic form $\sum \Phi_{ij}\, dx^i \wedge dy^j$.

1.3.3 Almost Complex Structure and Hermite Metric on M × M

An almost complex structure J on $M \times M$ is defined by a vector bundle isomorphism (from $T(M \times M)$ to itself), with the property that $J^2 = -\mathrm{Id}$. Requiring J to be compatible with $\omega_D$, that is,

$$\omega_D(JX, JY) = \omega_D(X, Y), \quad \forall X, Y \in T_{(x,y)}(M \times M),$$

we may obtain a constraint on the divergence function D. From

$$\omega_D\Big(\frac{\partial}{\partial x^i}, \frac{\partial}{\partial y^j}\Big) = \omega_D\Big(J\frac{\partial}{\partial x^i}, J\frac{\partial}{\partial y^j}\Big) = \omega_D\Big(\frac{\partial}{\partial y^i}, -\frac{\partial}{\partial x^j}\Big) = \omega_D\Big(\frac{\partial}{\partial x^j}, \frac{\partial}{\partial y^i}\Big),$$

we require

$$D_{i,j} = D_{j,i}, \tag{1.31}$$

or explicitly

$$\frac{\partial^2 D}{\partial x^i \partial y^j} = \frac{\partial^2 D}{\partial x^j \partial y^i}.$$

Note that this condition is always satisfied on $\Delta_M$, by the definition of a divergence function D, which has allowed us to define a Riemannian structure on $\Delta_M$ (Proposition 6). We now require it to be satisfied on $M \times M$ (at least in a neighborhood of $\Delta_M$).
For divergence functions satisfying (1.31), we can consider inducing a metric $G_D$ on $M \times M$; the induced Riemannian (Hermitian) metric $G_D$ is defined by

$$G_D(X, Y) = \omega_D(X, JY).$$
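It is easy to verify that $G_D$ is invariant under the almost complex structure J. The metric components are given by:

$$\begin{aligned}
G_{ij} &= G_D\Big(\frac{\partial}{\partial x^i}, \frac{\partial}{\partial x^j}\Big) = \omega_D\Big(\frac{\partial}{\partial x^i}, J\frac{\partial}{\partial x^j}\Big) = \omega_D\Big(\frac{\partial}{\partial x^i}, \frac{\partial}{\partial y^j}\Big) = -D_{i,j}, \\
G_{i'j'} &= G_D\Big(\frac{\partial}{\partial y^i}, \frac{\partial}{\partial y^j}\Big) = \omega_D\Big(\frac{\partial}{\partial y^i}, J\frac{\partial}{\partial y^j}\Big) = \omega_D\Big(\frac{\partial}{\partial y^i}, -\frac{\partial}{\partial x^j}\Big) = -D_{j,i}, \\
G_{ij'} &= G_D\Big(\frac{\partial}{\partial x^i}, \frac{\partial}{\partial y^j}\Big) = \omega_D\Big(\frac{\partial}{\partial x^i}, J\frac{\partial}{\partial y^j}\Big) = \omega_D\Big(\frac{\partial}{\partial x^i}, -\frac{\partial}{\partial x^j}\Big) = 0, \\
G_{i'j} &= G_D\Big(\frac{\partial}{\partial y^i}, \frac{\partial}{\partial x^j}\Big) = \omega_D\Big(\frac{\partial}{\partial y^i}, J\frac{\partial}{\partial x^j}\Big) = \omega_D\Big(\frac{\partial}{\partial y^i}, \frac{\partial}{\partial y^j}\Big) = 0.
\end{aligned}$$

So the desired Riemannian metric on $M \times M$ is

$$G_D = -D_{i,j} \left( dx^i\, dx^j + dy^i\, dy^j \right).$$

So for $G_D$ to be a Riemannian metric, we require $-D_{i,j}$ to be positive-definite.
We call a divergence function D proper if and only if $-D_{i,j}$ is symmetric and positive-definite on $M \times M$. Just as any divergence function induces a Riemannian structure on the diagonal manifold $\Delta_M$ of $M \times M$, any proper divergence function induces a Riemannian structure on $M \times M$ that is compatible with the symplectic structure $\omega_D$ on it.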

1.3.4 Complexification and Kähler Structure on M × M

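We now discuss the possible existence of a Kähler structure on the product manifold $M \times M$. By definition,

$$\begin{aligned}
ds^2 &= G_D - \sqrt{-1}\, \omega_D \\
&= -D_{i,j} \left( dx^i \otimes dx^j + dy^i \otimes dy^j \right) + \sqrt{-1}\, D_{i,j} \left( dx^i \otimes dy^j - dy^i \otimes dx^j \right) \\
&= -D_{i,j} \left( dx^i + \sqrt{-1}\, dy^i \right) \otimes \left( dx^j - \sqrt{-1}\, dy^j \right) = -D_{i,j}\, dz^i \otimes d\bar{z}^j.
\end{aligned}$$

Now introduce complex coordinates $z = x + \sqrt{-1}\, y$,

$$D(x, y) = D\left( \frac{z + \bar{z}}{2}, \frac{z - \bar{z}}{2\sqrt{-1}} \right) \equiv \widehat{D}(z, \bar{z}),$$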

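so

$$\frac{\partial^2 D}{\partial z^i \partial \bar{z}^j} = \frac{1}{4} (D_{ij} + D_{,ij}) = \frac{1}{2} \frac{\partial^2 \widehat{D}}{\partial z^i \partial \bar{z}^j}.$$

If D satisfies

$$D_{ij} + D_{,ij} = \kappa D_{i,j} \tag{1.32}$$

where $\kappa$ is a constant, then $M \times M$ admits $\widehat{D}$ as a Kähler potential (and hence is a Kähler manifold):

$$ds^2 = \frac{\kappa}{2} \frac{\partial^2 \widehat{D}}{\partial z^i \partial \bar{z}^j}\, dz^i \otimes d\bar{z}^j.$$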

1.3.5 Canonical Divergence for Hessian Manifold

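On a dually flat (i.e., Hessian) manifold, there is a canonical divergence as shown below. Recall that the Hessian metric

$$g_{ij}(x) = \frac{\partial^2 \Phi(x)}{\partial x^i \partial x^j}$$

and the dual connections

$$\Gamma_{ij,k}(x) = 0, \qquad \Gamma^*_{ij,k}(x) = \frac{\partial^3 \Phi(x)}{\partial x^i \partial x^j \partial x^k}$$

are induced from a convex potential function $\Phi$. In the (biorthogonal) u-coordinates, these geometric quantities can be expressed as

$$g^{ij}(u) = \frac{\partial^2 \widetilde{\Phi}(u)}{\partial u_i \partial u_j}, \qquad \Gamma^{*ij,k}(u) = 0, \qquad \Gamma^{ij,k}(u) = \frac{\partial^3 \widetilde{\Phi}(u)}{\partial u_i \partial u_j \partial u_k},$$

where $\widetilde{\Phi}$ is the convex conjugate of $\Phi$.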


Integrating the Hessian structure reveals the so-called Bregman divergence $B_\Phi(x, y)$ [4] as the generating function:

$$B_\Phi(x, y) = \Phi(x) - \Phi(y) - \langle x - y, \partial\Phi(y) \rangle \tag{1.33}$$

where $\partial\Phi = [\partial_1 \Phi, \ldots, \partial_n \Phi]$ with $\partial_i \equiv \partial/\partial x^i$ denotes the gradient valued in the co-vector space $\widetilde{\mathbb{R}}^n$, and $\langle \cdot, \cdot \rangle_n$ denotes the canonical pairing of a point/vector $x = [x^1, \ldots, x^n] \in \mathbb{R}^n$ and a point/co-vector $u = [u_1, \ldots, u_n] \in \widetilde{\mathbb{R}}^n$ (dual to $\mathbb{R}^n$):

$$\langle x, u \rangle_n = \sum_{i=1}^n x^i u_i. \tag{1.34}$$

(Where there is no danger of confusion, the subscript n in $\langle \cdot, \cdot \rangle_n$ is often omitted.)
A basic fact in convex analysis is that the necessary and sufficient condition for a smooth function $\Phi$ to be strictly convex is

$$B_\Phi(x, y) > 0 \tag{1.35}$$

for $x \neq y$.
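
A minimal sketch of (1.33) in code may help fix the notation. The potential used here, $\Phi(x) = \sum_i (x^i \log x^i - x^i)$ on the positive orthant, is our illustrative choice; for normalized vectors its Bregman divergence coincides with the Kullback-Leibler divergence. Function names are hypothetical.

```python
import math

def bregman(phi, grad_phi, x, y):
    """B_Phi(x, y) = Phi(x) - Phi(y) - <x - y, grad Phi(y)>, following (1.33)."""
    g = grad_phi(y)
    return phi(x) - phi(y) - sum((xi - yi) * gi for xi, yi, gi in zip(x, y, g))

# Illustrative potential (an assumption): Phi(x) = sum_i (x_i log x_i - x_i).
def neg_entropy(x):
    return sum(t * math.log(t) - t for t in x)

def grad_neg_entropy(x):
    return [math.log(t) for t in x]

x = [0.2, 0.5, 0.3]
y = [0.4, 0.4, 0.2]

b_xy = bregman(neg_entropy, grad_neg_entropy, x, y)
b_yx = bregman(neg_entropy, grad_neg_entropy, y, x)
# Kullback-Leibler divergence, valid here because x and y each sum to one.
kl_xy = sum(a * math.log(a / b) for a, b in zip(x, y))
```

Both `b_xy` and `b_yx` are positive, as (1.35) requires for a strictly convex potential, and they differ from each other, illustrating that a divergence need not be symmetric.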
 
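Recall that, when $\Phi$ is convex, its convex conjugate $\widetilde{\Phi}: \widetilde{V} \subseteq \widetilde{\mathbb{R}}^n \to \mathbb{R}$ is defined through the Legendre transform:

$$\widetilde{\Phi}(u) = \langle (\partial\Phi)^{-1}(u), u \rangle - \Phi((\partial\Phi)^{-1}(u)), \tag{1.36}$$

with $\widetilde{\widetilde{\Phi}} = \Phi$ and $(\partial\Phi) = (\partial\widetilde{\Phi})^{-1}$. The function $\widetilde{\Phi}$ is also convex, and through it (1.35) precisely expresses the Fenchel inequality

$$\Phi(x) + \widetilde{\Phi}(u) - \langle x, u \rangle \geq 0$$

for any $x \in V$, $u \in \widetilde{V}$, with equality holding if and only if

$$u = (\partial\Phi)(x) = (\partial\widetilde{\Phi})^{-1}(x) \longleftrightarrow x = (\partial\widetilde{\Phi})(u) = (\partial\Phi)^{-1}(u), \tag{1.37}$$

or, in component form,

$$u_i = \frac{\partial \Phi}{\partial x^i} \longleftrightarrow x^i = \frac{\partial \widetilde{\Phi}}{\partial u_i}. \tag{1.38}$$

With the aid of conjugate variables, we can introduce the “canonical divergence” $A_\Phi: V \times \widetilde{V} \to \mathbb{R}_+$ (and $A_{\widetilde{\Phi}}: \widetilde{V} \times V \to \mathbb{R}_+$):

$$A_\Phi(x, v) = \Phi(x) + \widetilde{\Phi}(v) - \langle x, v \rangle = A_{\widetilde{\Phi}}(v, x).$$

They are related to the Bregman divergence (1.33) via

$$B_\Phi(x, (\partial\Phi)^{-1}(v)) = A_\Phi(x, v) = B_{\widetilde{\Phi}}(v, (\partial\Phi)(x)).$$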
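
The Legendre transform (1.36), the Fenchel inequality, and the link between the canonical and Bregman divergences can be checked in one dimension. The sketch below assumes $\Phi(x) = e^x$, whose conjugate is $\widetilde{\Phi}(u) = u \log u - u$ for $u > 0$ with $(\partial\Phi)^{-1}(v) = \log v$; these choices and the function names are ours.

```python
import math

def phi(x):
    """Illustrative strictly convex potential Phi(x) = exp(x)."""
    return math.exp(x)

def phi_conj(u):
    """Its Legendre conjugate u*log(u) - u, defined for u > 0."""
    return u * math.log(u) - u

def canonical(x, v):
    """Canonical divergence A_Phi(x, v) = Phi(x) + Phi~(v) - x*v."""
    return phi(x) + phi_conj(v) - x * v

def bregman(x, y):
    """B_Phi(x, y) of (1.33) for Phi = exp, whose derivative is again exp."""
    return phi(x) - phi(y) - (x - y) * math.exp(y)

x, v = 0.7, 1.3
a_xv = canonical(x, v)                       # strictly positive since v != exp(x)
b_val = bregman(x, math.log(v))              # same value as a_xv
at_equality = canonical(x, math.exp(x))      # Fenchel equality case, close to 0
```

The value `at_equality` vanishes up to rounding because $v = (\partial\Phi)(x)$ there, which is the equality case (1.37).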

1.4 $D_\Phi^{(\alpha)}$-Divergence and Its Induced Structures

In this section, we study a particular parametric family of divergence functions, called $D_\Phi^{(\alpha)}$, induced by a strictly convex function $\Phi$, with $\alpha$ as the parameter. This family was first introduced by Zhang [23], who showed that it included many familiar families (see also [27]). The resulting geometric structures will be studied below.

1.4.1 $D_\Phi^{(\alpha)}$-Divergence Functions

Recall that, by definition, a strictly convex function $\Phi: V \subseteq \mathbb{R}^n \to \mathbb{R}$, $x \mapsto \Phi(x)$, satisfies

$$\frac{1-\alpha}{2} \Phi(x) + \frac{1+\alpha}{2} \Phi(y) - \Phi\left( \frac{1-\alpha}{2} x + \frac{1+\alpha}{2} y \right) > 0 \tag{1.39}$$

for all $x \neq y$ and any $|\alpha| < 1$ (the inequality sign is reversed when $|\alpha| > 1$). Assume $\Phi$ to be sufficiently smooth (differentiable up to fourth order).
Zhang [23] introduced the following family of functions on $V \times V$ as indexed by $\alpha \in \mathbb{R}$:

$$D_\Phi^{(\alpha)}(x, y) = \frac{4}{1 - \alpha^2} \left( \frac{1-\alpha}{2} \Phi(x) + \frac{1+\alpha}{2} \Phi(y) - \Phi\left( \frac{1-\alpha}{2} x + \frac{1+\alpha}{2} y \right) \right). \tag{1.40}$$

From its construction, $D_\Phi^{(\alpha)}(x, y)$ is non-negative for $|\alpha| < 1$ due to Eq. (1.39), and for $|\alpha| = 1$ due to Eq. (1.35). For $|\alpha| > 1$, assuming $\left( \frac{1-\alpha}{2} x + \frac{1+\alpha}{2} y \right) \in V$, the non-negativity of $D_\Phi^{(\alpha)}(x, y)$ can also be proven due to the inequality (1.39) reversing its sign. Furthermore, $D_\Phi^{(\pm 1)}(x, y)$ is defined by taking $\lim_{\alpha \to \pm 1}$:

$$\begin{aligned}
D_\Phi^{(1)}(x, y) &= D_\Phi^{(-1)}(y, x) = B_\Phi(x, y), \\
D_\Phi^{(-1)}(x, y) &= D_\Phi^{(1)}(y, x) = B_\Phi(y, x).
\end{aligned}$$

Note that $D_\Phi^{(\alpha)}(x, y)$ satisfies the relation (called “referential duality” in [24])

$$D_\Phi^{(\alpha)}(x, y) = D_\Phi^{(-\alpha)}(y, x),$$

that is, exchanging the asymmetric status of the two points (in the directed distance) amounts to $\alpha \leftrightarrow -\alpha$.
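
The definition (1.40), the referential duality, and the limiting Bregman form can be explored with a short scalar sketch. It assumes $\Phi = \exp$ and approximates the limit $\alpha \to 1$ by a value of $\alpha$ close to 1; both choices, and the function names, are illustrative assumptions.

```python
import math

def d_alpha(phi, x, y, alpha):
    """D_Phi^(alpha)(x, y) of (1.40) for scalar x, y and alpha != +1, -1."""
    a, b = (1.0 - alpha) / 2.0, (1.0 + alpha) / 2.0
    return 4.0 / (1.0 - alpha ** 2) * (a * phi(x) + b * phi(y) - phi(a * x + b * y))

def bregman_exp(x, y):
    """Bregman divergence (1.33) for Phi = exp."""
    return math.exp(x) - math.exp(y) - (x - y) * math.exp(y)

x, y = 0.4, -0.9

lhs = d_alpha(math.exp, x, y, 0.3)            # D^(alpha)(x, y)
rhs = d_alpha(math.exp, y, x, -0.3)           # D^(-alpha)(y, x)
near_one = d_alpha(math.exp, x, y, 1 - 1e-6)  # approximates the alpha -> 1 limit
outside = d_alpha(math.exp, x, y, 2.0)        # |alpha| > 1 remains non-negative
```

In this example `lhs` and `rhs` agree, `near_one` is close to `bregman_exp(x, y)`, and `outside` is positive, in line with the statements above.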

1.4.2 Induced α-Hessian Structure on TM

We start by reviewing a main result from [23] linking the divergence function $D_\Phi^{(\alpha)}(x, y)$ defined in (1.40) and the α-Hessian structure.
Proposition 9 ([23]) The manifold $\{M, g(x), \Gamma^{(\alpha)}(x), \Gamma^{(-\alpha)}(x)\}$⁴ associated with $D_\Phi^{(\alpha)}(x, y)$ is given by

$$g_{ij}(x) = \Phi_{ij} \tag{1.41}$$

and

$$\Gamma^{(\alpha)}_{ij,k}(x) = \frac{1-\alpha}{2} \Phi_{ijk}, \qquad \Gamma^{*(\alpha)}_{ij,k}(x) = \frac{1+\alpha}{2} \Phi_{ijk}. \tag{1.42}$$

Here, $\Phi_{ij}$, $\Phi_{ijk}$ denote, respectively, second and third partial derivatives of $\Phi(x)$:

$$\Phi_{ij} = \frac{\partial^2 \Phi(x)}{\partial x^i \partial x^j}, \qquad \Phi_{ijk} = \frac{\partial^3 \Phi(x)}{\partial x^i \partial x^j \partial x^k}.$$

⁴ The functional argument of x (or u, below) indicates that the x-coordinate system (or u-coordinate system) is being used. Recall from Sect. 1.2.5 that under x (u, resp.) local coordinates, g and $\Gamma$, in component forms, are expressed by lower (upper, resp.) indices.
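
The expressions (1.41) and (1.42) can be compared against Eguchi's formulas $g = -D_{i,j}$, $\Gamma = -D_{ij,k}$, $\Gamma^* = -D_{k,ij}$ on the diagonal. The one-dimensional sketch below does this by central finite differences, assuming $\Phi = \exp$ and $\alpha = 0.3$; the stencil helper and step size are our assumptions, not part of the original derivation.

```python
import math

# Central-difference stencils (weights before dividing by the power of h).
STENCILS = {
    1: {-1: -0.5, 1: 0.5},           # first derivative
    2: {-1: 1.0, 0: -2.0, 1: 1.0},   # second derivative
}

def mixed_partial(f, x, y, nx, ny, h=1e-2):
    """Estimate d^nx/dx^nx d^ny/dy^ny f(x, y) for nx, ny in {1, 2}."""
    total = 0.0
    for i, wi in STENCILS[nx].items():
        for j, wj in STENCILS[ny].items():
            total += wi * wj * f(x + i * h, y + j * h)
    return total / h ** (nx + ny)

ALPHA = 0.3

def d_alpha(x, y):
    """D_Phi^(alpha)(x, y) of (1.40) with the illustrative choice Phi = exp."""
    a, b = (1.0 - ALPHA) / 2.0, (1.0 + ALPHA) / 2.0
    bracket = a * math.exp(x) + b * math.exp(y) - math.exp(a * x + b * y)
    return 4.0 / (1.0 - ALPHA ** 2) * bracket

p = 0.5
g = -mixed_partial(d_alpha, p, p, 1, 1)            # compare with Phi''(p)
gamma = -mixed_partial(d_alpha, p, p, 2, 1)        # compare with (1-alpha)/2 * Phi'''(p)
gamma_star = -mixed_partial(d_alpha, p, p, 1, 2)   # compare with (1+alpha)/2 * Phi'''(p)
```

Since all derivatives of $\exp$ equal $e^p$, the three estimates should be close to $e^p$, $0.35\, e^p$ and $0.65\, e^p$; their sum $\Gamma + \Gamma^*$ reproduces the derivative of the metric, as conjugacy requires in one dimension.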
Recall that an α-Hessian manifold is equipped with an α-independent metric and a family of α-transitively flat connections $\Gamma^{(\alpha)}$ (i.e., $\Gamma^{(\alpha)}$ satisfying (1.16) and $\Gamma^{(\pm 1)}$ are dually flat). From (1.42),

$$\Gamma^{*(\alpha)}_{ij,k} = \Gamma^{(-\alpha)}_{ij,k},$$

with the Levi-Civita connection given as:

$$\widehat{\Gamma}_{ij,k}(x) = \frac{1}{2} \Phi_{ijk}.$$

Straightforward calculation shows that:

Proposition 10 ([26]) For an α-Hessian manifold $\{M, g(x), \Gamma^{(\alpha)}(x), \Gamma^{(-\alpha)}(x)\}$,
(i) the Riemann curvature tensor of the α-connection is given by:

$$R^{(\alpha)}_{\mu\nu ij}(x) = \frac{1 - \alpha^2}{4} \sum_{l,k} (\Phi_{il\nu} \Phi_{jk\mu} - \Phi_{il\mu} \Phi_{jk\nu}) \Psi^{lk} = R^{*(\alpha)}_{ij\mu\nu}(x),$$

with $\Psi^{ij}$ being the matrix inverse of $\Phi_{ij}$;
(ii) all α-connections are equiaffine, with the α-parallel volume forms (i.e., the volume forms that are parallel under α-connections) given by

$$\omega^{(\alpha)}(x) = \det[\Phi_{ij}(x)]^{\frac{1-\alpha}{2}}.$$

It is worth pointing out that while the $D_\Phi^{(\alpha)}$-divergence induces the α-Hessian structure, it is not unique, as the same structure can arise from the following divergence function, which is a mixture of Bregman divergences in conjugate forms:

$$\frac{1-\alpha}{2} B_\Phi(x, y) + \frac{1+\alpha}{2} B_\Phi(y, x).$$

1.4.3 The Family of α-Geodesics

The family of auto-parallel curves on an α-Hessian manifold has an analytic expression. From

$$\frac{d^2 x^i}{ds^2} + \sum_{j,k} \Gamma^{i(\alpha)}_{jk} \frac{dx^j}{ds} \frac{dx^k}{ds} = 0$$

and substituting (1.42), we obtain

$$\sum_i \Phi_{ki} \frac{d^2 x^i}{ds^2} + \frac{1-\alpha}{2} \sum_{i,j} \Phi_{kij} \frac{dx^i}{ds} \frac{dx^j}{ds} = 0 \longleftrightarrow \frac{d^2}{ds^2} \Phi_k\left( \frac{1-\alpha}{2} x \right) = 0.$$

So the auto-parallel curves of an α-Hessian manifold all have the form

$$\Phi_k\left( \frac{1-\alpha}{2} x \right) = a_k s^{(\alpha)} + b_k$$

where the scalar s is the arc length and $a_k, b_k$, $k = 1, 2, \ldots, n$, are constant vectors (determined by a point and the direction along which the auto-parallel curve flows through). For $\alpha = -1$, the auto-parallel curves are given by $u_k = \Phi_k(x) = a_k s + b_k$, so the $u_k$ are affine coordinates, as previously noted.

1.4.3.1 Related Divergences and Geometries

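Note that the metric and conjugate connections in the forms (1.41) and (1.42) are induced from (1.40). Using the convex conjugate $\widetilde{\Phi}: \widetilde{V} \to \mathbb{R}$ given by (1.36), we introduce the following family of divergence functions $\widetilde{D}^{(\alpha)}_{\widetilde{\Phi}}(x, y)$ defined by

$$\widetilde{D}^{(\alpha)}_{\widetilde{\Phi}}(x, y) \equiv D^{(\alpha)}_{\widetilde{\Phi}}((\partial\Phi)(x), (\partial\Phi)(y)).$$

Explicitly written, this new family of divergence functions is

$$\widetilde{D}^{(\alpha)}_{\widetilde{\Phi}}(x, y) = \frac{4}{1 - \alpha^2} \left( \frac{1-\alpha}{2} \widetilde{\Phi}(\partial\Phi(x)) + \frac{1+\alpha}{2} \widetilde{\Phi}(\partial\Phi(y)) - \widetilde{\Phi}\left( \frac{1-\alpha}{2} \partial\Phi(x) + \frac{1+\alpha}{2} \partial\Phi(y) \right) \right).$$

Straightforward calculation shows that $\widetilde{D}^{(\alpha)}_{\widetilde{\Phi}}(x, y)$ induces the α-Hessian structure $\{M, g, \Gamma^{(-\alpha)}, \Gamma^{(\alpha)}\}$ where $\Gamma^{(\mp\alpha)}$ are given by (1.42); that is, the pair of α-connections are themselves “conjugate” (in the sense of $\alpha \leftrightarrow -\alpha$) to those induced by $D_\Phi^{(\alpha)}(x, y)$.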

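Suppose that, instead of choosing $x = [x^1, \ldots, x^n]$ as the local coordinates for the manifold M, we use its biorthogonal counterpart $u = [u_1, \ldots, u_n]$, related to x via (1.38), to index points on M. Under this u-coordinate system, the divergence function $D_\Phi^{(\alpha)}$ between the same two points on M becomes

$$\widetilde{D}^{(\alpha)}_{\Phi}(u, v) \equiv D^{(\alpha)}_{\Phi}((\partial\widetilde{\Phi})(u), (\partial\widetilde{\Phi})(v)).$$

Explicitly written,

$$\widetilde{D}^{(\alpha)}_{\Phi}(u, v) = \frac{4}{1 - \alpha^2} \left( \frac{1-\alpha}{2} \Phi((\partial\Phi)^{-1}(u)) + \frac{1+\alpha}{2} \Phi((\partial\Phi)^{-1}(v)) - \Phi\left( \frac{1-\alpha}{2} (\partial\Phi)^{-1}(u) + \frac{1+\alpha}{2} (\partial\Phi)^{-1}(v) \right) \right).$$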

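Proposition 11 ([23]) The α-Hessian manifold $\{M, g(u), \Gamma^{(\alpha)}(u), \Gamma^{(-\alpha)}(u)\}$ associated with $\widetilde{D}^{(\alpha)}_{\Phi}(u, v)$ is given by

$$g^{ij}(u) = \widetilde{\Phi}^{ij}(u), \tag{1.43}$$

$$\Gamma^{(\alpha)ij,k}(u) = \frac{1+\alpha}{2} \widetilde{\Phi}^{ijk}, \qquad \Gamma^{*(\alpha)ij,k}(u) = \frac{1-\alpha}{2} \widetilde{\Phi}^{ijk}. \tag{1.44}$$

Here, $\widetilde{\Phi}^{ij}$, $\widetilde{\Phi}^{ijk}$ denote, respectively, second and third partial derivatives of $\widetilde{\Phi}(u)$:

$$\widetilde{\Phi}^{ij}(u) = \frac{\partial^2 \widetilde{\Phi}(u)}{\partial u_i \partial u_j}, \qquad \widetilde{\Phi}^{ijk}(u) = \frac{\partial^3 \widetilde{\Phi}(u)}{\partial u_i \partial u_j \partial u_k}.$$

We remark that the same metric (1.43) and the same α-connections (1.44) are induced by $D^{(-\alpha)}_{\widetilde{\Phi}}(u, v) \equiv D^{(\alpha)}_{\widetilde{\Phi}}(v, u)$; this follows as a simple application of the Eguchi relation.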
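An application of (1.23) gives rise to the following relations:

$$\begin{aligned}
\Gamma^{(\alpha)mn,l}(u) &= -\sum_{i,j,k} g^{im}(u)\, g^{jn}(u)\, g^{kl}(u)\, \Gamma^{(-\alpha)}_{ij,k}(x), \\
\Gamma^{*(\alpha)mn,l}(u) &= -\sum_{i,j,k} g^{im}(u)\, g^{jn}(u)\, g^{kl}(u)\, \Gamma^{(\alpha)}_{ij,k}(x), \\
R^{(\alpha)klmn}(u) &= \sum_{i,j,\mu,\nu} g^{ik}(u)\, g^{jl}(u)\, g^{\mu m}(u)\, g^{\nu n}(u)\, R^{(\alpha)}_{ij\mu\nu}(x).
\end{aligned}$$

The volume form associated with $\Gamma^{(\alpha)}$ is

$$\omega^{(\alpha)}(u) = \det[\widetilde{\Phi}^{ij}(u)]^{\frac{1+\alpha}{2}}.$$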
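Table 1.1 Various divergence functions under biorthogonal coordinates x or u and their induced geometries

Divergence function                                                                  Induced geometry
$D^{(\alpha)}_{\Phi}(x, y)$                                                          $\{\Phi_{ij}(x), \Gamma^{(\alpha)}(x), \Gamma^{(-\alpha)}(x)\}$
$D^{(\alpha)}_{\widetilde{\Phi}}((\partial\Phi)(x), (\partial\Phi)(y))$              $\{\Phi_{ij}(x), \Gamma^{(-\alpha)}(x), \Gamma^{(\alpha)}(x)\}$
$D^{(\alpha)}_{\widetilde{\Phi}}(u, v)$                                              $\{\widetilde{\Phi}^{ij}(u), \Gamma^{(-\alpha)}(u), \Gamma^{(\alpha)}(u)\}$
$D^{(\alpha)}_{\Phi}((\partial\widetilde{\Phi})(u), (\partial\widetilde{\Phi})(v))$  $\{\widetilde{\Phi}^{ij}(u), \Gamma^{(\alpha)}(u), \Gamma^{(-\alpha)}(u)\}$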

When $\alpha = \pm 1$, $\widetilde{D}^{(\alpha)}_{\Phi}(u, v)$, as well as $\widetilde{D}^{(\alpha)}_{\widetilde{\Phi}}(x, y)$ introduced earlier, take the form of the Bregman divergence (1.33). In this case, the manifold is dually flat, with Riemann curvature tensor $R^{(\pm 1)}_{ij\mu\nu}(x) = R^{(\pm 1)klmn}(u) = 0$.
We summarize the relations between the convex-based divergence functions and the geometry they generate in Table 1.1. The duality associated with $\alpha \leftrightarrow -\alpha$ is called “referential duality” whereas the duality associated with $x \leftrightarrow u$ is called “representational duality” [23, 24, 27].

1.4.4 Induced Symplectic and Kähler Structures on M × M

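With respect to the $D_\Phi^{(\alpha)}$-divergence (1.40), observe that

$$\Phi\left( \frac{1-\alpha}{2} x + \frac{1+\alpha}{2} y \right) = \Phi\left( \Big( \frac{1-\alpha}{4} + \frac{1+\alpha}{4\sqrt{-1}} \Big) z + \Big( \frac{1-\alpha}{4} - \frac{1+\alpha}{4\sqrt{-1}} \Big) \bar{z} \right) \equiv \widehat{\Phi}^{(\alpha)}(z, \bar{z}), \tag{1.45}$$

we have

$$\frac{\partial^2 \widehat{\Phi}^{(\alpha)}}{\partial z^i \partial \bar{z}^j} = \frac{1 + \alpha^2}{8} \Phi_{ij}\left( \Big( \frac{1-\alpha}{4} + \frac{1+\alpha}{4\sqrt{-1}} \Big) z + \Big( \frac{1-\alpha}{4} - \frac{1+\alpha}{4\sqrt{-1}} \Big) \bar{z} \right)$$

which is symmetric in i, j. Both (1.31) and (1.32) are satisfied. The symplectic form, under the complex coordinates, is given by

$$\omega^{(\alpha)} = \Phi_{ij}\left( \frac{1-\alpha}{2} x + \frac{1+\alpha}{2} y \right) dx^i \wedge dy^j = \frac{4\sqrt{-1}}{1 + \alpha^2} \frac{\partial^2 \widehat{\Phi}^{(\alpha)}}{\partial z^i \partial \bar{z}^j}\, dz^i \wedge d\bar{z}^j$$

and the line-element is given by

$$ds^2 = \frac{8}{1 + \alpha^2} \frac{\partial^2 \widehat{\Phi}^{(\alpha)}}{\partial z^i \partial \bar{z}^j}\, dz^i \otimes d\bar{z}^j.$$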
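
The coefficient $(1 + \alpha^2)/8$ is the squared modulus of the complex weight multiplying z in (1.45), since in one dimension $\partial^2 \Phi(cz + \bar{c}\bar{z}) / \partial z \partial \bar{z} = |c|^2 \Phi''$. The short check below uses Python's built-in complex numbers; the sample values of $\alpha$, x and y are arbitrary and the variable names are ours.

```python
alpha = 0.3
a, b = (1 - alpha) / 2, (1 + alpha) / 2

# Complex weight of z in (1.45); dividing by 4j equals multiplying by -j/4.
c = (1 - alpha) / 4 + (1 + alpha) / 4j

x, y = 0.8, -1.1
z = complex(x, y)

# The argument of Phi written in complex coordinates; it is real-valued.
w = c * z + c.conjugate() * z.conjugate()

modulus_squared = abs(c) ** 2          # compare with (1 + alpha^2) / 8
real_argument = a * x + b * y          # compare with the real part of w
```

Here `w` has zero imaginary part and its real part equals `real_argument`, confirming that (1.45) is merely a rewriting of the same point in complex coordinates.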

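Proposition 12 ([28]) A smooth, strictly convex function $\Phi: \mathrm{dom}(\Phi) \subset M \to \mathbb{R}$ induces a family of Kähler structures $(M, \omega^{(\alpha)}, G^{(\alpha)})$ defined on $\mathrm{dom}(\Phi) \times \mathrm{dom}(\Phi) \subset M \times M$ with

1. the symplectic form $\omega^{(\alpha)}$ given by

$$\omega^{(\alpha)} = \Phi^{(\alpha)}_{ij}\, dx^i \wedge dy^j$$

which is compatible with the canonical almost complex structure J,

$$\omega^{(\alpha)}(JX, JY) = \omega^{(\alpha)}(X, Y),$$

where X, Y are vector fields on $\mathrm{dom}(\Phi) \times \mathrm{dom}(\Phi)$;

2. the Riemannian metric $G^{(\alpha)}$, compatible with J and $\omega^{(\alpha)}$ above, given by

$$G^{(\alpha)} = \Phi^{(\alpha)}_{ij} (dx^i\, dx^j + dy^i\, dy^j);$$

3. the Kähler structure

$$ds^{2(\alpha)} = \Phi^{(\alpha)}_{ij}\, dz^i \otimes d\bar{z}^j = \frac{8}{1 + \alpha^2} \frac{\partial^2 \widehat{\Phi}^{(\alpha)}}{\partial z^i \partial \bar{z}^j}\, dz^i \otimes d\bar{z}^j,$$

with the Kähler potential given by

$$\frac{2}{1 + \alpha^2} \widehat{\Phi}^{(\alpha)}(z, \bar{z}).$$

Here, $\Phi^{(\alpha)}_{ij} = \Phi_{ij}\left( \frac{1-\alpha}{2} x + \frac{1+\alpha}{2} y \right)$.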

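For the diagonal manifold $\Delta_M = \{(x, x) : x \in M\}$, a basis of its tangent space $T_{(x,x)}\Delta_M$ can be selected as

$$e_i = \frac{1}{\sqrt{2}} \left( \frac{\partial}{\partial x^i} + \frac{\partial}{\partial y^i} \right).$$

The Riemannian metric on the diagonal, induced from $G^{(\alpha)}$, is

$$\begin{aligned}
G^{(\alpha)}(e_i, e_j)\big|_{x=y} &= \langle G^{(\alpha)}, e_i \otimes e_j \rangle \\
&= \left\langle \Phi^{(\alpha)}_{kl} (dx^k \otimes dx^l + dy^k \otimes dy^l),\ \frac{1}{\sqrt{2}} \Big( \frac{\partial}{\partial x^i} + \frac{\partial}{\partial y^i} \Big) \otimes \frac{1}{\sqrt{2}} \Big( \frac{\partial}{\partial x^j} + \frac{\partial}{\partial y^j} \Big) \right\rangle \\
&= \Phi^{(\alpha)}_{ij}(x, x) = \Phi_{ij}(x),
\end{aligned}$$

where $\langle \alpha, a \rangle$ denotes a form $\alpha$ operating on a tensor field a. Therefore, restricting to the diagonal $\Delta_M$, $G^{(\alpha)}$ reduces to the Riemannian metric induced by the divergence $D_\Phi^{(\alpha)}$ through the Eguchi method.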
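We next calculate the Levi-Civita connection $\tilde{\Gamma}$ associated with $G^{(\alpha)}$. Denote $x^{i'} = y^i$, so that

$$\tilde{\Gamma}_{i'jk'} = G^{(\alpha)}\Big( \nabla_{\frac{\partial}{\partial x^{i'}}} \frac{\partial}{\partial x^j}, \frac{\partial}{\partial x^{k'}} \Big) = G^{(\alpha)}\Big( \nabla_{\frac{\partial}{\partial y^i}} \frac{\partial}{\partial x^j}, \frac{\partial}{\partial y^k} \Big),$$

and so on. The Levi-Civita connection on $M \times M$ is

$$\begin{aligned}
\tilde{\Gamma}_{ijk} &= \frac{1}{2} \left( \frac{\partial G^{(\alpha)}_{ik}}{\partial x^j} + \frac{\partial G^{(\alpha)}_{jk}}{\partial x^i} - \frac{\partial G^{(\alpha)}_{ij}}{\partial x^k} \right) = \frac{1-\alpha}{4} \Phi^{(\alpha)}_{ijk}, \\
\tilde{\Gamma}_{ijk'} &= \frac{1}{2} \left( \frac{\partial G^{(\alpha)}_{ik'}}{\partial x^j} + \frac{\partial G^{(\alpha)}_{jk'}}{\partial x^i} - \frac{\partial G^{(\alpha)}_{ij}}{\partial x^{k'}} \right) = -\frac{1+\alpha}{4} \Phi^{(\alpha)}_{ijk}, \\
\tilde{\Gamma}_{i'jk'} = \tilde{\Gamma}_{ij'k'} &= \frac{1}{2} \left( \frac{\partial G^{(\alpha)}_{ik'}}{\partial x^{j'}} + \frac{\partial G^{(\alpha)}_{j'k'}}{\partial x^i} - \frac{\partial G^{(\alpha)}_{ij'}}{\partial x^{k'}} \right) = \frac{1-\alpha}{4} \Phi^{(\alpha)}_{ijk}, \\
\tilde{\Gamma}_{i'jk} = \tilde{\Gamma}_{ij'k} &= \frac{1}{2} \left( \frac{\partial G^{(\alpha)}_{ik}}{\partial x^{j'}} + \frac{\partial G^{(\alpha)}_{j'k}}{\partial x^i} - \frac{\partial G^{(\alpha)}_{ij'}}{\partial x^k} \right) = \frac{1+\alpha}{4} \Phi^{(\alpha)}_{ijk}, \\
\tilde{\Gamma}_{i'j'k} &= \frac{1}{2} \left( \frac{\partial G^{(\alpha)}_{i'k}}{\partial x^{j'}} + \frac{\partial G^{(\alpha)}_{j'k}}{\partial x^{i'}} - \frac{\partial G^{(\alpha)}_{i'j'}}{\partial x^k} \right) = -\frac{1-\alpha}{4} \Phi^{(\alpha)}_{ijk}, \\
\tilde{\Gamma}_{i'j'k'} &= \frac{1}{2} \left( \frac{\partial G^{(\alpha)}_{i'k'}}{\partial x^{j'}} + \frac{\partial G^{(\alpha)}_{j'k'}}{\partial x^{i'}} - \frac{\partial G^{(\alpha)}_{i'j'}}{\partial x^{k'}} \right) = \frac{1+\alpha}{4} \Phi^{(\alpha)}_{ijk}.
\end{aligned}$$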

1.5 Summary

In order to construct divergence functions in a principled way, this chapter considered the various geometric structures on the underlying manifold M induced from a divergence function. Among the geometric structures considered are: statistical structure (Riemannian metric with a pair of torsion-free dual connections, or by simple construction, a family of α-connections), equiaffine structure (those connections that admit parallel volume forms), and Hessian structure (those connections that are dually flat). They are progressively more restrictive: while any divergence function will induce a statistical manifold, only the canonical divergence (i.e., Bregman divergence) will induce a Hessian manifold. Lying in between these extremes is the equiaffine α-Hessian geometry induced from, say, the class of $D_\Phi^{(\alpha)}$-divergences. The α-Hessian structure has the advantage of the existence of biorthogonal coordinates, induced from the convex function $\Phi$ and its conjugate; these coordinates are convenient for computation. It should be noted that the above geometric structures, from statistical to Hessian, are all induced on the tangent bundle $TM$ of the manifold M on which the divergence function is defined.

On the cotangent bundle $T^*M$ side, a divergence function can be viewed as a generating function for a symplectic structure on $M \times M$ that can be constructed in a “canonical” way. This imposes a “properness” condition on the divergence function, stating that the mixed second derivatives of $D(x, y)$ with respect to x and y must commute. For such divergence functions, a Riemannian structure on $M \times M$ can be constructed, which can be seen as an extension of the Riemannian structure on $\Delta_M \subset M \times M$. If a further condition on D is imposed, then $M \times M$ may be complexified, so that it becomes a Kähler manifold. It was shown that the $D_\Phi^{(\alpha)}$-divergence [23] satisfies this Kählerian condition, in addition to itself being proper; the Kähler potential is simply given by the real-valued convex function $\Phi$. These properties, along with the α-Hessian structure it induces on the tangent bundle, make $D_\Phi^{(\alpha)}$ a class of divergence functions that enjoy a special role with the “nicest” geometric properties, extending the canonical (Bregman) divergence for dually flat manifolds. This will have implications for machine learning, convex optimization, geometric mechanics, etc.

References

1. Amari, S.: Differential Geometric Methods in Statistics. Lecture Notes in Statistics, vol. 28.
Springer, New York (1985) (Reprinted in 1990)
2. Amari, S., Nagaoka, H.: Methods of Information Geometry. AMS Monograph. Oxford University Press, Oxford (2000)
3. Barndorff-Nielsen, O.E., Jupp, P.E.: Yorks and symplectic structures. J. Stat. Plan. Inference
63, 133–146 (1997)
4. Bregman, L.M.: The relaxation method of finding the common point of convex sets and its
application to the solution of problems in convex programming. USSR Comput. Math. Phys.
7, 200–217 (1967)
5. Calin, O., Matsuzoe, H., Zhang, J.: Generalizations of conjugate connections. In: Sekigawa,
K., Gerdjikov, V., Dimiev, S. (eds.) Trends in Differential Geometry, Complex Analysis and
Mathematical Physics: Proceedings of the 9th International Workshop on Complex Structures
and Vector Fields, pp. 24–34. World Scientific Publishing, Singapore (2009)
6. Csiszár, I.: On topological properties of f-divergences. Studia Scientiarum Mathematicarum Hungarica 2, 329–339 (1967)
7. Dombrowski, P.: On the geometry of the tangent bundle. Journal für die reine und angewandte Mathematik 210, 73–88 (1962)
8. Eguchi, S.: Second order efficiency of minimum contrast estimators in a curved exponential
family. Ann. Stat. 11, 793–803 (1983)
9. Eguchi, S.: Geometry of minimum contrast. Hiroshima Math. J. 22, 631–647 (1992)
10. Lauritzen, S.: Statistical manifolds. In: Amari, S., Barndorff-Nielsen, O., Kass, R., Lauritzen,
S., Rao, C.R. (eds.) Differential Geometry in Statistical Inference. IMS Lecture Notes, vol. 10,
pp. 163–216. Institute of Mathematical Statistics, Hayward (1987)
11. Matsuzoe, H.: On realization of conformally-projectively flat statistical manifolds and the
divergences. Hokkaido Math. J. 27, 409–421 (1998)
12. Matsuzoe, H., Inoguchi, J.: Statistical structures on tangent bundles. Appl. Sci. 5, 55–65 (2003)
13. Matsuzoe, H., Takeuchi, J., Amari, S.: Equiaffine structures on statistical manifolds and
Bayesian statistics. Differ. Geom. Appl. 24, 567–578 (2006)
14. Matsuzoe, H.: Computational geometry from the viewpoint of affine differential geometry.
In: Nielsen, F. (ed.) Emerging Trends in Visual Computing, pp. 103–123. Springer, Berlin,
Heidelberg (2009)
15. Matsuzoe, H.: Statistical manifolds and affine differential geometry. Adv. Stud. Pure Math.
57, 303–321 (2010)
16. Nomizu, K., Sasaki, T.: Affine Differential Geometry—Geometry of Affine Immersions. Cam-
bridge University Press, Cambridge (1994)
17. Ohara, A., Matsuzoe, H., Amari, S.: Conformal geometry of escort probability and its appli-
cations. Mod. Phys. Lett. B 26, 1250063 (2012)
18. Shima, H.: Hessian Geometry. Shokabo, Tokyo (2001) (in Japanese)
19. Shima, H., Yagi, K.: Geometry of Hessian manifolds. Differ. Geom. Appl. 7, 277–290 (1997)
20. Simon, U.: Affine differential geometry. In: Dillen, F., Verstraelen, L. (eds.) Handbook of
Differential Geometry, vol. I, pp. 905–961. Elsevier Science, Amsterdam (2000)
21. Uohashi, K.: On α-conformal equivalence of statistical manifolds. J. Geom. 75, 179–184 (2002)
22. Yano, K., Ishihara, S.: Tangent and Cotangent Bundles: Differential Geometry, vol. 16. Dekker,
New York (1973)
23. Zhang, J.: Divergence function, duality, and convex analysis. Neural Comput. 16, 159–195
(2004)
24. Zhang, J.: Referential duality and representational duality on statistical manifolds. Proceedings
of the 2nd International Symposium on Information Geometry and Its Applications, Tokyo,
pp. 58–67 (2006)
25. Zhang, J.: A note on curvature of α-connections on a statistical manifold. Ann. Inst. Stat. Math.
59, 161–170 (2007)
26. Zhang, J., Matsuzoe, H.: Dualistic differential geometry associated with a convex function. In:
Gao, D.Y., Sherali, H.D. (eds.) Advances in Applied Mathematics and Global Optimization
(Dedicated to Gilbert Strang on the occasion of his 70th birthday), Advances in Mechanics and
Mathematics, vol. III, Chap. 13, pp. 439–466. Springer, New York (2009)
27. Zhang, J.: Nonparametric information geometry: From divergence function to referential-
representational biduality on statistical manifolds. Entropy 15, 5384–5418 (2013)
28. Zhang, J., Li, F.: Symplectic and Kähler structures on statistical manifolds induced from
divergence functions. In: Nielsen, F., Barbaresco, F. (eds.) Proceedings of the 1st International
Conference on Geometric Science of Information (GSI2013), pp. 595–603 (2013)
Chapter 2
Geometry on Positive Definite Matrices
Deformed by V-Potentials and Its Submanifold
Structure

Atsumi Ohara and Shinto Eguchi

Abstract In this paper we investigate the dually flat structure of the space of positive
definite matrices induced by a class of convex functions called V-potentials, from the
viewpoint of information geometry. It is proved that the geometry is invariant under
special linear group actions and naturally introduces a foliated structure. Each leaf is
proved to be a homogeneous statistical manifold with a negative constant curvature
and to enjoy a special decomposition property of the canonically defined divergence. As
an application to statistics, we finally give the correspondence between the obtained
geometry on the space and the one on elliptical distributions induced from a certain
Bregman divergence.

Keywords Information geometry · Divergence · Elliptical distribution · Negative
constant curvature · Affine differential geometry

2.1 Introduction

The space of positive definite matrices has been studied both geometrically and
algebraically, e.g., as a Riemannian symmetric space and as a cone of squares in a Jordan
algebra ([9, 12, 17, 37, 45] and many others), and the results have broad applications
in analysis, statistics, physics, convex optimization and other fields.
In the development of Riemannian geometry on the space of positive definite
matrices P, the function α(P) = −k log det P with a positive constant k plays a central
role. Important facts are that the function α is strictly convex, and that the Riemannian
metric defined as the Hessian of α and the associated Levi-Civita connection
are invariant under automorphism group actions on the space, i.e., congruent
transformations by the general linear group [37, 45]. This group invariance, i.e., homogeneity
of the space, is also crucial in several applications such as multivariate statistical
analysis [23] and the mathematical programs called semidefinite programs (SDP) [28,
46].

A. Ohara (B)
Department of Electrical and Electronics Engineering,
University of Fukui, Fukui 910-8507, Japan
e-mail: [email protected]
S. Eguchi
The Institute of Statistical Mathematics, Tachikawa, Tokyo 190-0014, Japan
e-mail: [email protected]

F. Nielsen (ed.), Geometric Theory of Information,
Signals and Communication Technology, DOI: 10.1007/978-3-319-05317-2_2,
© Springer International Publishing Switzerland 2014
On the other hand, the space of positive definite matrices admits a dualistic geometry
(information geometry [1, 2], Hessian geometry [40]), which involves a pair
of affine connections instead of the single Levi-Civita connection. This
geometry not only maintains the group invariance but also reveals
an abundant dualistic structure, e.g., the Legendre transform, dual flatness, divergences
and Pythagorean-type relations [34]. Such concepts and results have also
proved natural in the study of applications, e.g., structures of stable matrices
[31], properties of means defined as midpoints of dual geodesics [30], and the iteration
complexity of interior point methods for SDP [15]. In this case, the function α is
still an essential source of the structure and induces the dualistic geometry as a
potential function.
The first purpose of this paper is to investigate the geometry of the space of positive
definite matrices induced by a special class of convex functions, including α, from
the viewpoint of the dualistic geometry. We consider here the class of functions of
the form α^(V)(P) = V(det P) with smooth functions V on the positive real numbers R++,
and call α^(V) a V-potential [32] (see footnote 1). While the dualistic geometry induced
from the log-potential α = α^(−log) is standard in the sense that it is invariant under the
action of the whole automorphism group, the geometries induced from general V-potentials
lose this prominent property. We show, however, that they still preserve invariance under
the special linear group actions. On the basis of this fact and by means of affine
hypersurface theory [29], we exploit a foliated structure and discuss the common
properties of its leaves as homogeneous submanifolds.
Another purpose is to show that the dualistic geometry on the space of positive
definite matrices induced from V-potentials naturally corresponds to the geometry
of a fairly general class of multivariate statistical models.
For a given n × n positive definite matrix P, the n-dimensional Gaussian distribution
with zero mean vector is defined by the density function

$$ f(x, P) = \exp\left\{ -\frac{1}{2} x^T P x - \alpha(P) \right\}, \qquad \alpha(P) = -\frac{1}{2} \log \det P + \frac{n}{2} \log 2\pi, $$

which has a wide variety of applications in probability theory, statistics,
physics and so forth. Information geometry successfully provides elegant geometric
insight, such as the Fisher information metric, the e-connection and the m-connection,
through the theory of exponential families of distributions, which include the Gaussian
family [1, 2, 16, 25]. Note that in this geometry the structure is derived from the
so-called Kullback-Leibler divergence (relative entropy) [18], and the log-potential
α again plays a key role through conjugate convexity.

1 The reference [32] is a conference version of this chapter that omits the proofs and the whole of Sect. 2.5.
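As a small numerical illustration of the role of α(P) as a normalizer, the following sketch integrates the density above in the one-dimensional case n = 1, where P reduces to a positive number p. The function names, the value p = 2.5 and the integration window are assumptions made only for this example.

```python
import math

def alpha(p, n=1):
    # log-potential of the zero-mean Gaussian: -1/2 log det P + n/2 log(2 pi), here with n = 1
    return -0.5 * math.log(p) + 0.5 * n * math.log(2.0 * math.pi)

def density(x, p):
    # f(x, P) = exp{-1/2 x^T P x - alpha(P)} for a scalar precision p
    return math.exp(-0.5 * p * x * x - alpha(p))

def total_mass(p, half_width=20.0, steps=200000):
    # composite trapezoidal rule on [-half_width, half_width]
    h = 2.0 * half_width / steps
    total = 0.5 * (density(-half_width, p) + density(half_width, p))
    for k in range(1, steps):
        total += density(-half_width + k * h, p)
    return total * h

mass = total_mass(p=2.5)   # should be very close to 1
```

The integral is close to one only because α(p) contains both the −(1/2) log p term and the (1/2) log 2π term.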
The Gaussian distribution is naturally extended to a general non-exponential but
symmetric distribution called an elliptical distribution [8, 23, 36], with a density
function of the form

$$ f(x, P) = u\left( -\frac{1}{2} x^T P x - c_U(\det P) \right), \qquad u(s) = \frac{dU(s)}{ds}, $$

where U is a convex function on an interval in R with positive derivative u,
and c_U(det P) is the normalizing constant. On the family of such distributions we
can introduce a new Bregman divergence called a U-divergence, yielding geometric
counterparts that define a natural dually flat geometry of the family. See
[4, 24, 33] for details of the U-divergence and its applications. Further, recent
progress in the theory and applications of generalized Gaussian distributions in
statistics and statistical physics can be found in, e.g., [3, 10, 26, 27, 42].
We prove that the geometry of elliptical distributions derived by the U-divergence
coincides with the dualistic geometry on positive definite matrices induced from
the corresponding V -potential. This implies that the various geometric properties
of the family can be translated to those of dualistic geometry induced from the
corresponding V -potential.
In addition to statistical applications it should be noted that the geometry induced
from V-potentials can be applied to derive a class of quasi-Newton update formulas
in the field of numerical optimization [14].
The paper is organized as follows:
Section 2.2 provides basic results on information geometry including the idea of
statistical manifold.
In Sect. 2.3, we introduce the V-potential function and derive the corresponding
Riemannian metric. The conditions for the V-potential to have a positive definite
Hessian are given.
Section 2.4 focuses on the dual affine connections induced from the V-potential
function. It is observed that a particular class of potentials, called the power potentials,
induces GL(n, R)-invariant dual connections while the others generally do not.
In Sect. 2.5, we introduce a foliated structure, which is naturally defined by the
special linear group action, and discuss the geometric properties of its associated
leaves of constant determinant. We show that each leaf is a Riemannian symmetric
space with respect to the induced Riemannian metric, and that the induced dual
connection is independent of the choice of V-potential. Further, on the basis of
affine hypersurface theory, each leaf is proved to be a statistical submanifold with
constant curvature. As a consequence, we demonstrate that a new decomposition
theorem of the divergence holds for the leaf and a point outside it.
Section 2.6 gives the correspondence between two geometries induced by a
V -potential and U-divergence, as an application to statistics.

2.2 Preliminary: Information Geometry and Statistical Manifolds

We recall the basic notions and results in information geometry [2], i.e., the dualistic
differential geometry of Hessian domains [40] and statistical manifolds [19–21].
For a torsion-free affine connection ∇ and a pseudo-Riemannian metric g on
a manifold M, the triple (M, ∇, g) is called a statistical (Codazzi) manifold if it
admits another torsion-free connection ∇* satisfying

$$ X g(Y, Z) = g(\nabla_X Y, Z) + g(Y, \nabla^*_X Z) \tag{2.1} $$

for arbitrary X, Y and Z in X(M), where X(M) is the set of all tangent vector fields
on M. We call ∇ and ∇* duals of each other with respect to g, and (g, ∇, ∇*) is said
to be a dualistic structure on M.
A statistical manifold (M, ∇, g) is said to be of constant curvature k if the
curvature tensor R of ∇ satisfies

$$ R(X, Y)Z = k\{ g(Y, Z)X - g(X, Z)Y \}. $$

When the constant k is zero, the statistical manifold is said to be flat, or dually flat,
because the curvature tensor R* of ∇* then also vanishes automatically [2].
For τ ∈ R, statistical manifolds (M, ∇, g) and (M, ∇′, g′) are said to be
τ-conformally equivalent if there exists a function δ on M such that

$$ g'(X, Y) = e^{\delta} g(X, Y), $$

$$ g(\nabla'_X Y, Z) = g(\nabla_X Y, Z) - \frac{1+\tau}{2}\, d\delta(Z)\, g(X, Y) + \frac{1-\tau}{2} \{ d\delta(X)\, g(Y, Z) + d\delta(Y)\, g(X, Z) \}. $$

Statistical manifolds (M, ∇, g) and (M, ∇′, g′) are τ-conformally equivalent if and
only if (M, ∇*, g) and (M, ∇′*, g′) are −τ-conformally equivalent [20]. A statistical
manifold (M, ∇, g) is called τ-conformally flat if it is locally τ-conformally
equivalent to a flat statistical manifold.
Let {x^1, …, x^n} be an affine coordinate system of R^n with respect to certain basis
vectors and ∇ be the canonical flat affine connection, i.e., ∇_{∂/∂x^i} ∂/∂x^j = 0. If the
Hessian

$$ g^{(\alpha)} = \sum_{i,j} \frac{\partial^2 \alpha}{\partial x^i \partial x^j}\, dx^i dx^j $$

of a function α on a domain Ω in R^n is non-degenerate, we call (Ω, ∇, g^(α)) a Hessian
domain, which is a manifold with a pseudo-Riemannian metric g^(α) and an affine
connection ∇. A Hessian domain (Ω, ∇, g^(α)) is a flat statistical manifold. Conversely,
a flat statistical manifold is locally a Hessian domain [2, 40].
We denote by ⟨y*, x⟩ the pairing of x in R^n and y* in the dual space denoted by R^n*,
and define the gradient mapping grad α at a point p ∈ Ω by ⟨grad α, X⟩ = dα(X),
identifying X ∈ T_pΩ with X ∈ R^n. Since a Hessian domain (Ω, ∇, g^(α)) possesses a
non-degenerate Hessian g^(α), the gradient mapping at p = (x^1, …, x^n), expressed
in the dual coordinate system {x*_1, …, x*_n} as

$$ \mathrm{grad}\,\alpha : x \mapsto x^* = (x^*_1, \dots, x^*_n), \qquad x^*_i = \frac{\partial \alpha}{\partial x^i}, $$

on a local neighborhood, is invertible on its image. The conjugate function α* of x* is
locally defined by

$$ \alpha^*(x^*) = \langle x^*, (\mathrm{grad}\,\alpha)^{-1}(x^*) \rangle - \alpha((\mathrm{grad}\,\alpha)^{-1}(x^*)). $$

We call the mapping from (x, α) to (x*, α*) the Legendre transformation. Both
functions α and α* are called potentials. We regard {x*_1, …, x*_n} as another coordinate
system on Ω. For a Hessian domain (Ω, ∇, g^(α)), the dual connection of ∇, denoted
by *∇^(α), is characterized as the flat connection with the affine coordinate system
{x*_1, …, x*_n}. Further, the locally defined Hessian

$$ g^{(\alpha^*)} = \sum_{i,j} \frac{\partial^2 \alpha^*}{\partial x^*_i \partial x^*_j}\, dx^*_i dx^*_j $$

is equal to g^(α), i.e., (Ω, *∇^(α), g^(α) = g^(α*)) is locally a Hessian domain [2].
When the conjugate function α* of α can be globally defined on Ω, we define the
canonical divergence D^(α) of (Ω, ∇, g^(α)) by

$$ D^{(\alpha)}(p, q) = \alpha(x(p)) + \alpha^*(x^*(q)) - \langle x^*(q), x(p) \rangle $$

for two points p and q on Ω, where x(·) and x*(·) respectively represent the x- and
x*-coordinates of a point. If g^(α) is positive definite and Ω is a convex domain, then

$$ \text{(i) } D^{(\alpha)}(p, q) \ge 0 \quad \text{and} \quad \text{(ii) } D^{(\alpha)}(p, q) = 0 \iff p = q \tag{2.2} $$

hold because the unique maximum α*(x*(q)) of sup_{p∈Ω} {⟨x*(q), x(p)⟩ − α(x(p))} is
attained when x(p) = (grad α)^{-1}(x*(q)) holds.
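To make the Legendre transformation and the canonical divergence concrete, the following sketch works through a one-dimensional example chosen only for illustration: the potential α(x) = −log x on the positive half-line. For this assumed potential the gradient map is x* = −1/x, the conjugate is α*(x*) = −1 − log(−x*), and the canonical divergence reduces to p/q − log(p/q) − 1.

```python
import math

def alpha(x):
    return -math.log(x)

def grad_alpha(x):
    # dual coordinate x* = d(alpha)/dx
    return -1.0 / x

def alpha_conj(x_star):
    # conjugate potential, valid for x* < 0: <x*, x> - alpha(x) at x = -1/x*
    return -1.0 - math.log(-x_star)

def canonical_divergence(p, q):
    # D(p, q) = alpha(x(p)) + alpha*(x*(q)) - <x*(q), x(p)>
    x_star_q = grad_alpha(q)
    return alpha(p) + alpha_conj(x_star_q) - x_star_q * p

points = [0.1, 0.5, 1.0, 2.0, 7.5]
values = {(p, q): canonical_divergence(p, q) for p in points for q in points}
closed_form = 2.0 / 1.0 - math.log(2.0 / 1.0) - 1.0   # p/q - log(p/q) - 1 at p = 2, q = 1
```

On this grid the divergence is non-negative and vanishes exactly on the diagonal p = q, as (2.2) states.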
Conversely, for a function D : M × M → R satisfying the two conditions (2.2) for
any points p and q in M, define the positive semidefinite form g^(D) and two affine
connections ∇^(D) and *∇^(D) by

$$ g^{(D)}(X, Y) = -D(X|Y), \qquad g^{(D)}(\nabla^{(D)}_X Y, Z) = -D(XY|Z), \qquad g^{(D)}({}^*\nabla^{(D)}_X Y, Z) = -D(Z|XY), \tag{2.3} $$

for X, Y and Z ∈ X(M), where the notation on the right-hand sides stands for

$$ D(X_1 \cdots X_n | Y_1 \cdots Y_m)(p) = (X_1)_p \cdots (X_n)_p (Y_1)_q \cdots (Y_m)_q D(p, q)\big|_{p=q} $$

for X_i, Y_j ∈ X(M). If g^(D) is positive definite, then we call D a divergence or a
contrast function on M. The following result is used in Sect. 2.6.

Proposition 1 ([2, 4]) If D is a divergence on M, then (M, ∇^(D), g^(D)) is a
statistical manifold with the dual connection *∇^(D).
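The first formula in (2.3) can be checked numerically in a simple case. The sketch below assumes the one-dimensional potential α(x) = −log x on the positive half-line, whose canonical divergence is D(p, q) = p/q − log(p/q) − 1, and recovers the metric g = −D(X|Y) by a central mixed finite difference. The step size and the sample point are arbitrary choices for illustration.

```python
import math

def divergence(p, q):
    # canonical divergence of the assumed potential alpha(x) = -log x
    return p / q - math.log(p / q) - 1.0

def metric_from_divergence(x, h=1e-3):
    # g(x) = -(d/dp)(d/dq) D(p, q) evaluated at p = q = x
    mixed = (divergence(x + h, x + h) - divergence(x + h, x - h)
             - divergence(x - h, x + h) + divergence(x - h, x - h)) / (4.0 * h * h)
    return -mixed

x = 2.0
g_numeric = metric_from_divergence(x)
g_hessian = 1.0 / (x * x)   # second derivative of -log x, i.e. the Hessian metric
```

The two numbers agree up to discretization error, which illustrates that the divergence induces back the Hessian metric it was built from.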

2.3 V-Potential Function and Riemannian Metric

Consider the vector space of n × n real symmetric matrices, denoted by Sym(n, R),
with the inner product (X|Y) = tr(XY) for X, Y ∈ Sym(n, R). We identify an element
Y* in the dual space Sym(n, R)* with Y* (denoted by the same symbol) in Sym(n, R)
via ⟨Y*, X⟩ = (Y*|X). Let {E_1, …, E_{n(n+1)/2}} be basis matrices of Sym(n, R), and
consider the affine coordinate system {x^1, …, x^{n(n+1)/2}} with respect to them and the
canonical flat affine connection ∇. Denote by PD(n, R) the convex cone of positive
definite matrices in Sym(n, R).
Let X(PD(n, R)) and T_P PD(n, R) be the set of all tangent vector fields on
PD(n, R) and the tangent vector space at P ∈ PD(n, R), respectively. By identifying
E_i with the natural basis (∂/∂x^i)_P, we represent X_P ∈ T_P PD(n, R) by
X ∈ Sym(n, R). Similarly we regard a Sym(n, R)-valued smooth function X on
PD(n, R) as X ∈ X(PD(n, R)) via the identification of the constant function E_i
with ∂/∂x^i.
Let φ_G denote the congruent transformation by a matrix G, i.e., φ_G X = GXG^T.
The differential of φ_G is denoted by φ_G*. If G is nonsingular, the transformation φ_G
is an element of the automorphism group that acts transitively on PD(n, R).
In Sects. 2.3 and 2.4, we consider the dually flat structure on PD(n, R) as a
Hessian domain induced from a certain class of potential functions.

Definition 1 Let V(s) be a smooth function on positive real numbers s ∈ R++. The
function defined by

$$ \alpha^{(V)}(P) = V(\det P) \tag{2.4} $$

is called a V-potential on PD(n, R).

Let β_i(s), i = 1, 2, 3, be functions defined by

$$ \beta_i(s) = \frac{d\beta_{i-1}(s)}{ds}\, s, \quad i = 1, 2, 3, \qquad \text{where } \beta_0(s) = V(s). $$

We assume that V(s) satisfies the following two conditions:

$$ \text{(i) } \beta_1(s) < 0 \ (s > 0), \qquad \text{(ii) } \gamma^{(V)}(s) = \frac{\beta_2(s)}{\beta_1(s)} < \frac{1}{n} \ (s > 0), \tag{2.5} $$

which are later shown to ensure the convexity of α^(V)(P) on PD(n, R). Note that
the first condition β_1(s) < 0 for all s > 0 implies that the function V(s) is strictly
decreasing on s > 0. Important examples of V(s) satisfying (2.5) are −log s and
c_1 + c_2 s^γ for real parameters c_1, c_2, γ with c_2 γ < 0 and γ < 1/n. Another example,
V(s) = c log(cs + 1) − log s with 0 ≤ c < 1, has proved useful for quasi-Newton
updates [14].
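The conditions (2.5) are easy to probe numerically. The sketch below approximates β_1 and β_2 by central differences and tests both conditions on a grid of s values for n = 3. The grid, the step size and the particular parameter choices (γ = 0.2, γ = 0.5, c = 0.5) are assumptions made for illustration, and a grid check is of course not a proof.

```python
import math

def betas(V, s, h=1e-4):
    # beta_1(s) = s V'(s) and beta_2(s) = s beta_1'(s), by central differences
    def beta1(t):
        return t * (V(t + h) - V(t - h)) / (2.0 * h)
    beta2 = s * (beta1(s + h) - beta1(s - h)) / (2.0 * h)
    return beta1(s), beta2

def satisfies_conditions(V, n, grid):
    # conditions (2.5): beta_1 < 0 and beta_2 / beta_1 < 1/n at every grid point
    for s in grid:
        b1, b2 = betas(V, s)
        if not (b1 < 0.0 and b2 / b1 < 1.0 / n):
            return False
    return True

n = 3
grid = [0.1 * k for k in range(1, 101)]   # s in (0, 10]

def log_potential(s):
    return -math.log(s)

def power_inside(s):
    return 1.0 - s ** 0.2        # c2 * gamma < 0 and gamma = 0.2 < 1/3

def power_outside(s):
    return 1.0 - s ** 0.5        # gamma = 0.5 violates gamma < 1/3

def quasi_newton(s, c=0.5):
    return c * math.log(c * s + 1.0) - math.log(s)

results = {
    "log": satisfies_conditions(log_potential, n, grid),
    "power_inside": satisfies_conditions(power_inside, n, grid),
    "power_outside": satisfies_conditions(power_outside, n, grid),
    "quasi_newton": satisfies_conditions(quasi_newton, n, grid),
}
```

Only the power potential with γ = 0.5 fails, because its ratio β_2/β_1 = γ exceeds 1/n.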
Using the formula grad det P = (det P)P^{-1}, we obtain the gradient mapping
grad α^(V) and the differential form dα^(V), respectively, as

$$ \mathrm{grad}\,\alpha^{(V)} : P \mapsto P^* = \beta_1(\det P) P^{-1}, \tag{2.6} $$

$$ d\alpha^{(V)} : X \mapsto d\alpha^{(V)}(X) = \beta_1(\det P)\, \mathrm{tr}(P^{-1} X). \tag{2.7} $$

The Hessian of α^(V) at P ∈ PD(n, R), which we denote by g_P^(V), is calculated as

$$ g^{(V)}_P(X, Y) = d(d\alpha^{(V)}(X))(Y) = -\beta_1(\det P)\, \mathrm{tr}(P^{-1} X P^{-1} Y) + \beta_2(\det P)\, \mathrm{tr}(P^{-1} X)\, \mathrm{tr}(P^{-1} Y). $$

Theorem 1 The Hessian g^(V) is positive definite on PD(n, R) if and only if the
conditions (2.5) hold.
The proof can be found in Appendix A.
Remark 1 The Hessian g^(V) is SL(n, R)-invariant, i.e., g^(V)_{P′}(X′, Y′) = g^(V)_P(X, Y)
for any G ∈ SL(n, R), where P′ = φ_G P, X′ = φ_G* X and Y′ = φ_G* Y.
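The Hessian formula and Remark 1 can be checked numerically for 2 × 2 matrices. The sketch below assumes the potential V(s) = (1 − s^γ)/γ with γ = 0.3 (so that γ < 1/n for n = 2), compares the closed-form Hessian with a finite-difference second derivative of V(det P), and then applies a congruence by a matrix G of determinant one. All helper names and test matrices are hypothetical.

```python
GAMMA = 0.3   # assumed exponent; needs GAMMA < 1/n with n = 2

def V(s): return (1.0 - s ** GAMMA) / GAMMA
def beta1(s): return -s ** GAMMA             # beta_1(s) = s V'(s)
def beta2(s): return -GAMMA * s ** GAMMA     # beta_2(s) = s beta_1'(s)

def det(A): return A[0][0] * A[1][1] - A[0][1] * A[1][0]
def tr(A): return A[0][0] + A[1][1]
def inv(A):
    d = det(A)
    return [[A[1][1] / d, -A[0][1] / d], [-A[1][0] / d, A[0][0] / d]]
def mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]
def transpose(A): return [[A[0][0], A[1][0]], [A[0][1], A[1][1]]]
def axpy(A, c, B):
    # entrywise A + c * B
    return [[A[i][j] + c * B[i][j] for j in range(2)] for i in range(2)]

def hessian(P, X, Y):
    # closed form: -beta1 tr(P^-1 X P^-1 Y) + beta2 tr(P^-1 X) tr(P^-1 Y)
    s, Pi = det(P), inv(P)
    A, B = mul(Pi, X), mul(Pi, Y)
    return -beta1(s) * tr(mul(A, B)) + beta2(s) * tr(A) * tr(B)

def hessian_fd(P, X, Y, h=1e-4):
    # mixed second directional derivative of V(det P) by central differences
    f = lambda M: V(det(M))
    pp = f(axpy(axpy(P, h, X), h, Y))
    pm = f(axpy(axpy(P, h, X), -h, Y))
    mp = f(axpy(axpy(P, -h, X), h, Y))
    mm = f(axpy(axpy(P, -h, X), -h, Y))
    return (pp - pm - mp + mm) / (4.0 * h * h)

def congruence(G, A): return mul(mul(G, A), transpose(G))

P = [[2.0, 0.5], [0.5, 1.0]]
X = [[1.0, 0.2], [0.2, -0.5]]
Y = [[0.3, -0.1], [-0.1, 0.8]]
G = [[2.0, 1.0], [0.5, 0.75]]   # det G = 1, so G lies in SL(2, R)

exact = hessian(P, X, Y)
approx = hessian_fd(P, X, Y)
moved = hessian(congruence(G, P), congruence(G, X), congruence(G, Y))
```

Replacing G by a matrix whose determinant differs from one changes det P, and hence β_1 and β_2, which is why the invariance is restricted to SL(n, R) for this potential.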
The conjugate function of α^(V), denoted by α^(V)*, is

$$ \alpha^{(V)*}(P^*) = \sup_{P} \{ \langle P^*, P \rangle - \alpha^{(V)}(P) \}. \tag{2.8} $$

Since the extremal condition is

$$ P^* = \mathrm{grad}\,\alpha^{(V)}(P) = \beta_1(\det P) P^{-1} $$

and grad α^(V) is invertible by the positive definiteness of g^(V), we have
⟨P*, P⟩ = nβ_1(det P) and hence the following expression for α^(V)* with respect to P:

$$ \alpha^{(V)*}(P^*) = n\beta_1(\det P) - \alpha^{(V)}(P). \tag{2.9} $$

Hence the canonical divergence D^(V) of (PD(n, R), ∇, g^(V)) is obtained as

$$ D^{(V)}(P, Q) = \alpha^{(V)}(P) + \alpha^{(V)*}(Q^*) - \langle Q^*, P \rangle = V(\det P) - V(\det Q) + \langle Q^*, Q - P \rangle. \tag{2.10} $$
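A direct implementation of (2.10) for 2 × 2 matrices is sketched below, under the assumption that the pairing is ⟨A, B⟩ = tr(AB) as in Sect. 2.3. For V(s) = −log s the result is compared with the expression tr(Q^{-1}P) − log det(Q^{-1}P) − n, which follows from (2.10) by substitution. The matrices and the exponent γ = 0.3 are illustrative choices.

```python
import math

def det(A): return A[0][0] * A[1][1] - A[0][1] * A[1][0]
def tr(A): return A[0][0] + A[1][1]
def inv(A):
    d = det(A)
    return [[A[1][1] / d, -A[0][1] / d], [-A[1][0] / d, A[0][0] / d]]
def mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def divergence(P, Q, V, beta1):
    # (2.10): D(P, Q) = V(det P) - V(det Q) + <Q*, Q - P>, with Q* = beta1(det Q) Q^{-1}
    q_star = [[beta1(det(Q)) * e for e in row] for row in inv(Q)]
    diff = [[Q[i][j] - P[i][j] for j in range(2)] for i in range(2)]
    return V(det(P)) - V(det(Q)) + tr(mul(q_star, diff))

GAMMA = 0.3   # assumed exponent, GAMMA < 1/2
def V_beta(s): return (1.0 - s ** GAMMA) / GAMMA
def beta1_beta(s): return -s ** GAMMA
def V_log(s): return -math.log(s)
def beta1_log(s): return -1.0

P = [[2.0, 0.5], [0.5, 1.0]]
Q = [[1.0, -0.3], [-0.3, 3.0]]

d_beta = divergence(P, Q, V_beta, beta1_beta)
d_self = divergence(P, P, V_beta, beta1_beta)
d_log = divergence(P, Q, V_log, beta1_log)
closed_form = tr(mul(inv(Q), P)) - math.log(det(P) / det(Q)) - 2.0
```

Both divergences are positive for P ≠ Q and vanish for P = Q, in line with (2.2).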

2.4 Dual Affine Connection Induced from V-Potential

Let ∇ be the canonical flat affine connection on Sym(n, R). To discuss the dually flat
structure on PD(n, R), regarding g^(V) as a positive definite Riemannian metric, we
derive the dual connection *∇^(V) with respect to g^(V) introduced in Sect. 2.2.
Consider a smooth curve ω = {P_t | −ξ < t < ξ} in PD(n, R) satisfying

$$ (P_t)_{t=0} = P \in PD(n, \mathbb{R}), \qquad \left( \frac{dP_t}{dt} \right)_{t=0} = X \in T_P PD(n, \mathbb{R}). $$

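Lemma 1 The differential of the gradient mapping grad α^(V) is

$$ (\mathrm{grad}\,\alpha^{(V)})_* : X \mapsto \beta_2(\det P)\, \mathrm{tr}(P^{-1} X) P^{-1} - \beta_1(\det P) P^{-1} X P^{-1}. $$

Proof Differentiating the Legendre transform P_t* = grad α^(V)(P_t) along the curve ω,
we obtain

$$ \begin{aligned} \frac{dP_t^*}{dt} &= \frac{d\beta_1(\det P_t)}{dt} P_t^{-1} - \beta_1(\det P_t) P_t^{-1} \frac{dP_t}{dt} P_t^{-1} \\ &= \beta_1'(\det P_t) \frac{d \det P_t}{dt} P_t^{-1} - \beta_1(\det P_t) P_t^{-1} \frac{dP_t}{dt} P_t^{-1} \\ &= \beta_1'(\det P_t) \det P_t \, \mathrm{tr}\!\left( P_t^{-1} \frac{dP_t}{dt} \right) P_t^{-1} - \beta_1(\det P_t) P_t^{-1} \frac{dP_t}{dt} P_t^{-1}, \end{aligned} $$

where β_1′ = dβ_1/ds. Since β_1′(s)s = β_2(s), evaluating both sides at t = 0 gives the
statement. □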
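Lemma 1 can be cross-checked by differentiating the gradient mapping (2.6) numerically. The following sketch does so for 2 × 2 matrices, assuming the potential V(s) = (1 − s^γ)/γ with γ = 0.3. The helper names, the matrices and the step size are hypothetical.

```python
GAMMA = 0.3   # assumed exponent, GAMMA < 1/2

def beta1(s): return -s ** GAMMA
def beta2(s): return -GAMMA * s ** GAMMA

def det(A): return A[0][0] * A[1][1] - A[0][1] * A[1][0]
def tr(A): return A[0][0] + A[1][1]
def inv(A):
    d = det(A)
    return [[A[1][1] / d, -A[0][1] / d], [-A[1][0] / d, A[0][0] / d]]
def mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]
def axpy(A, c, B):
    # entrywise A + c * B
    return [[A[i][j] + c * B[i][j] for j in range(2)] for i in range(2)]

def grad_potential(P):
    # (2.6): P* = beta1(det P) P^{-1}
    s = det(P)
    return [[beta1(s) * e for e in row] for row in inv(P)]

def grad_differential(P, X):
    # Lemma 1: beta2 tr(P^-1 X) P^-1 - beta1 P^-1 X P^-1
    s, Pi = det(P), inv(P)
    PiX = mul(Pi, X)
    PiXPi = mul(PiX, Pi)
    return [[beta2(s) * tr(PiX) * Pi[i][j] - beta1(s) * PiXPi[i][j]
             for j in range(2)] for i in range(2)]

def grad_differential_fd(P, X, h=1e-6):
    # central difference of the gradient mapping along the direction X
    plus = grad_potential(axpy(P, h, X))
    minus = grad_potential(axpy(P, -h, X))
    return [[(plus[i][j] - minus[i][j]) / (2.0 * h) for j in range(2)]
            for i in range(2)]

P = [[2.0, 0.5], [0.5, 1.0]]
X = [[1.0, 0.2], [0.2, -0.5]]

exact = grad_differential(P, X)
approx = grad_differential_fd(P, X)
max_error = max(abs(exact[i][j] - approx[i][j]) for i in range(2) for j in range(2))
```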


Theorem 2 Let Π_t denote the parallel shift operator of the connection *∇^(V). Then
the parallel shift Π_t(Y) of the tangent vector Y (= Π_0(Y)) ∈ T_P PD(n, R) along the
curve ω satisfies

$$ \left( \frac{d\Pi_t(Y)}{dt} \right)_{t=0} = X P^{-1} Y + Y P^{-1} X + \Phi(X, Y, P) + \Phi^{\perp}(X, Y, P), $$

where

$$ \Phi(X, Y, P) = \frac{\beta_2(s)\, \mathrm{tr}(P^{-1} X)}{\beta_1(s)} Y + \frac{\beta_2(s)\, \mathrm{tr}(P^{-1} Y)}{\beta_1(s)} X, \tag{2.11} $$

$$ \Phi^{\perp}(X, Y, P) = \eta P, \tag{2.12} $$

$$ \eta = \frac{\{ \beta_3(s)\beta_1(s) - 2\beta_2^2(s) \}\, \mathrm{tr}(P^{-1} X)\, \mathrm{tr}(P^{-1} Y) + \beta_2(s)\beta_1(s)\, \mathrm{tr}(P^{-1} X P^{-1} Y)}{\beta_1(s)\{ \beta_1(s) - n\beta_2(s) \}} $$

and s = det P.

The proof can be found in Appendix B.

Remark 2 The denominator β_1(s){β_1(s) − nβ_2(s)} in η is always positive because
(2.5) is assumed: β_1(s) < 0, and β_2(s)/β_1(s) < 1/n then gives β_1(s) − nβ_2(s) < 0.

Corollary 1 Let E_i be the matrix representation of the natural basis vector fields
∂/∂x^i described at the beginning of Sect. 2.3. Then, their covariant derivatives at
P defined by the dual connection *∇^(V) have the following matrix representations:

$$ \left( {}^*\nabla^{(V)}_{\partial/\partial x^i} \frac{\partial}{\partial x^j} \right)_P = -E_i P^{-1} E_j - E_j P^{-1} E_i - \Phi(E_i, E_j, P) - \Phi^{\perp}(E_i, E_j, P). $$

Proof Recall the identification (∂/∂x^i)_P = E_i for any P ∈ PD(n, R). Then the
statement follows from Theorem 2 and the definition of the covariant derivatives,
i.e.,

$$ \left( {}^*\nabla^{(V)}_{\partial/\partial x^i} \frac{\partial}{\partial x^j} \right)_P = \left( \frac{d\,\Pi_{-t}\big( (\partial/\partial x^j)_{P_t} \big)}{dt} \right)_{t=0}, \qquad P_t \in PD(n, \mathbb{R}), \quad P = P_0. $$

This completes the proof. □

Remark 3 From Corollary 1 we observe that both of the connections ∇ and *∇^(V)
are SL(n, R)-invariant in general, i.e.,

$$ \varphi_{G*} \left( \nabla_X Y \right)_P = \left( \nabla_{X'} Y' \right)_{P'}, \qquad \varphi_{G*} \left( {}^*\nabla^{(V)}_X Y \right)_P = \left( {}^*\nabla^{(V)}_{X'} Y' \right)_{P'} $$

holds for any G ∈ SL(n, R), where P′ = φ_G P, X′ = φ_G* X and Y′ = φ_G* Y. In
particular, both of the connections induced from the power potential α^(V), defined via
V(s) = c_1 + c_2 s^γ with real constants c_1, c_2 and γ, are GL(n, R)-invariant. In
addition, so is the orthogonality with respect to g^(V). Hence, we conclude that both the
∇- and *∇^(V)-projections [1, 2] are GL(n, R)-invariant for the power potentials, whereas
g^(V) itself is not.
The power potential function α^(V) with the normalizing conditions V(1) = 0 and
β_1(1) = −1, i.e., V(s) = (1 − s^γ)/γ, is called the beta potential. In this case,
β_1(s) = −s^γ, β_2(s) = −γs^γ, β_3(s) = −γ²s^γ and γ^(V)(s) = γ. Note that letting γ tend to
zero leads to V(s) = −log s, which recovers the standard dualistic geometry induced
by the logarithmic characteristic function α^(−log)(P) = −log det P on PD(n, R)
[34]. See [7, 33] for a detailed discussion of the power potential function.
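The limit γ → 0 of the beta potential can be observed numerically. The short sketch below evaluates V(s) = (1 − s^γ)/γ at a very small γ, using `math.expm1` to avoid cancellation, and compares it with −log s. The sample values of s and the choice γ = 1e-6 are assumptions for illustration.

```python
import math

def beta_potential(s, gamma):
    # V(s) = (1 - s**gamma) / gamma, written with expm1 for accuracy at small gamma
    return -math.expm1(gamma * math.log(s)) / gamma

samples = [0.2, 1.0, 3.0, 10.0]
gaps = [abs(beta_potential(s, 1e-6) - (-math.log(s))) for s in samples]
largest_gap = max(gaps)
value_at_one = beta_potential(1.0, 0.3)   # normalization V(1) = 0
```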

2.5 Foliated Structures and Their Geometry


When V(s) = −k log s for a positive constant k, the geometry induced
by the corresponding V-potential function is known to be standard in several senses [2, 34].
In particular, it is invariant under the automorphism group action of φ_G with
G ∈ GL(n, R). On the other hand, for general V(s) satisfying (2.5), we have observed
in the previous sections that the dualistic structure (g^(V), ∇, *∇^(V)) on PD(n, R) is
invariant under the action of φ_G only when G ∈ SL(n, R).
In this section we investigate the SL(n, R)-invariant geometry on
PD(n, R) induced by V-potentials.

2.5.1 Foliated Structures

Define a foliated structure of PD(n, R), parametrized by a real number s, as follows (Fig. 2.1):

$$ PD(n, \mathbb{R}) = \bigcup_{s > 0} L_s, \qquad L_s = \{ P \mid P > 0, \ \det P = s \}. $$

Fig. 2.1 Foliated structures of PD(n, R)

The tangent vector space of L_s at P is characterized as

$$ T_P L_s = \{ X \mid \mathrm{tr}(P^{-1} X) = 0, \ X = X^T \} = \{ P^{1/2} Y P^{1/2} \mid \mathrm{tr}\, Y = 0, \ Y = Y^T \}. $$

Denote by R_P the ray through P in the convex cone PD(n, R), i.e., R_P = {Q | Q =
κP, 0 < κ ∈ R}. Another foliated structure we consider is

$$ PD(n, \mathbb{R}) = \bigcup_{P \in L_s} R_P. $$

Proposition 2 Let P be in L_s. For any function V satisfying (2.5), the leaf L_s is
orthogonal to the ray R_P at P with respect to g^(V).

Proof The tangent vectors of R_P are just P multiplied by scalars. Hence for
X ∈ T_P L_s we have

$$ g^{(V)}_P(X, P) = -\beta_1(s)\, \mathrm{tr}(P^{-1} X) + n\beta_2(s)\, \mathrm{tr}(P^{-1} X) = 0. $$

This completes the proof. □
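The orthogonality in Proposition 2 can be illustrated for n = 2. The sketch below projects an arbitrary symmetric matrix X0 onto the tangent space of the leaf by removing its component along P, and evaluates the metric against the ray direction P. The potential V(s) = (1 − s^γ)/γ with γ = 0.3 and the matrices are assumptions for this example.

```python
GAMMA = 0.3   # assumed exponent, GAMMA < 1/2

def beta1(s): return -s ** GAMMA
def beta2(s): return -GAMMA * s ** GAMMA

def det(A): return A[0][0] * A[1][1] - A[0][1] * A[1][0]
def tr(A): return A[0][0] + A[1][1]
def inv(A):
    d = det(A)
    return [[A[1][1] / d, -A[0][1] / d], [-A[1][0] / d, A[0][0] / d]]
def mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def metric(P, X, Y):
    # g_P(X, Y) = -beta1 tr(P^-1 X P^-1 Y) + beta2 tr(P^-1 X) tr(P^-1 Y)
    s, Pi = det(P), inv(P)
    A, B = mul(Pi, X), mul(Pi, Y)
    return -beta1(s) * tr(mul(A, B)) + beta2(s) * tr(A) * tr(B)

n = 2
P = [[2.0, 0.5], [0.5, 1.0]]
X0 = [[1.0, 0.2], [0.2, -0.5]]

# remove the ray component so that tr(P^-1 X) = 0, i.e. X is tangent to the leaf
shift = tr(mul(inv(P), X0)) / n
X = [[X0[i][j] - shift * P[i][j] for j in range(n)] for i in range(n)]

tangent_trace = tr(mul(inv(P), X))
g_tangent = metric(P, X, P)      # vanishes: leaf direction vs. ray direction
g_generic = metric(P, X0, P)     # generally nonzero for a non-tangent direction
```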

Proposition 3 Every ray R_P is simultaneously a ∇- and a *∇^(V)-geodesic for an
arbitrary function V satisfying (2.5).

Proof The statement follows from the fact that R_P and its Legendre transform {Q* | Q* =
κ*P*, 0 < κ* ∈ R} are, respectively, straight lines with respect to the affine
coordinate systems of ∇ and *∇^(V). □

We know that the mean of dual connections is metric and torsion-free [2], i.e.,

$$ \hat{\nabla}^{(V)} = \frac{1}{2} \left( \nabla + {}^*\nabla^{(V)} \right) \tag{2.13} $$

is the Levi-Civita connection for g^(V). Proposition 3 implies that every R_P is always
a ∇̂^(V)-geodesic for any V satisfying (2.5). On the other hand, we have the following:
Proposition 4 Each leaf L_s is autoparallel with respect to the Levi-Civita
connection ∇̂^(V) if and only if β_2(s) = 0, i.e., V(s) = −k log s for a positive constant k.
The proof can be found in Appendix C.

2.5.2 Geometry of Submanifolds

Next, we further investigate the geometry of the submanifolds L_s. Let (g̃^(V), ∇̃,
*∇̃^(V)) be the dualistic structure on L_s induced from (g^(V), ∇, *∇^(V)) on PD(n, R).
Since each transformation φ_G with G ∈ SL(n, R) is an isometry of g̃^(V) and an
element of the automorphism group acting on L_s transitively, the Riemannian manifold
(L_s, g̃^(V)) is a homogeneous space. The isotropy group at κI consists of φ_G with
G ∈ SO(n). Hence L_s can be identified with SL(n, R)/SO(n). The following result
slightly extends Proposition 6 in [38], which treated the case of the standard
potential V(s) = −log s:
Proposition 5 Each manifold (L_s, g̃^(V)) is a globally Riemannian symmetric space
for any V satisfying (2.5).
Proof Define the mapping ν_s : L_s → L_s by

$$ \nu_s P = -\left( \frac{s}{s^*} \right)^{1/n} \mathrm{grad}\,\alpha^{(V)}(P) = s^{2/n} P^{-1}, \qquad s^* = \frac{(-\beta_1(s))^n}{s}. $$

It is seen that ν_s is an involutive isometry of (L_s, g̃^(V)) with the isolated fixed point
κI ∈ L_s, where κ = s^{1/n}. In particular, −grad α^(V)(P) is the involutive isometry of
(L_s, g̃^(V)) when s = s*. Since the isometries φ_G with G ∈ SL(n, R) act transitively
on L_s, each manifold (L_s, g̃^(V)) is globally Riemannian symmetric [12]. □
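The elementary properties of the mapping ν_s used in the proof can be verified numerically for n = 2. The sketch below checks that ν_s preserves the determinant, is an involution, agrees with the scaled gradient form, and fixes κI with κ = s^{1/n}. It assumes β_1(s) = −s^γ with γ = 0.3, and the helper names are hypothetical. The isometry property itself is not tested here.

```python
GAMMA = 0.3   # assumed exponent, GAMMA < 1/2
n = 2

def beta1(s): return -s ** GAMMA

def det(A): return A[0][0] * A[1][1] - A[0][1] * A[1][0]
def inv(A):
    d = det(A)
    return [[A[1][1] / d, -A[0][1] / d], [-A[1][0] / d, A[0][0] / d]]

def nu(P):
    # nu_s P = s^(2/n) P^{-1} with s = det P
    scale = det(P) ** (2.0 / n)
    return [[scale * e for e in row] for row in inv(P)]

def nu_via_gradient(P):
    # -(s / s*)^(1/n) grad alpha(P), with grad alpha(P) = beta1(s) P^{-1}
    # and s* = (-beta1(s))^n / s
    s = det(P)
    s_star = (-beta1(s)) ** n / s
    factor = -((s / s_star) ** (1.0 / n)) * beta1(s)
    return [[factor * e for e in row] for row in inv(P)]

def max_gap(A, B):
    return max(abs(A[i][j] - B[i][j]) for i in range(n) for j in range(n))

P = [[2.0, 0.5], [0.5, 1.0]]
image = nu(P)
back = nu(image)

kappa = det(P) ** (1.0 / n)
K = [[kappa, 0.0], [0.0, kappa]]   # the point kappa * I on the same leaf

det_gap = abs(det(image) - det(P))
involution_gap = max_gap(back, P)
form_gap = max_gap(image, nu_via_gradient(P))
fixed_gap = max_gap(nu(K), K)
```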

Proposition 6 For any V(s) satisfying (2.5), the following statements hold:
(i) The metric g̃^(V) on L_s is given by g̃^(V) = −β_1(s) g̃^(−log),
(ii) For P, Q ∈ L_s, it holds that D^(V)(P, Q) = −β_1(s) D^(−log)(P, Q),
(iii) The induced connections *∇̃^(V) on L_s are actually independent of V(s), i.e.,
*∇̃^(V) = *∇̃^(−log) for all V(s) (see the remark below).

The proof can be found in Appendix D.
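Statement (ii) can be checked numerically by taking two points on the same leaf. The sketch below uses the divergence formula (2.10) with the pairing tr(AB), the potential V(s) = (1 − s^γ)/γ with γ = 0.3, and 2 × 2 matrices, where rescaling Q0 by a scalar c changes its determinant by c². All of these choices are assumptions for illustration.

```python
import math

GAMMA = 0.3   # assumed exponent, GAMMA < 1/2

def V_beta(s): return (1.0 - s ** GAMMA) / GAMMA
def beta1_beta(s): return -s ** GAMMA
def V_log(s): return -math.log(s)
def beta1_log(s): return -1.0

def det(A): return A[0][0] * A[1][1] - A[0][1] * A[1][0]
def tr(A): return A[0][0] + A[1][1]
def inv(A):
    d = det(A)
    return [[A[1][1] / d, -A[0][1] / d], [-A[1][0] / d, A[0][0] / d]]
def mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def divergence(P, Q, V, beta1):
    # (2.10): D(P, Q) = V(det P) - V(det Q) + <Q*, Q - P>, with Q* = beta1(det Q) Q^{-1}
    q_star = [[beta1(det(Q)) * e for e in row] for row in inv(Q)]
    diff = [[Q[i][j] - P[i][j] for j in range(2)] for i in range(2)]
    return V(det(P)) - V(det(Q)) + tr(mul(q_star, diff))

P = [[2.0, 0.5], [0.5, 1.0]]
Q0 = [[1.0, -0.3], [-0.3, 3.0]]
s = det(P)

# rescale Q0 onto the leaf L_s: for n = 2, det(c * Q0) = c^2 det(Q0)
c = math.sqrt(s / det(Q0))
Q = [[c * e for e in row] for row in Q0]

lhs = divergence(P, Q, V_beta, beta1_beta)
rhs = -beta1_beta(s) * divergence(P, Q, V_log, beta1_log)
```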


Remark 4 Since ∇̃ is common, Proposition 6 demonstrates that the V-potential does
not deform the submanifold geometry of L_s except for a multiplicative constant in g̃^(V).
Hence, we simply write ∇̃* instead of *∇̃^(−log) from now on.

2.5.3 Curvature of Each Leaf and Decomposition of Divergence

In [11, 40, 43, 44], level surfaces of the potential function in a Hessian domain were
studied via affine hypersurface theory [29]. Since the statistical submanifold
(L_s, ∇̃, g̃^(V)) is a level surface of α^(V) in the Hessian domain (PD(n, R), ∇, g^(V)), the
general results in the literature can be applied. The immediate consequences in
our context are summarized as follows.
Consider the gradient vector field E of α^(V) and a scaled gradient vector field N,
respectively defined by

$$ g^{(V)}(X, E) = d\alpha^{(V)}(X), \qquad \forall X \in \mathcal{X}(PD(n, \mathbb{R})) $$

and

$$ N = -\frac{1}{d\alpha^{(V)}(E)} E. $$

A straightforward calculation shows that they are respectively represented at each
P ∈ PD(n, R) by

$$ E = \frac{\beta_1(\det P)}{n\beta_2(\det P) - \beta_1(\det P)} P, \qquad N = -\frac{1}{n\beta_1(\det P)} P. \tag{2.14} $$

Using N as the transversal vector field, the results in [11, 43] demonstrate that
the geometry (L_s, ∇̃, g̃^(V)) coincides with the one realized by the affine immersion
[29] of L_s into Sym(n, R) with the canonical flat connection ∇:

$$ \nabla_X Y = \tilde{\nabla}_X Y + h(X, Y) N, \tag{2.15} $$

$$ \nabla_X N = -A(X) + \varphi(X) N. \tag{2.16} $$

Here h and φ are, respectively, the affine fundamental form and the transversal
connection form, satisfying the relation

$$ h = \tilde{g}^{(V)}, \qquad \varphi = 0, \tag{2.17} $$

for all X, Y ∈ X(L_s) at each P ∈ L_s. The affine shape operator A is determined in
the proof of Theorem 3. The condition (2.17) shows that this affine immersion is
non-degenerate and equiaffine. In addition, the form of N in (2.14) demonstrates that the
immersion is centro-affine [29], which implies that a ∇̃-geodesic on L_s is the intersection
of L_s and a two-dimensional subspace in Sym(n, R).
Further, via Propositions 2 and 3, the following results are proved for the level
surface L_s of α^(V) (Proposition 8 can also be verified by direct calculation):
Proposition 7 ([43]) The statistical submanifold (L_s, ∇̃, g̃^(V)) is 1-conformally flat.
Proposition 8 ([44]) Let points P and Q be in L_s, and R be in R_Q, s.t. R = κQ, 0 <
κ ∈ R. Then

$$ D^{(V)}(P, R) = \mu D^{(V)}(P, Q) + D^{(V)}(Q, R) \tag{2.18} $$

holds for μ ∈ R satisfying R* = μQ*, i.e., μ = κ^{-1} β_1(det R)/β_1(det Q) > 0.

Using these preliminaries, we have the main result in this section as follows:

Theorem 3 The following two statements hold:
(i) The statistical submanifold (L_s, ∇̃, g̃^(V)) is ±1-conformally flat.
(ii) The statistical submanifold (L_s, ∇̃, g̃^(V)) has negative constant curvature
k_s = 1/(β_1(s)n).

Proof (i) Note that L_s is a level surface of not only α^(V)(P) but also α^(V)*(P*)
in (2.9), which implies that both (L_s, ∇̃, g̃^(V)) and (L_s, ∇̃*, g̃^(V)) are 1-conformally
flat by Proposition 7. By the duality of τ-conformal equivalence, they are also
−1-conformally flat.
(ii) For arbitrary X ∈ X(L_s), the equalities

$$ \nabla_X N = -\frac{1}{\beta_1(s) n} \nabla_X P = -\frac{1}{\beta_1(s) n} X $$

hold at each P ∈ L_s. The second equality holds because ∇ is the canonical flat
connection of the vector space Sym(n, R). Comparing this equation with (2.16)
and (2.17), we have A = k_s I. By substituting this into the Gauss equation for the
affine immersion [29],

$$ R(X, Y)Z = h(Y, Z) A(X) - h(X, Z) A(Y), $$

we see that (L_s, ∇̃, g̃^(V)) is of constant curvature k_s. □

Remark 5 For a general statistical manifold M with dim M ≥ 3, the equivalence of
the statements (i) and (ii) in the above theorem is known [2, 19]. The theorem claims
that the equivalence holds even when n = 2, i.e., dim L_s = 2.

Kurose proved that a modified form of the Pythagorean relation holds for canonical
divergences on statistical manifolds with constant curvature [20]. As a consequence
of (ii) in Theorem 3, his result holds on each L_s.

Proposition 9 ([20]) Suppose that points P, Q and R are in a statistical submanifold
L_s with constant curvature k_s. Let ω̃ be the ∇̃-geodesic joining P and Q, and ω̃* the
∇̃*-geodesic joining Q and R. If ω̃ and ω̃* are mutually orthogonal at Q with respect
to g̃^(V), the divergence D^(V) satisfies

$$ D^{(V)}(P, R) = D^{(V)}(P, Q) + D^{(V)}(Q, R) - k_s D^{(V)}(P, Q) D^{(V)}(Q, R). \tag{2.19} $$

The −1-conformal flatness of (L_s, ∇̃, g̃^(V)) in Theorem 3 ensures another relation
that is dual to Proposition 8:

Lemma 2 Suppose that three points P, Q and R and the parameter κ meet the same
assumptions as in Proposition 8. Then, the following relation holds:

$$ D^{(V)}(R, P) = D^{(V)}(R, Q) + \kappa D^{(V)}(Q, P). \tag{2.20} $$

Proof By direct calculation the right-hand side of (2.20) can be rewritten as

$$ \begin{aligned} & V(\det R) - V(\det Q) + \langle Q^*, Q - R \rangle + \kappa \langle P^*, P - Q \rangle \\ &= V(\det R) - V(\det P) + \beta_1(s) n - \langle Q^*, \kappa Q \rangle + \kappa \beta_1(s) n - \kappa \langle P^*, Q \rangle \\ &= V(\det R) - V(\det P) + \beta_1(s) n - \langle P^*, R \rangle = D^{(V)}(R, P). \end{aligned} $$

□
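Both decompositions, (2.18) of Proposition 8 and (2.20) of Lemma 2, can be verified by direct computation. The sketch below does this for 2 × 2 matrices with P and Q on a common leaf and R = κQ, using the divergence formula (2.10), the pairing tr(AB) and the potential V(s) = (1 − s^γ)/γ with γ = 0.3. The matrices and κ = 1.7 are arbitrary illustrative choices.

```python
import math

GAMMA = 0.3   # assumed exponent, GAMMA < 1/2

def V(s): return (1.0 - s ** GAMMA) / GAMMA
def beta1(s): return -s ** GAMMA

def det(A): return A[0][0] * A[1][1] - A[0][1] * A[1][0]
def tr(A): return A[0][0] + A[1][1]
def inv(A):
    d = det(A)
    return [[A[1][1] / d, -A[0][1] / d], [-A[1][0] / d, A[0][0] / d]]
def mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]
def scale(c, A):
    return [[c * e for e in row] for row in A]

def D(P, Q):
    # (2.10): D(P, Q) = V(det P) - V(det Q) + <Q*, Q - P>, with Q* = beta1(det Q) Q^{-1}
    q_star = scale(beta1(det(Q)), inv(Q))
    diff = [[Q[i][j] - P[i][j] for j in range(2)] for i in range(2)]
    return V(det(P)) - V(det(Q)) + tr(mul(q_star, diff))

P = [[2.0, 0.5], [0.5, 1.0]]
Q0 = [[1.0, -0.3], [-0.3, 3.0]]
Q = scale(math.sqrt(det(P) / det(Q0)), Q0)   # same leaf as P (n = 2)

kappa = 1.7
R = scale(kappa, Q)                          # a point on the ray through Q
mu = beta1(det(R)) / (kappa * beta1(det(Q)))

gap_218 = abs(D(P, R) - (mu * D(P, Q) + D(Q, R)))       # Proposition 8
gap_220 = abs(D(R, P) - (D(R, Q) + kappa * D(Q, P)))    # Lemma 2
```

Note that the weight in (2.20) is the scaling factor κ itself, whereas (2.18) involves μ, which also depends on β_1.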

Remark 6 By duality, the −1-conformal flatness of (L_s, ∇̃, g̃^(V)) implies the
1-conformal flatness of (L_s, ∇̃*, g̃^(V)). Hence, introducing the dual divergence [2] defined by

$$ {}^*D^{(V)}(P, Q) = D^{(V)}(Q, P) = \alpha^{(V)*}(P^*) + \alpha^{(V)}(Q) - \langle P^*, Q \rangle, $$

we see that the following dual result to Proposition 8 with respect to (L_s, ∇̃*, g̃^(V)),

$$ {}^*D^{(V)}(P, R) = {}^*D^{(V)}(Q, R) + \kappa\, {}^*D^{(V)}(P, Q), $$

alternatively proves the above lemma.


Now we have the following new decomposing relation of the divergence for
$PD(n, \mathbb{R})$ and its submanifold $L_s$ (Fig. 2.2), which is an immediate consequence of
Proposition 9 and Lemma 2:

Proposition 10 Suppose that $P$ is in $PD(n, \mathbb{R})$, and that $R$ and $S$ are on $L_s$. Let $Q$ be the
point uniquely determined by $P$ as $Q \in L_s \cap \mathbb{R}P$. If the $\tilde{\nabla}$-geodesic $\tilde{\omega}$ joining $Q$ and $R$
and the $\tilde{\nabla}^*$-geodesic $\tilde{\omega}^*$ joining $R$ and $S$ are mutually orthogonal at $R$ with respect
to $\tilde{g}^{(V)}$, then the divergence $D^{(V)}$ satisfies

$$ D^{(V)}(P, S) = D^{(V)}(P, R) + \kappa' D^{(V)}(R, S), \qquad \kappa' = \kappa\{1 - k_s D^{(V)}(Q, R)\}, \tag{2.21} $$

where the constant $\kappa > 0$ is defined by $Q = \kappa P$.


Proof From Lemma 2 and Proposition 9, we have the following relations:

$$
\begin{aligned}
D^{(V)}(P, S) &= D^{(V)}(P, Q) + \kappa D^{(V)}(Q, S), \\
D^{(V)}(P, R) &= D^{(V)}(P, Q) + \kappa D^{(V)}(Q, R), \\
D^{(V)}(Q, S) &= D^{(V)}(Q, R) + D^{(V)}(R, S) - k_s D^{(V)}(Q, R) D^{(V)}(R, S).
\end{aligned}
$$

Using these equations, we see the statement holds.
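
The elimination behind the last step can be written out explicitly. The following worked computation is only a sketch that substitutes the third relation into the first and then uses the second; it introduces no quantities beyond those in the proof.

```latex
\begin{aligned}
D^{(V)}(P,S)
  &= D^{(V)}(P,Q) + \kappa\, D^{(V)}(Q,S) \\
  &= D^{(V)}(P,Q) + \kappa\,\bigl\{ D^{(V)}(Q,R) + D^{(V)}(R,S)
       - k_s D^{(V)}(Q,R)\, D^{(V)}(R,S) \bigr\} \\
  &= \bigl\{ D^{(V)}(P,Q) + \kappa\, D^{(V)}(Q,R) \bigr\}
       + \kappa\,\bigl\{ 1 - k_s D^{(V)}(Q,R) \bigr\}\, D^{(V)}(R,S) \\
  &= D^{(V)}(P,R) + \kappa\,\bigl\{ 1 - k_s D^{(V)}(Q,R) \bigr\}\, D^{(V)}(R,S).
\end{aligned}
```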


Owing to the negativity of the curvature $k_s$, we have the following application of
Proposition 10 to the minimality condition of $D^{(V)}(\cdot, \cdot)$ measured from a point
$P \in PD(n, \mathbb{R})$ to a certain submanifold in $L_s$.

Fig. 2.2 Decomposition of divergences when $R$ and $S$ are on a leaf $L_s$ and $\tilde{\omega}$ and $\tilde{\omega}^*$ are orthogonal at $R$

Corollary 2 Let $P$ be a point in $PD(n, \mathbb{R})$ and let $N$ be a $\tilde{\nabla}^*$-autoparallel submanifold
in $L_s$. A point $R$ in $N$ satisfies $D^{(V)}(P, R) = \min_{S \in N} D^{(V)}(P, S)$ if and only if
the $\tilde{\nabla}$-geodesic connecting $Q = \kappa P \in \mathbb{R}P \cap L_s$ and $R$ is orthogonal to $N$ at $R$.

Proof For an arbitrary point $S \in N$, it follows from Proposition 10 that

$$ D^{(V)}(P, S) = D^{(V)}(P, R) + \kappa\{1 - k_s D^{(V)}(Q, R)\} D^{(V)}(R, S), \tag{2.22} $$

where $Q$ and $\kappa$ are uniquely defined by $Q \in \mathbb{R}P \cap L_s$ and $Q = \kappa P$.
Since $1 - k_s D^{(V)}(Q, R)$ and $\kappa$ are both positive constants, $D^{(V)}(P, S)$ takes
the minimum value $D^{(V)}(P, R)$ if and only if $D^{(V)}(R, S) = 0$, i.e., $S = R$.

Dual results for ∗ D(V ) to the above ones also hold.

2.6 Dualistic Geometries on U-Model and Positive Definite Matrices

We explore a close relation between the dualistic geometries induced from the
U-divergence and the V-potential. In the field of statistical inference, the well-established
method is the maximum likelihood method, which is based on the Kullback-Leibler
divergence.
To improve the robustness of the method while maintaining its theoretical
advantages, such as efficiency, methods of minimizing general divergences have
recently been proposed as alternatives to the maximum likelihood method, in robust
statistical analysis for pattern recognition, machine learning, principal component
analysis and so on [4, 6, 13, 22, 24, 41].
For example, the beta-divergence

$$ D_\gamma(f, g) = \int \left[ \frac{g(x)^{\gamma+1} - f(x)^{\gamma+1}}{\gamma+1} - \frac{f(x)\{g(x)^{\gamma} - f(x)^{\gamma}\}}{\gamma} \right] dx \tag{2.23} $$

is utilized in the literature. As $\gamma$ goes to 0, it reduces to the Kullback-Leibler
divergence; on the other hand, as $\gamma$ goes to 1, it reduces to the squared $L^2$-distance
(up to the factor 1/2). Thus the efficiency increases as $\gamma$ goes to 0, while the
robustness increases as $\gamma$ goes to 1 [5, 39]. In this sense we could find an appropriate
$\gamma$ between 0 and 1 as a trade-off between efficiency and robustness. The
beta-divergence is strongly connected to the Tsallis entropy [42].
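
The two limiting cases can be checked numerically. The following is a minimal sketch, assuming NumPy and two univariate Gaussian densities chosen purely for illustration; the function names `beta_divergence` and `gaussian_pdf`, the grid and the Riemann-sum quadrature are assumptions of this example and are not taken from the text.

```python
import numpy as np


def beta_divergence(f, g, gamma, dx):
    """Riemann-sum approximation of the beta-divergence (2.23) on a uniform grid."""
    power_term = (g ** (gamma + 1) - f ** (gamma + 1)) / (gamma + 1)
    cross_term = f * (g ** gamma - f ** gamma) / gamma
    return float(np.sum(power_term - cross_term) * dx)


def gaussian_pdf(x, mean, std):
    """Density of a univariate normal distribution (illustrative test densities)."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))


# Uniform grid wide enough that both densities are negligible at the ends.
x = np.linspace(-12.0, 12.0, 24001)
dx = x[1] - x[0]
f = gaussian_pdf(x, 0.0, 1.0)
g = gaussian_pdf(x, 1.0, 1.5)

# Reference quantities computed on the same grid.
kl = float(np.sum(f * (np.log(f) - np.log(g))) * dx)
half_squared_l2 = float(0.5 * np.sum((g - f) ** 2) * dx)

d_near_zero = beta_divergence(f, g, 1e-4, dx)  # gamma close to 0
d_one = beta_divergence(f, g, 1.0, dx)         # gamma equal to 1

print(d_near_zero, kl)
print(d_one, half_squared_l2)
```

For small gamma the first pair of printed values should nearly agree, and at gamma = 1 the second pair should agree up to rounding, since the integrand then equals (g - f)^2 / 2 pointwise.
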
Let us discuss divergence functionals more generally.

Definition 2 [4] Let $U(s)$ be a smooth convex function with positive derivatives
$u(s) = U'(s)$ and $u'(s)$ on $\mathbb{R}$ or its (semi-infinite) interval, and let $\varepsilon$ be the inverse
function of $u$ there. If the following functional for two functions $f(x)$ and $g(x)$ on $\mathbb{R}^n$

$$ D_U(f, g) = \int \left\{ U(\varepsilon(g)) - U(\varepsilon(f)) - [\varepsilon(g) - \varepsilon(f)] f \right\} dx $$

exists and converges, we call it the U-divergence.

It follows that $D_U(f, g) \ge 0$ and $D_U(f, g) = 0$ if and only if $f = g$, because the
integrand $U(\varepsilon_g) - [U(\varepsilon_f) + u(\varepsilon_f)(\varepsilon_g - \varepsilon_f)]$, where $\varepsilon_f = \varepsilon(f)$ and $\varepsilon_g = \varepsilon(g)$, is
interpreted as the difference of the convex function $U$ and its supporting function.
While our U-divergence can be regarded as a dual expression [35] of the ordinary
Bregman divergence, this expression has proved more convenient than the ordinary
one in statistical inference from empirical data [4, 24]. If we set

$$ U(s) = \frac{1}{\gamma+1} (1 + \gamma s)^{(\gamma+1)/\gamma}, \qquad s > -1/\gamma, $$

then the corresponding U-divergence is the beta-divergence defined in (2.23).
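
The last claim can be verified by direct substitution. The worked computation below is a sketch that uses only Definition 2 and the power-type choice of U stated above.

```latex
\begin{aligned}
u(s) &= U'(s) = (1+\gamma s)^{1/\gamma}, \qquad
\varepsilon(t) = u^{-1}(t) = \frac{t^{\gamma}-1}{\gamma}, \qquad
U(\varepsilon(t)) = \frac{t^{\gamma+1}}{\gamma+1}, \\
D_U(f,g) &= \int \Bigl\{ U(\varepsilon(g)) - U(\varepsilon(f))
            - [\varepsilon(g)-\varepsilon(f)]\, f \Bigr\}\, dx \\
         &= \int \Bigl\{ \frac{g^{\gamma+1}-f^{\gamma+1}}{\gamma+1}
            - \frac{f\,(g^{\gamma}-f^{\gamma})}{\gamma} \Bigr\}\, dx
          = D_\gamma(f,g).
\end{aligned}
```
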
When we consider the family of functions parametrized by elements in a man-
ifold M, the U-divergence induces the dualistic structure on M in such a way as
Proposition 1. Here, we confine our attention to the family of multivariate probability
density functions specified by P in PD(n, R). The family is natural in the sense that
it is a dually flat statistical manifold with respect to the dualistic geometry induced
by the U-divergence.

Definition 3 [4] Let $U$ and $u$ be the functions given in Definition 2. The family of
elliptical distributions with the following density functions

$$ M_U = \left\{ f(x, P) = u\left( -\tfrac{1}{2} x^T P x - c_U(\det P) \right) \,\middle|\, P \in PD(n, \mathbb{R}) \right\} $$

is called the (zero-mean) U-model associated with the U-divergence. Here, we set
$f(x, P) = 0$ if the right-hand side of $f$ is undefined, and $c_U(\det P)$ is a normalizing
constant satisfying

$$ \int f(x, P)\, dx = (\det P)^{-1/2} \int u\left( -\tfrac{1}{2} y^T y - c_U(\det P) \right) dy = 1, $$

where we assume that the integral converges for the specified $P$.



For the case when the support of $f$ is bounded, a straightforward computation
shows that

$$ \int u\left( -\tfrac{1}{2} y^T y - c_U(\det P) \right) dy = \frac{\pi^{n/2}}{\Gamma(n/2)} \int_0^{\eta} u\left( -\frac{r}{2} - c_U(\det P) \right) r^{\frac{n}{2} - 1}\, dr, $$

where $\eta$ satisfies $u(-\eta/2 - c_U(\det P)) = 0$. Hence, the normalizing constant is given
by

$$ c_U(\det P) = \Gamma^{-1}_{\frac{n}{2}, u}\left( \frac{\Gamma(\frac{n}{2}) (\det P)^{1/2}}{\pi^{n/2}} \right), $$

with the inverse function of $\Gamma_{a,u}$, which is defined by

$$ \Gamma_{a,u}(c) = \int_0^{-2c - 2\varepsilon(0)} u\left( -\frac{r}{2} - c \right) r^{a-1}\, dr. $$

A similar argument is also valid for the case of unbounded supports. See [33] for
examples of the calculation.
Note that if the function $U(s)$ satisfies a sort of self-similarity [33], the density
function $f$ in the U-model can be expressed in the usual form of an elliptical
distribution [8, 23], i.e.,

$$ f(x, P) = c_f (\det P)^{1/2}\, u\left( -\tfrac{1}{2} x^T P x \right) $$

with a constant $c_f$.
Now we consider the correspondence between the dualistic geometry induced
by DU on the U-model and that on PD(n, R) induced by the V -potential function
discussed in the Sects. 2.3 and 2.4.

Theorem 4 Define the V-potential function $\alpha^{(V)}$ via

$$ V(s) = s^{-1/2} \int U\left( -\tfrac{1}{2} x^T x - c_U(s) \right) dx + c_U(s), \qquad s > 0. \tag{2.24} $$

Assume that $V$ satisfies the conditions (2.5). Then the dualistic structure
$(g^{(V)}, \nabla, {}^*\nabla^{(V)})$ on $PD(n, \mathbb{R})$ coincides with that on the U-model induced by the
U-divergence in such a way as Proposition 1.

The proof can be found in the Appendix E.


From the above theorem it follows that the geometry of U-model derived from U-
divergence is completely characterized by the geometry associated with V -potential
function defined via (2.24).
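
For orientation, consider the choice $U(s) = \exp(s)$, which is an illustrative assumption and not an example taken from the text. Then $u = U$, $\varepsilon = \log$, and the U-model is the zero-mean Gaussian family with precision matrix $P$. A short computation from Definition 3, (2.24) and the relation $\beta_1(s) = V'(s)s$ used in Appendix E suggests the following:

```latex
\begin{aligned}
f(x,P) &= \exp\!\bigl(-\tfrac12 x^T P x - c_U(\det P)\bigr), \qquad
c_U(s) = \tfrac{n}{2}\log(2\pi) - \tfrac12 \log s, \\
V(s) &= s^{-1/2}\int \exp\!\bigl(-\tfrac12 x^T x - c_U(s)\bigr)\,dx + c_U(s)
      = 1 + \tfrac{n}{2}\log(2\pi) - \tfrac12 \log s, \\
\beta_1(s) &= V'(s)\,s = -\tfrac12, \qquad
\beta_1(\det P)\,P^{-1} = -\tfrac12 P^{-1} = -\tfrac12 E_P(xx^T).
\end{aligned}
```

Under this assumption $V$ agrees with $-\log s$ up to a positive factor and an additive constant, and the last identity is the familiar Gaussian relation $E_P(xx^T) = P^{-1}$.
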

2.7 Conclusion

We have studied dualistic geometry on positive definite matrices induced from the
V -potential instead of the standard characteristic function.
First, we have derived the associated Riemannian metric, mutually dual affine con-
nections and the canonical divergence. The induced geometry is, in general, proved
SL(n, R)-invariant while it is GL(n, R)-invariant in the case of the characteristic
function (V (s) = − log s). However, when V is of the power form, it is shown that
orthogonality and a pair of mutually dual connections are GL(n, R)-invariant.
Next, we have investigated a foliated structure via the induced geometry. Each
leaf (the set of positive definite matrices with a constant determinant) is proved to
be a homogeneous statistical manifold with a constant curvature depending on V .
As a consequence, we have given a new decomposition relation for the canonical
divergence that would be useful for finding the nearest point on a specified leaf.
Finally, to apply the induced geometry to robust statistical inference we have
established a relation with the geometry of the U-model (or symmetric elliptical densities).
Further applications of such structures and investigation of other forms of
potential functions are left for future work.

Acknowledgments We thank the anonymous referees for their constructive comments and careful
checks of the original manuscript.

Appendices

A Proof of Theorem 1

It is observed that $-\beta_1(\det P) \neq 0$ on $PD(n, \mathbb{R})$ is necessary because the second
term is not positive definite. Hence, the Hessian can be represented as

$$
\begin{aligned}
g^{(V)}_P(X, Y) &= -\beta_1(\det P)\{\mathrm{tr}(P^{-1} X P^{-1} Y) - \gamma^{(V)}(\det P)\, \mathrm{tr}(P^{-1} X)\, \mathrm{tr}(P^{-1} Y)\} \\
&= -\beta_1(\det P)\, \mathrm{vec}^T(\tilde{X}) \left( I_{n^2} - \gamma^{(V)}(\det P)\, \mathrm{vec}(I_n)\, \mathrm{vec}^T(I_n) \right) \mathrm{vec}(\tilde{Y}).
\end{aligned}
$$

Here $\tilde{X} = P^{-1/2} X P^{-1/2}$, $\tilde{Y} = P^{-1/2} Y P^{-1/2}$, $\mathrm{vec}(\cdot)$ is the operator that maps
$A = (a_{ij}) \in \mathbb{R}^{n \times n}$ to $[a_{11}, \cdots, a_{n1}, a_{12}, \cdots, a_{n2}, \cdots, a_{1n}, \cdots, a_{nn}]^T \in \mathbb{R}^{n^2}$, and $I_n$
and $I_{n^2}$ denote the unit matrices of order $n$ and $n^2$, respectively. By congruently
transforming the matrix $I_{n^2} - \gamma^{(V)}(\det P)\, \mathrm{vec}(I_n)\, \mathrm{vec}^T(I_n)$ with a proper permutation
matrix, we see that the positive definiteness of $g^{(V)}$ is equivalent to $-\beta_1(\det P) > 0$
and

$$ I_n - \gamma^{(V)}(\det P)\, \mathbf{1}\mathbf{1}^T > 0, \qquad \text{where } \mathbf{1} = [1, 1, \cdots, 1]^T \in \mathbb{R}^n. $$

Let $W$ be an orthogonal matrix that has $\mathbf{1}/\sqrt{n}$ as the first column vector. Since
the following eigen-decomposition

$$ I_n - \gamma^{(V)}(\det P)\, \mathbf{1}\mathbf{1}^T = W \begin{pmatrix} 1 - n\gamma^{(V)}(\det P) & 0 & \cdots & 0 \\ 0 & 1 & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & 1 \end{pmatrix} W^T $$

holds, the conditions (2.5) are necessary and sufficient for positive definiteness of
$g^{(V)}$. Thus, the statement follows.
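
The eigen-decomposition above can be checked numerically. The following is a minimal sketch, assuming NumPy; the helper name `rank_one_shift_eigenvalues` and the scalar `c`, which stands in for the value of $\gamma^{(V)}(\det P)$, are illustrative and not part of the original text.

```python
import numpy as np


def rank_one_shift_eigenvalues(n, c):
    """Eigenvalues (ascending) of I_n - c * 1 1^T, with 1 the all-ones vector."""
    ones = np.ones((n, 1))
    matrix = np.eye(n) - c * (ones @ ones.T)
    # The matrix is symmetric, so eigvalsh applies and returns real eigenvalues.
    return np.linalg.eigvalsh(matrix)


n = 4
eig_pd = rank_one_shift_eigenvalues(n, 0.1)      # 1 - n*c = 0.6 > 0
eig_not_pd = rank_one_shift_eigenvalues(n, 0.3)  # 1 - n*c = -0.2 < 0

print(eig_pd)
print(eig_not_pd)
```

In both cases one eigenvalue equals 1 - n*c and the remaining n - 1 eigenvalues equal 1, so positive definiteness of this factor hinges on the sign of 1 - n*c alone.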

B Proof of Theorem 2

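Since the components of $P^*$ form an affine coordinate system for the connection ${}^*\nabla^{(V)}$, the
parallel shift $\pi_t(Y)$ along the curve $\omega$ satisfies

$$ (\mathrm{grad}\, \alpha^{(V)})_*(Y) = (\mathrm{grad}\, \alpha^{(V)})_*(\pi_t(Y)) $$

for any $t$.
From Lemma 1, this implies

$$ \frac{d}{dt} \left\{ \beta_2(\det P_t)\, \mathrm{tr}\{P_t^{-1} \pi_t(Y)\} P_t^{-1} - \beta_1(\det P_t) P_t^{-1} \pi_t(Y) P_t^{-1} \right\} = 0 $$

for any $t$ $(-\xi < t < \xi)$.
By calculating the left-hand side, we get

$$
\begin{aligned}
& \beta_3(s_t)\, \mathrm{tr}\left( P_t^{-1} \frac{dP_t}{dt} \right) \mathrm{tr}(P_t^{-1} \pi_t(Y)) P_t^{-1} - \beta_2(s_t)\, \mathrm{tr}\left( P_t^{-1} \frac{dP_t}{dt} P_t^{-1} \pi_t(Y) \right) P_t^{-1} \\
& + \beta_2(s_t)\, \mathrm{tr}\left( P_t^{-1} \frac{d\pi_t(Y)}{dt} \right) P_t^{-1} - \beta_2(s_t)\, \mathrm{tr}(P_t^{-1} \pi_t(Y)) P_t^{-1} \frac{dP_t}{dt} P_t^{-1} \\
& - \beta_2(s_t)\, \mathrm{tr}\left( P_t^{-1} \frac{dP_t}{dt} \right) P_t^{-1} \pi_t(Y) P_t^{-1} + \beta_1(s_t) P_t^{-1} \frac{dP_t}{dt} P_t^{-1} \pi_t(Y) P_t^{-1} \\
& - \beta_1(s_t) P_t^{-1} \frac{d\pi_t(Y)}{dt} P_t^{-1} + \beta_1(s_t) P_t^{-1} \pi_t(Y) P_t^{-1} \frac{dP_t}{dt} P_t^{-1} = 0,
\end{aligned}
$$

where $s_t = \det P_t$. If $t = 0$, then this equation implies that

$$
\begin{aligned}
& \beta_3(s)\, \mathrm{tr}(P^{-1} X)\, \mathrm{tr}(P^{-1} Y) P^{-1} - \beta_2(s)\, \mathrm{tr}(P^{-1} X P^{-1} Y) P^{-1} \\
& + \beta_2(s)\, \mathrm{tr}\left\{ P^{-1} \left( \frac{d\pi_t(Y)}{dt} \right)_{t=0} \right\} P^{-1} - \beta_2(s)\, \mathrm{tr}(P^{-1} Y) P^{-1} X P^{-1} \\
& - \beta_2(s)\, \mathrm{tr}(P^{-1} X) P^{-1} Y P^{-1} + \beta_1(s) P^{-1} X P^{-1} Y P^{-1} \\
& - \beta_1(s) P^{-1} \left( \frac{d\pi_t(Y)}{dt} \right)_{t=0} P^{-1} + \beta_1(s) P^{-1} Y P^{-1} X P^{-1} = 0,
\end{aligned}
$$

where $s = \det P$. Hence we observe that

$$
\begin{aligned}
\beta_1(s) P^{-1} \left( \frac{d\pi_t(Y)}{dt} \right)_{t=0}
&= \beta_2(s)\, \mathrm{tr}\left\{ P^{-1} \left( \frac{d\pi_t(Y)}{dt} \right)_{t=0} \right\} I \\
&\quad + \beta_1(s) (P^{-1} X P^{-1} Y + P^{-1} Y P^{-1} X) \\
&\quad - \beta_2(s) \left\{ \mathrm{tr}(P^{-1} Y) P^{-1} X + \mathrm{tr}(P^{-1} X) P^{-1} Y \right\} \\
&\quad + \beta_3(s)\, \mathrm{tr}(P^{-1} X)\, \mathrm{tr}(P^{-1} Y) I - \beta_2(s)\, \mathrm{tr}(P^{-1} Y P^{-1} X) I.
\end{aligned}
\tag{2.25}
$$

Taking the trace for both sides of (2.25), we get

$$
\begin{aligned}
& (\beta_1(s) - n\beta_2(s))\, \mathrm{tr}\left\{ P^{-1} \left( \frac{d\pi_t(Y)}{dt} \right)_{t=0} \right\} \\
&\quad = (2\beta_1(s) - n\beta_2(s))\, \mathrm{tr}(P^{-1} X P^{-1} Y) + (n\beta_3(s) - 2\beta_2(s))\, \mathrm{tr}(P^{-1} X)\, \mathrm{tr}(P^{-1} Y).
\end{aligned}
\tag{2.26}
$$

From (2.25) and (2.26) it follows that

$$
\begin{aligned}
P^{-1} \left( \frac{d\pi_t(Y)}{dt} \right)_{t=0}
&= P^{-1} X P^{-1} Y + P^{-1} Y P^{-1} X \\
&\quad - \frac{\beta_2(s)}{\beta_1(s)} \left\{ \mathrm{tr}(P^{-1} X) P^{-1} Y + \mathrm{tr}(P^{-1} Y) P^{-1} X \right\} \\
&\quad + \frac{(\beta_3(s)\beta_1(s) - 2\beta_2(s)^2)\, \mathrm{tr}(P^{-1} X)\, \mathrm{tr}(P^{-1} Y) + \beta_2(s)\beta_1(s)\, \mathrm{tr}(P^{-1} X P^{-1} Y)}{\beta_1(s)(\beta_1(s) - n\beta_2(s))}\, I.
\end{aligned}
$$

This completes the proof.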

C Proof of Proposition 4

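Since the geometric structure $(L_s, g^{(V)})$ is also invariant under the transformation $\phi_G$
where $G \in SL(n, \mathbb{R})$, it suffices to consider at $\kappa I \in L_s$, where $\kappa = s^{1/n}$.
Let $\tilde{X} \in \mathcal{X}(L_s)$ be a vector field defined at each $P \in L_s$ by

$$ \tilde{X} = \sum_i \tilde{X}^i(P) \frac{\partial}{\partial x^i} = P^{1/2} X P^{1/2}, \qquad X \in T_I L_1 = \{X \mid \mathrm{tr}(X) = 0,\ X = X^T\}, $$

where $\tilde{X}^i$ are certain smooth functions on $L_s$. Consider the curve $P_t = \kappa \exp Xt \in L_s$
starting at $t = 0$ and a vector field $\tilde{Y}$ along $P_t$ defined by

$$ \tilde{Y}_{P_t} = P_t^{1/2} Y_t P_t^{1/2} = \sum_i \tilde{Y}^i(P_t) \frac{\partial}{\partial x^i}, $$

where $Y_t$ is an arbitrary smooth curve in $T_I L_1$ with $Y_0 = Y$ and $\tilde{Y}^i$ are smooth
functions on $P_t$. We show that the $(T_{\kappa I} L_s)^{\perp}$-component of $(\hat{\nabla}^{(V)}_{\tilde{X}} \tilde{Y})_{\kappa I}$, i.e., the
covariant derivative at $\kappa I$ orthogonal to $T_{\kappa I} L_s$, vanishes for any $X$ and $Y \in T_I L_1$ if
and only if $\beta_2(s) = 0$.
We see that

$$ (P_t)_{t=0} = \kappa I, \qquad \left( \frac{dP_t}{dt} \right)_{t=0} = \kappa X, \qquad \tilde{Y}_{\kappa I} = \kappa Y $$

hold. Note that

$$ \frac{d}{dt} \tilde{Y}_{P_t} = \frac{1}{2} (X \tilde{Y}_{P_t} + \tilde{Y}_{P_t} X) + P_t^{1/2} \frac{dY_t}{dt} P_t^{1/2}, $$

then using (2.13) and Corollary 1, we obtain

$$
\begin{aligned}
\left( \hat{\nabla}^{(V)}_{\tilde{X}} \tilde{Y} \right)_{\kappa I}
&= \left( \frac{d}{dt} \tilde{Y}_{P_t} \right)_{t=0} + \left( \sum_{i,j} \tilde{X}^i \tilde{Y}^j\, \hat{\nabla}^{(V)}_{\frac{\partial}{\partial x^i}} \frac{\partial}{\partial x^j} \right)_{\kappa I} \\
&= \frac{\kappa}{2} (XY + YX) + \kappa \left( \frac{d}{dt} Y_t \right)_{t=0}
 - \frac{1}{2} \left\{ \kappa (XY + YX) + \Phi(\kappa X, \kappa Y, \kappa I) + \Phi^{\perp}(\kappa X, \kappa Y, \kappa I) \right\} \\
&= \kappa \left( \frac{d}{dt} Y_t \right)_{t=0} - \frac{1}{2} \Phi^{\perp}(\kappa X, \kappa Y, \kappa I).
\end{aligned}
$$

For the third equality we have used that $\Phi(\kappa X, \kappa Y, \kappa I) = 0$ for any $X$ and $Y \in T_I L_1$.
Since it holds that

$$ g^{(V)}_{\kappa I}\left( \left( \hat{\nabla}^{(V)}_{\tilde{X}} \tilde{Y} \right)_{\kappa I}, I \right) = \kappa^{-1} (-\beta_1(s) + \beta_2(s) n)\, \mathrm{tr}\left( \left( \hat{\nabla}^{(V)}_{\tilde{X}} \tilde{Y} \right)_{\kappa I} \right) $$

and $-\beta_1(s) + \beta_2(s) n \neq 0$ by (2.5), the $(T_{\kappa I} L_s)^{\perp}$-component of $(\hat{\nabla}^{(V)}_{\tilde{X}} \tilde{Y})_{\kappa I}$ vanishes
for any $X$ and $Y \in T_I L_1$ if and only if

$$ \mathrm{tr}\left( \left( \hat{\nabla}^{(V)}_{\tilde{X}} \tilde{Y} \right)_{\kappa I} \right) = -\frac{1}{2}\, \mathrm{tr}\, \Phi^{\perp}(\kappa X, \kappa Y, \kappa I) = 0. $$

Here, we have used $\mathrm{tr}((dY_t/dt)_{t=0}) = 0$. The above equality is equivalent to
$\beta_2(s) = 0$. Hence, we conclude that the statement holds.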

D Proof of Proposition 6
 
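The statements (i) and (ii) follow from direct calculations. Since $({}^*\tilde{\nabla}^{(V)}_{\tilde{X}} \tilde{Y})_P$ is
the orthogonal projection of $({}^*\nabla^{(V)}_{\tilde{X}} \tilde{Y})_P$ to $T_P L_s$ with respect to $g^{(V)}_P$, it can be
represented by

$$ \left( {}^*\tilde{\nabla}^{(V)}_{\tilde{X}} \tilde{Y} \right)_P = \left( {}^*\nabla^{(V)}_{\tilde{X}} \tilde{Y} \right)_P - \delta P, \qquad \delta \in \mathbb{R}, $$

where $\delta$ is determined from the orthogonality condition

$$ g^{(V)}_P\left( \left( {}^*\tilde{\nabla}^{(V)}_{\tilde{X}} \tilde{Y} \right)_P, P \right) = 0. $$

Similarly to the proof of Proposition 4 where $\kappa = s^{1/n}$, we see that

$$
\begin{aligned}
\left( {}^*\nabla^{(V)}_{\tilde{X}} \tilde{Y} \right)_{\kappa I}
&= \left( \frac{d}{dt} \tilde{Y}_{P_t} \right)_{t=0} + \left( \sum_{i,j} \tilde{X}^i \tilde{Y}^j\, {}^*\nabla^{(V)}_{\frac{\partial}{\partial x^i}} \frac{\partial}{\partial x^j} \right)_{\kappa I} \\
&= \frac{\kappa}{2} (XY + YX) + \kappa \left( \frac{d}{dt} Y_t \right)_{t=0}
 - \left\{ \kappa (XY + YX) + \Phi(\kappa X, \kappa Y, \kappa I) + \Phi^{\perp}(\kappa X, \kappa Y, \kappa I) \right\} \\
&= \kappa \left( \frac{d}{dt} Y_t \right)_{t=0} - \frac{\kappa}{2} (XY + YX) - \Phi^{\perp}(\kappa X, \kappa Y, \kappa I).
\end{aligned}
$$

Since $\Phi^{\perp}(\kappa X, \kappa Y, \kappa I) \in (T_{\kappa I} L_s)^{\perp}$ and $(dY_t/dt)_{t=0} \in T_{\kappa I} L_s$, the orthogonal
projection of $({}^*\nabla^{(V)}_{\tilde{X}} \tilde{Y})_{\kappa I}$ to $T_{\kappa I} L_s$ is that of $\kappa (dY_t/dt)_{t=0} - \kappa (YX + XY)/2$. Thus, from
the orthogonality condition we have

$$ \left( {}^*\tilde{\nabla}^{(V)}_{\tilde{X}} \tilde{Y} \right)_{\kappa I} = \kappa \left( \frac{d}{dt} Y_t \right)_{t=0} - \frac{\kappa}{2} (XY + YX) + \frac{\kappa}{n}\, \mathrm{tr}(XY) I, $$

which is independent of $V(s)$.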

E Proof of Theorem 4

For $P$ and $Q$ in $PD(n, \mathbb{R})$, we write the two density functions in a U-model briefly as
$f_P(x) = f(x, P)$ and $f_Q(x) = f(x, Q)$.
It suffices to show that the dual canonical divergence ${}^*D^{(V)}(P, Q) = D^{(V)}(Q, P)$ of
$(PD(n, \mathbb{R}), \nabla, g^{(V)})$ given by (2.10) coincides with $D_U(f_P, f_Q)$. Note that an exchange
of the order of the two arguments in a divergence only exchanges the definitions
of the primal and dual affine connections in (2.3) but does not affect the whole dualistic
structure of the induced geometry.
Recalling (2.6), we have

$$ \mathrm{grad}\, \alpha^{(V)}(P) = V'(\det P) \det P\, P^{-1} = \beta_1(\det P) P^{-1}, $$

where $V'$ denotes the derivative of $V$ by $s$. On the other hand, we can directly
differentiate $\alpha^{(V)}(P)$ defined via (2.24):

$$
\begin{aligned}
\mathrm{grad}\, \alpha^{(V)}(P)
&= \mathrm{grad} \left\{ \int U\left( -\tfrac{1}{2} x^T P x - c_U(\det P) \right) dx + c_U(\det P) \right\} \\
&= \int f_P(x) \left\{ -\tfrac{1}{2} x x^T - c_U'(\det P) \det P\, P^{-1} \right\} dx + c_U'(\det P) \det P\, P^{-1} \\
&= -\frac{1}{2} \int f_P(x)\, x x^T dx = -\frac{1}{2} E_P(x x^T),
\end{aligned}
$$

where $E_P$ is the expectation operator with respect to $f_P(x)$. Thus, we have

$$ \beta_1(\det P) P^{-1} = -\frac{1}{2} E_P(x x^T). \tag{2.27} $$

Note that

$$ \varepsilon(f_P) = -\tfrac{1}{2} x^T P x - c_U(\det P), \qquad \varepsilon(f_Q) = -\tfrac{1}{2} x^T Q x - c_U(\det Q) $$

because $\varepsilon \circ u$ is the identity. From the definition, the U-divergence is

$$
\begin{aligned}
D_U(f_P, f_Q)
&= \int \Big[ U\left( -\tfrac{1}{2} x^T Q x - c_U(\det Q) \right) - U\left( -\tfrac{1}{2} x^T P x - c_U(\det P) \right) \\
&\qquad - f_P(x) \left\{ -\tfrac{1}{2} x^T Q x - c_U(\det Q) + \tfrac{1}{2} x^T P x + c_U(\det P) \right\} \Big] dx \\
&= \alpha^{(V)}(Q) - \alpha^{(V)}(P) + \frac{1}{2} E_P\left[ x^T Q x - x^T P x \right].
\end{aligned}
$$

Using (2.27), the third term is expressed by

$$ \frac{1}{2} E_P\left[ x^T Q x - x^T P x \right] = \frac{1}{2}\, \mathrm{tr}\{E_P(x x^T)(Q - P)\} = \beta_1(\det P)\, \mathrm{tr}(P^{-1}(P - Q)). $$

Hence, $D_U(f_P, f_Q) = {}^*D^{(V)}(P, Q) = D^{(V)}(Q, P)$.

References

1. Amari, S.: Differential-geometrical methods in statistics, Lecture notes in Statistics. vol. 28,
Springer, New York (1985)
2. Amari, S., Nagaoka, H.: Methods of information geometry, AMS & OUP, Oxford (2000)
3. Dawid, A.P.: The geometry of proper scoring rules. Ann. Inst. Stat. Math. 59, 77–93 (2007)
4. Eguchi, S.: Information geometry and statistical pattern recognition. Sugaku Expositions Amer.
Math. Soc. 19, 197–216 (2006) (originally Sūgaku, 56, 380–399 (2004) in Japanese)
5. Eguchi, S.: Information divergence geometry and the application to statistical machine learning.
In: Emmert-Streib, F., Dehmer, M. (eds.) Information Theory and Statistical Learning, pp. 309–
332. Springer, New York (2008)
6. Eguchi, S., Copas, J.: A class of logistic-type discriminant functions. Biometrika 89(1), 1–22
(2002)
7. Eguchi, S., Komori, O., Kato, S.: Projective power entropy and maximum tsallis entropy dis-
tributions. Entropy 13, 1746–1764 (2011)
8. Fang, K.T., Kotz, S., Ng, K.W.: Symmetric Multivariate and Related Distributions. Chapman
and Hall, London (1990)
9. Faraut, J., Korányi, A.: Analysis on Symmetric Cones. Oxford University Press, New York
(1994)
10. Grünwald, P.D., Dawid, A.P.: Game theory, maximum entropy, minimum discrepancy and
robust Bayesian decision theory. Ann. Stat. 32, 1367–1433 (2004)
11. Hao, J.H., Shima, H.: Level surfaces of nondegenerate functions in $\mathbb{R}^{n+1}$. Geom. Dedicata
50(2), 193–204 (1994)
12. Helgason, S.: Differential Geometry and Symmetric Spaces. Academic Press, New York (1962)
13. Higuchi, I., Eguchi, S.: Robust principal component analysis with adaptive selection for tuning
parameters. J. Mach. Learn. Res. 5, 453–471 (2004)
14. Kanamori, T., Ohara, A.: A bregman extension of quasi-newton updates I: an information
geometrical framework. Optim. Methods Softw. 28(1), 96–123 (2013)
15. Kakihara, S., Ohara, A., Tsuchiya, T.: Information geometry and interior-point algorithms in
semidefinite programs and symmetric cone programs. J. Optim. Theory Appl. 157(3), 749–780
(2013)
16. Kass, R.E., Vos, P.W.: Geometrical Foundations of Asymptotic Inference. Wiley, New York
(1997)
17. Koecher, M.: The Minnesota Notes on Jordan Algebras and their Applications. Springer, Berlin
(1999)
18. Kullback, S.: Information Theory and Statistics. Wiley, New York (1959)
19. Kurose, T.: Dual connections and affine geometry. Math. Z. 203(1), 115–121 (1990)
20. Kurose, T.: On the divergences of 1-conformally flat statistical manifolds. Tohoku Math. J.
46(3), 427–433 (1994)
21. Lauritzen, S.: Statistical manifolds. In: Amari, S.-I., et al. (eds.) Differential Geometry in
Statistical Inference, Institute of Mathematical Statistics, Hayward (1987)
22. Minami, M., Eguchi, S.: Robust blind source separation by beta-divergence. Neural Comput.
14, 1859–1886 (2002)
23. Muirhead, R.J.: Aspects of Multivariate Statistical Theory. Wiley, New York (1982)
24. Murata, N., Takenouchi, T., Kanamori, T., Eguchi, S.: Information geometry of u-boost and
bregman divergence. Neural Comput. 16, 1437–1481 (2004)
25. Murray, M.K., Rice, J.W.: Differential Geometry and Statistics. Chapman & Hall, London
(1993)

26. Naudts, J.: Continuity of a class of entropies and relative entropies. Rev. Math. Phys. 16,
809–822 (2004)
27. Naudts, J.: Estimators, escort probabilities, and δ-exponential families in statistical physics. J.
Ineq. Pure Appl. Math. 5, 102 (2004)
28. Nesterov, Y.E., Todd, M.J.: Primal-dual interior-point methods for self-scaled cones. SIAM J.
Optim. 8, 324–364 (1998)
29. Nomizu, K., Sasaki, T.: Affine differential geometry. Cambridge University Press, Cambridge
(1994)
30. Ohara, A.: Geodesics for dual connections and means on symmetric cones. Integr. Eqn. Oper.
Theory 50, 537–548 (2004)
31. Ohara, A., Amari, S.: Differential geometric structures of stable state feedback systems with
dual connections. Kybernetika 30(4), 369–386 (1994)
32. Ohara, A., Eguchi, S.: Geometry on positive definite matrices induced from V-potential func-
tion. In: Nielsen, F., Barbaresco, F. (eds.) Geometric Science of Information; Lecture Notes in
Computer Science 8085, pp. 621–629. Springer, Berlin (2013)
33. Ohara, A., Eguchi, S.: Group invariance of information geometry on q-gaussian distributions
induced by beta-divergence. Entropy 15, 4732–4747 (2013)
34. Ohara, A., Suda, N., Amari, S.: Dualistic differential geometry of positive definite matrices
and its applications to related problems. Linear Algebra Appl. 247, 31–53 (1996)
35. Ohara, A., Wada, T.: Information geometry of q-Gaussian densities and behaviors of solutions
to related diffusion equations. J. Phys. A: Math. Theor. 43, 035002 (18pp.) (2010)
36. Ollila, E., Tyler, D., Koivunen, V., Poor, V.: Complex elliptically symmetric distributions :
survey, new results and applications. IEEE Trans. signal process. 60(11), 5597–5623 (2012)
37. Rothaus, O.S.: Domains of positivity. Abh. Math. Sem. Univ. Hamburg 24, 189–235 (1960)
38. Sasaki, T.: Hyperbolic affine hyperspheres. Nagoya Math. J. 77, 107–123 (1980)
39. Scott, D.W.: Parametric statistical modeling by minimum integrated square error. Technomet-
rics 43, 274–285 (2001)
40. Shima, H.: The geometry of Hessian structures. World Scientific, Singapore (2007)
41. Takenouchi, T., Eguchi, S.: Robustifying adaboost by adding the naive error rate. Neural Com-
put. 16(4), 767–787 (2004)
42. Tsallis, C.: Introduction to Nonextensive Statistical Mechanics. Springer, New York (2009)
43. Uohashi, K., Ohara, A., Fujii, T.: 1-conformally flat statistical submanifolds. Osaka J. Math.
37(2), 501–507 (2000)
44. Uohashi, K., Ohara, A., Fujii, T.: Foliations and divergences of flat statistical manifolds.
Hiroshima Math. J. 30(3), 403–414 (2000)
45. Vinberg, E.B.: The theory of convex homogeneous cones. Trans. Moscow Math. Soc. 12,
340–430 (1963)
46. Wolkowicz, H., et al. (eds.): Handbook of Semidefinite Programming. Kluwer Academic Pub-
lishers, Boston (2000)
Chapter 3
Hessian Structures and Divergence Functions
on Deformed Exponential Families

Hiroshi Matsuzoe and Masayuki Henmi

Abstract A Hessian structure $(\nabla, h)$ on a manifold is a pair of a flat affine
connection $\nabla$ and a semi-Riemannian metric $h$ which is given by a Hessian of some
function. In information geometry, it is known that an exponential family naturally
has dualistic Hessian structures and their canonical divergences coincide with
the Kullback-Leibler divergences, which are also called the relative entropies. A
deformed exponential family is a generalization of exponential families. A deformed
exponential family naturally has two kinds of dualistic Hessian structures and conformal
structures of Hessian metrics. In this paper, the geometry of such Hessian structures
and conformal structures is summarized. In addition, divergence functions on these
Hessian manifolds are constructed from the viewpoint of estimating functions. As an
application of such Hessian structures to statistics, a generalization of independence
and the geometry of the generalized maximum likelihood method are studied.

Keywords Hessian manifold · Statistical manifold · Deformed exponential family · Divergence · Information geometry · Tsallis statistics

3.1 Introduction

In information geometry, an exponential family is a useful statistical model and it
is applied to various fields of statistical sciences (cf. [1]). For example, the set of
Gaussian distributions is an exponential family. It is known that an exponential family
can be naturally regarded as a Hessian manifold [28], which is also called a dually flat
space [1] or a flat statistical manifold [12]. A pair of dually flat affine connections has
essential roles in the geometric theory of statistical inferences. In addition, a Hessian
manifold has an asymmetric squared-distance-like function, called the canonical
divergence. On an exponential family, the canonical divergence coincides with the
Kullback-Leibler divergence or the relative entropy. (See Sect. 3.3.)

H. Matsuzoe (B)
Department of Computer Science and Engineering, Graduate School of Engineering,
Nagoya Institute of Technology, Gokiso-cho, Showa-ku, Nagoya 466-8555, Japan
e-mail: [email protected]
M. Henmi
The Institute of Statistical Mathematics, 10-3 Midori-cho, Tachikawa, Tokyo 190-8562, Japan
e-mail: [email protected]

F. Nielsen (ed.), Geometric Theory of Information,
Signals and Communication Technology, DOI: 10.1007/978-3-319-05317-2_3,
© Springer International Publishing Switzerland 2014

A deformed exponential family is a generalization of exponential families, which
was introduced in anomalous statistical physics [22]. (See also [23, 32] and [33].)
A deformed exponential family naturally has two kinds of dualistic Hessian structures,
and such geometric structures are independently studied in machine learning
theory [21] and statistical physics [3, 26], etc. For example, a q-exponential family
is a typical example of deformed exponential families. One of the Hessian structures on
a q-exponential family is related to the geometry of β-divergences (or density power
divergences [5]). The other Hessian structure is related to the geometry of α-divergences.
(In the q-exponential case, these geometries are studied in [18].) In addition, conformal
structures of statistical manifolds play important roles in the geometry of deformed
exponential families.
In this paper, we summarize such Hessian structures and conformal structures on
deformed exponential families. Then we construct a generalized relative entropy from
the viewpoint of estimating functions. As an application, we consider a generalization
of independence of random variables, and then elucidate the geometry of the maximum
q-likelihood estimator. This paper is based on the proceedings paper [19].

3.2 Preliminaries

In this paper, we assume that all objects are smooth, and a manifold $M$ is an open
domain in $\mathbb{R}^n$.
Let $(M, h)$ be a semi-Riemannian manifold, that is, $h$ is assumed to be nondegenerate
but is not necessarily positive definite (e.g. the Lorentzian metric in
relativity). Let $\nabla$ be an affine connection on $M$. We define the dual connection $\nabla^*$
of $\nabla$ with respect to $h$ by

$$ X h(Y, Z) = h(\nabla_X Y, Z) + h(Y, \nabla^*_X Z), $$

where $X$, $Y$ and $Z$ are arbitrary vector fields on $M$. It is easy to check that $(\nabla^*)^* = \nabla$.
For an affine connection $\nabla$, we define the curvature tensor field $R$ and the torsion
tensor field $T$ by

$$
\begin{aligned}
R(X, Y)Z &:= \nabla_X \nabla_Y Z - \nabla_Y \nabla_X Z - \nabla_{[X,Y]} Z, \\
T(X, Y) &:= \nabla_X Y - \nabla_Y X - [X, Y],
\end{aligned}
$$

where $[X, Y] := XY - YX$. We say that $\nabla$ is curvature-free if $R$ vanishes everywhere
on $M$, and that $\nabla$ is torsion-free if $T$ vanishes everywhere.
For a pair of dual affine connections, the following proposition holds (cf. [16]).

Proposition 1 Consider the conditions below:

1. $\nabla$ is torsion-free.
2. $\nabla^*$ is torsion-free.
3. $\nabla^{(0)} = (\nabla + \nabla^*)/2$ is the Levi-Civita connection with respect to $h$.
4. $\nabla h$ is totally symmetric, where $\nabla h$ is the (0, 3)-tensor field defined by

$$ (\nabla_X h)(Y, Z) := X h(Y, Z) - h(\nabla_X Y, Z) - h(Y, \nabla_X Z). $$

Assume any two of the above conditions, then the others hold.
From now on, we assume that an affine connection $\nabla$ is torsion-free.
We say that an affine connection $\nabla$ is flat if $\nabla$ is curvature-free. For a flat affine
connection $\nabla$, there locally exists a coordinate system $\{\theta^i\}$ on $M$ such that the connection
coefficients $\{\Gamma^{\nabla\, k}_{ij}\}$ $(i, j, k = 1, \ldots, n)$ of $\nabla$ vanish on its coordinate neighbourhood.
We call such a coordinate system $\{\theta^i\}$ an affine coordinate system.
Let $(M, h)$ be a semi-Riemannian manifold, and let $\nabla$ be a flat affine connection
on $M$. We say that the pair $(\nabla, h)$ is a Hessian structure on $M$ if there exists a function
$\psi$, at least locally, such that $h = \nabla d\psi$ [28]. In the coordinate form, the following
formula holds:

$$ h_{ij}(p(\theta)) = \frac{\partial^2}{\partial\theta^i \partial\theta^j} \psi(p(\theta)), $$

where $p$ is an arbitrary point in $M$ and $\{\theta^i\}$ is a $\nabla$-affine coordinate system around $p$.
Under the same assumption, we call the triplet $(M, \nabla, h)$ a Hessian manifold. For a
Hessian manifold $(M, \nabla, h)$, we define a totally symmetric (0, 3)-tensor field $C$ by
$C := \nabla h$. We call $C$ the cubic form for $(M, \nabla, h)$.
For a semi-Riemannian manifold $(M, h)$ with a torsion-free affine connection $\nabla$,
the triplet $(M, \nabla, h)$ is said to be a statistical manifold if $\nabla h$ is totally symmetric [12].
Originally, the triplet $(M, g, C)$ is called a statistical manifold [14], where $(M, g)$
is a Riemannian manifold and $C$ is a totally symmetric (0, 3)-tensor field on $M$.
From Proposition 1, these definitions are essentially equivalent. In fact, for a
semi-Riemannian manifold $(M, h)$ with a totally symmetric (0, 3)-tensor field $C$, we can
define mutually dual torsion-free affine connections $\nabla$ and $\nabla^*$ by

$$ h(\nabla_X Y, Z) := h(\nabla^{(0)}_X Y, Z) - \frac{1}{2} C(X, Y, Z), \tag{3.1} $$

$$ h(\nabla^*_X Y, Z) := h(\nabla^{(0)}_X Y, Z) + \frac{1}{2} C(X, Y, Z), \tag{3.2} $$

where $\nabla^{(0)}$ is the Levi-Civita connection with respect to $h$. In this case, $\nabla h$ and $\nabla^* h$
are totally symmetric. Hence $(M, \nabla, h)$ and $(M, \nabla^*, h)$ are statistical manifolds.
A triplet $(M, \nabla, h)$ is a flat statistical manifold if and only if it is a Hessian manifold
(cf. [28]). Suppose that $R$ and $R^*$ are curvature tensors of $\nabla$ and $\nabla^*$, respectively.
Then we have

$$ h(R(X, Y)Z, V) = -h(Z, R^*(X, Y)V). $$

Hence the condition that the triplet $(M, \nabla, h)$ is a Hessian manifold is equivalent to
the condition that the quadruplet $(M, h, \nabla, \nabla^*)$ is a dually flat space [1].
For a Hessian manifold $(M, \nabla, h)$, we suppose that $\{\theta^i\}$ is a $\nabla$-affine coordinate
system on $M$. Then there exists a $\nabla^*$-affine coordinate system $\{\eta_i\}$ such that

$$ h\left( \frac{\partial}{\partial\theta^i}, \frac{\partial}{\partial\eta_j} \right) = \delta_i^j. $$

We call $\{\eta_i\}$ the dual coordinate system of $\{\theta^i\}$ with respect to $h$.

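Proposition 2 Let $(M, \nabla, h)$ be a Hessian manifold. Suppose that $\{\theta^i\}$ is a $\nabla$-affine
coordinate system, and $\{\eta_i\}$ is the dual coordinate system of $\{\theta^i\}$. Then there exist
functions $\psi$ and $\varphi$ on $M$ such that

$$ \frac{\partial\psi}{\partial\theta^i} = \eta_i, \qquad \frac{\partial\varphi}{\partial\eta_i} = \theta^i, \qquad \psi(p) + \varphi(p) - \sum_{i=1}^n \theta^i(p)\eta_i(p) = 0 \quad (p \in M), \tag{3.3} $$

$$ h_{ij} = \frac{\partial^2\psi}{\partial\theta^i \partial\theta^j}, \qquad h^{ij} = \frac{\partial^2\varphi}{\partial\eta_i \partial\eta_j}, $$

where $(h_{ij})$ is the component matrix of a semi-Riemannian metric $h$ with respect to
$\{\theta^i\}$, and $(h^{ij})$ is the inverse matrix of $(h_{ij})$. Moreover,

$$ C_{ijk} = \frac{\partial^3\psi}{\partial\theta^i \partial\theta^j \partial\theta^k} \tag{3.4} $$

is the cubic form of $(M, \nabla, h)$.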

For proof, see [1] and [28]. The functions $\psi$ and $\varphi$ are called the θ-potential and
the η-potential, respectively. From the above proposition, the Hessians of the θ-potential
and the η-potential coincide with the semi-Riemannian metric $h$:

$$ \frac{\partial\eta_i}{\partial\theta^j} = \frac{\partial^2\psi}{\partial\theta^i \partial\theta^j} = h_{ij}, \qquad \frac{\partial\theta^i}{\partial\eta_j} = \frac{\partial^2\varphi}{\partial\eta_i \partial\eta_j} = h^{ij}. \tag{3.5} $$

In addition, we obtain the original flat connection $\nabla$ and its dual $\nabla^*$ from the potential
function $\psi$. From Eq. (3.4), we have the cubic form of the Hessian manifold $(M, \nabla, h)$.
Then we obtain two affine connections $\nabla$ and $\nabla^*$ by Eqs. (3.1), (3.2) and (3.4).
Under the same assumptions as in Proposition 2, we define a function $D$ on $M \times M$
by

$$ D(p, r) := \psi(p) + \varphi(r) - \sum_{i=1}^n \theta^i(p)\eta_i(r), \qquad (p, r \in M). $$

We call $D$ the canonical divergence of $(M, \nabla, h)$. The definition is independent of the
choice of an affine coordinate system. The canonical divergence is an asymmetric
squared-distance-like function on $M$. In particular, the canonical divergence $D$ is
non-negative if the metric $h$ is positive definite. However, we assumed that $h$ is a
semi-Riemannian metric, hence $D$ can take negative values (cf. [12] and [15]).
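
As a concrete illustration, the canonical divergence can be evaluated for a simple potential. The following is a minimal sketch, assuming NumPy and the θ-potential ψ(θ) = Σ exp(θ^i) on R^n, which is chosen here only because its Legendre dual is available in closed form; the function names are hypothetical and do not come from the text.

```python
import numpy as np


def psi(theta):
    """Assumed theta-potential: psi(theta) = sum_i exp(theta_i)."""
    return float(np.sum(np.exp(theta)))


def eta_from_theta(theta):
    """Dual coordinates eta_i = d psi / d theta_i = exp(theta_i)."""
    return np.exp(theta)


def phi(eta):
    """Eta-potential (Legendre dual of psi): sum_i (eta_i log eta_i - eta_i)."""
    return float(np.sum(eta * np.log(eta) - eta))


def canonical_divergence(theta_p, theta_r):
    """D(p, r) = psi(p) + phi(r) - sum_i theta^i(p) eta_i(r)."""
    eta_r = eta_from_theta(theta_r)
    return psi(theta_p) + phi(eta_r) - float(theta_p @ eta_r)


theta_p = np.array([0.2, -1.0, 0.5])
theta_r = np.array([-0.3, 0.4, 1.1])

# Third relation in (3.3): psi + phi - theta . eta vanishes at a single point.
legendre_gap = psi(theta_p) + phi(eta_from_theta(theta_p)) \
    - float(theta_p @ eta_from_theta(theta_p))

d_pr = canonical_divergence(theta_p, theta_r)
d_rp = canonical_divergence(theta_r, theta_p)
print(legendre_gap, d_pr, d_rp)
```

Since the Hessian of this ψ is positive definite, both divergences come out positive here, and the two orderings of the arguments give different values, which illustrates the asymmetry.
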
We remark that the canonical divergence induces the original Hessian manifold
$(M, \nabla, h)$ by Eguchi's relation [7]. Suppose that $D$ is a function on $M \times M$. We define
a function on $M$ by the following formula:

$$ D[X_1, \ldots, X_i | Y_1, \ldots, Y_j](p) := (X_1)_p \cdots (X_i)_p (Y_1)_r \cdots (Y_j)_r D(p, r)|_{p=r}, $$

where $X_1, \ldots, X_i$ and $Y_1, \ldots, Y_j$ are vector fields on $M$. We say that $D$ is a contrast
function on $M \times M$ if

1. $D[\ |\ ](p) = D(p, p) = 0$,
2. $D[X|\ ](p) = D[\ |X](p) = 0$,
3. $h(X, Y) := -D[X|Y]$ (3.6)
is a semi-Riemannian metric on $M$.

For a contrast function $D$ on $M \times M$, we define a pair of affine connections by

$$
\begin{aligned}
h(\nabla_X Y, Z) &= -D[XY|Z], \\
h(Y, \nabla^*_X Z) &= -D[Y|XZ].
\end{aligned}
$$

By differentiating Eq. (3.6), two affine connections $\nabla$ and $\nabla^*$ are mutually dual with
respect to $h$. We can check that $\nabla$ and $\nabla^*$ are torsion-free, and $\nabla h$ and $\nabla^* h$ are totally
symmetric. Hence triplets $(M, \nabla, h)$ and $(M, \nabla^*, h)$ are statistical manifolds. We call
$(M, \nabla, h)$ the induced statistical manifold from a contrast function $D$. If $(M, \nabla, h)$ is
a Hessian manifold, we say that $(M, \nabla, h)$ is the induced Hessian manifold from $D$.

Proposition 3 Suppose that D is the canonical divergence on a Hessian manifold


(M, ⊂, h). Then D is a contrast function on M × M which induces the original
Hessian manifold (M, ⊂, h).

Proof From the definition and Eq. (3.3), we have D[ | ] = 0 and D[X| ] = D[ |X] = 0.
Let {θ^i} be a ∇-affine coordinate system and {η_j} the dual affine coordinate system
of {θ^j}. Set ∂_i = ∂/∂θ^i. From Eqs. (3.3) and (3.5), we have

$$D[\partial_i \mid \partial_j](p) = (\partial_i)_p (\partial_j)_r D(p, r)\big|_{p=r} = (\partial_j)_r \{\eta_i(p) - \eta_i(r)\}\big|_{p=r} = -(\partial_j)_r \eta_i(r)\big|_{p=r} = -h_{ij}(p).$$

This implies that the canonical divergence D is a contrast function on M × M. The
induced affine connections are given by

$$\Gamma_{ij,k} = -D[\partial_i \partial_j \mid \partial_k] = -(\partial_i)_p (\partial_k)_r \{\eta_j(p) - \eta_j(r)\}\big|_{p=r} = (\partial_i)_p (\partial_k)_r\, \eta_j(r)\big|_{p=r} = 0,$$

$$\Gamma^{*}_{ik,j} = -D[\partial_j \mid \partial_i \partial_k] = -(\partial_i)_r (\partial_k)_r \{\eta_j(p) - \eta_j(r)\}\big|_{p=r} = (\partial_i)_r (\partial_k)_r\, \eta_j(r)\big|_{p=r} = (\partial_i)_r (\partial_k)_r (\partial_j)_r\, \psi(r)\big|_{p=r} = C_{ikj},$$

where Γ_{ij,k} and Γ∗_{ik,j} are the Christoffel symbols of the first kind of ∇ and ∇∗,
respectively. From Eqs. (3.1) and (3.2), since h is nondegenerate, the affine connection ∇
coincides with the original one of (M, ∇, h). □
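
Eguchi's relation h = −D[∂ | ∂] used in the proof can be checked numerically in one dimension. The following is a rough sketch under assumptions of our own: the potential ψ(θ) = log(1 + e^θ) is an illustrative choice, the canonical divergence is rewritten in θ-coordinates as ψ(θ_p) − ψ(θ_r) − (θ_p − θ_r)η(θ_r), and the mixed derivative is approximated by a central difference. The helper names are hypothetical.

```python
import math

def psi(theta):
    # Illustrative potential (Bernoulli family in its natural parameter).
    return math.log1p(math.exp(theta))

def eta_of(theta):
    # eta = d psi / d theta
    return 1.0 / (1.0 + math.exp(-theta))

def canonical_divergence(theta_p, theta_r):
    # psi(p) + phi(r) - theta(p) eta(r), with phi(r) = theta_r * eta_r - psi(theta_r)
    return psi(theta_p) - psi(theta_r) - (theta_p - theta_r) * eta_of(theta_r)

def mixed_second_difference(D, t, h=1e-3):
    # Central-difference approximation of (d/dp)(d/dr) D(p, r) at p = r = t.
    return (D(t + h, t + h) - D(t + h, t - h)
            - D(t - h, t + h) + D(t - h, t - h)) / (4.0 * h * h)

def induced_metric(D, t):
    # h = -D[d | d], restricted to the diagonal p = r.
    return -mixed_second_difference(D, t)

if __name__ == "__main__":
    t = 0.4
    hessian = eta_of(t) * (1.0 - eta_of(t))   # psi''(t) in closed form
    print(induced_metric(canonical_divergence, t), hessian)
```

The two printed numbers should agree up to discretization error, which is the one-dimensional content of the identity D[∂_i | ∂_j] = −h_ij.
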

At the end of this section, we review generalized conformal equivalence for statistical
manifolds. Fix a number α ∈ R. We say that two statistical manifolds (M, ∇, h)
and (M, ∇̄, h̄) are α-conformally equivalent if there exists a function ϕ on M such
that

$$\bar{h}(X, Y) = e^{\varphi} h(X, Y),$$

$$\bar{\nabla}_X Y = \nabla_X Y - \frac{1+\alpha}{2}\, h(X, Y)\, \mathrm{grad}_h \varphi + \frac{1-\alpha}{2} \{ d\varphi(Y)\, X + d\varphi(X)\, Y \},$$

where grad_h ϕ is the gradient vector field of ϕ with respect to h, that is,

$$h(\mathrm{grad}_h \varphi, X) := X\varphi.$$

(The vector field grad_h ϕ is often called the natural gradient of ϕ in neurosciences,
etc.) We say that a statistical manifold (M, ∇, h) is α-conformally flat if it is locally
α-conformally equivalent to some Hessian manifold [12].
Suppose that D and D̄ are contrast functions on M × M. We say that D and D̄ are
α-conformally equivalent if there exists a function ϕ on M such that

$$\bar{D}(p, r) = \exp\left[\frac{1+\alpha}{2}\varphi(p)\right] \exp\left[\frac{1-\alpha}{2}\varphi(r)\right] D(p, r).$$

In this case, the induced statistical manifolds (M, ∇, h) and (M, ∇̄, h̄) from D and D̄,
respectively, are α-conformally equivalent.
Historically, conformal equivalence of statistical manifolds was introduced in the
asymptotic theory of sequential estimation [27]. (See also [11].) It was then generalized
in affine differential geometry (e.g. [10, 12, 13] and [17]). As we will see in Sects. 3.5
and 3.6, conformal structures on a deformed exponential family play important roles.
(See also [2, 20, 24] and [25].)

3.3 Statistical Models

Let (Ω, F, P) be a probability space, that is, Ω is a sample space, F is a completely
additive class on Ω, and P is a probability measure on Ω. Let Ξ be an open subset
of R^n. We say that S is a statistical model if S is a set of probability density functions
on Ω with parameter ξ = ^t(ξ^1, ..., ξ^n) ∈ Ξ such that

$$S := \left\{ p(x; \xi) \;\middle|\; \int_\Omega p(x; \xi)\,dx = 1,\ p(x; \xi) > 0,\ \xi \in \Xi \subset \mathbf{R}^n \right\}.$$

Under suitable conditions, S can be regarded as a manifold with local coordinate
system {ξ^i} [1]. In particular, we assume that we can interchange differentials and
integrals. Hence, the equation below holds:

$$\int_\Omega \left( \frac{\partial}{\partial \xi^i} p(x; \xi) \right) dx = \frac{\partial}{\partial \xi^i} \int_\Omega p(x; \xi)\,dx = \frac{\partial}{\partial \xi^i} 1 = 0.$$

For a statistical model S, we define the Fisher information matrix g^F(ξ) = (g^F_{ij}(ξ))
by

$$g^F_{ij}(\xi) := \int_\Omega \left( \frac{\partial}{\partial \xi^i} \log p(x; \xi) \right) \left( \frac{\partial}{\partial \xi^j} \log p(x; \xi) \right) p(x; \xi)\,dx = E_p[\partial_i l_\xi\, \partial_j l_\xi], \tag{3.7}$$

where ∂_i = ∂/∂ξ^i, l_ξ = l(x; ξ) = log p(x; ξ), and E_p[f] is the expectation of f(x)
with respect to p(x; ξ). The Fisher information matrix g^F is positive semidefinite
in general. Assuming that g^F is positive definite and all its components are finite,
g^F can be regarded as a Riemannian metric on S. We call g^F the Fisher metric on S.
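The Fisher metric g^F has the following representations:

$$g^F_{ij}(\xi) = \int_\Omega \left( \frac{\partial}{\partial \xi^i} p(x; \xi) \right) \left( \frac{\partial}{\partial \xi^j} \log p(x; \xi) \right) dx \tag{3.8}$$

$$\phantom{g^F_{ij}(\xi)} = \int_\Omega \frac{1}{p(x; \xi)} \left( \frac{\partial}{\partial \xi^i} p(x; \xi) \right) \left( \frac{\partial}{\partial \xi^j} p(x; \xi) \right) dx. \tag{3.9}$$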
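
That the three representations (3.7), (3.8) and (3.9) agree can be seen on a toy model. The sketch below is an assumption-laden illustration rather than anything from the text: it uses the Bernoulli model on {0, 1} with success probability ξ, replaces the integral over Ω by a finite sum, and approximates the derivatives by central differences. All names are hypothetical.

```python
import math

def pmf(xi):
    # Bernoulli model on {0, 1} with success probability xi (illustrative choice).
    return [1.0 - xi, xi]

def d_pmf(xi, h=1e-6):
    # Central-difference approximation of d p(x; xi) / d xi for each x.
    lo, hi = pmf(xi - h), pmf(xi + h)
    return [(b - a) / (2.0 * h) for a, b in zip(lo, hi)]

def d_log_pmf(xi, h=1e-6):
    # Central-difference approximation of the score d log p(x; xi) / d xi.
    lo, hi = pmf(xi - h), pmf(xi + h)
    return [(math.log(b) - math.log(a)) / (2.0 * h) for a, b in zip(lo, hi)]

def fisher_three_ways(xi):
    p, dp, dl = pmf(xi), d_pmf(xi), d_log_pmf(xi)
    g_37 = sum(l * l * w for l, w in zip(dl, p))    # E_p[(d log p)^2], as in (3.7)
    g_38 = sum(a * l for a, l in zip(dp, dl))       # sum (dp)(d log p), as in (3.8)
    g_39 = sum(a * a / w for a, w in zip(dp, p))    # sum (dp)^2 / p,    as in (3.9)
    return g_37, g_38, g_39

if __name__ == "__main__":
    xi = 0.3
    print(fisher_three_ways(xi), 1.0 / (xi * (1.0 - xi)))
```

For this model the closed form is 1/(ξ(1 − ξ)), so all three numbers should match it up to the finite-difference error.
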

Next, let us define an affine connection on S. For a fixed α ∈ R, an α-connection
∇^(α) on S is defined by

$$\Gamma^{(\alpha)}_{ij,k}(\xi) := E_p\left[ \left( \partial_i \partial_j l_\xi + \frac{1-\alpha}{2}\, \partial_i l_\xi\, \partial_j l_\xi \right) (\partial_k l_\xi) \right],$$

where Γ^(α)_{ij,k} is the Christoffel symbol of the first kind of ∇^(α).
We remark that ∇^(0) is the Levi-Civita connection with respect to the Fisher
metric g^F. The connection ∇^(e) := ∇^(1) is called the exponential connection and
∇^(m) := ∇^(−1) is called the mixture connection. The two connections ∇^(e) and ∇^(m)
are expressed as follows:

$$\Gamma^{(e)}_{ij,k} = E_p[(\partial_i \partial_j l_\xi)(\partial_k l_\xi)] = \int_\Omega \partial_i \partial_j \log p(x; \xi)\, \partial_k p(x; \xi)\,dx, \tag{3.10}$$

$$\Gamma^{(m)}_{ij,k} = E_p[(\partial_i \partial_j l_\xi + \partial_i l_\xi\, \partial_j l_\xi)(\partial_k l_\xi)] = \int_\Omega \partial_i \partial_j p(x; \xi)\, \partial_k \log p(x; \xi)\,dx. \tag{3.11}$$

We can check that the α-connection ∇^(α) is torsion-free and ∇^(α) g^F is totally
symmetric. These imply that (S, ∇^(α), g^F) forms a statistical manifold. In addition,
it is known that the Fisher metric g^F and the α-connection ∇^(α) are independent
of the choice of dominating measures on Ω. Hence we call the triplet (S, ∇^(α), g^F) an
invariant statistical manifold. The cubic form C^F of the invariant statistical manifold
(S, ∇^(e), g^F) is given by

$$C^F_{ijk} = \Gamma^{(m)}_{ij,k} - \Gamma^{(e)}_{ij,k}.$$

A statistical model S_e is said to be an exponential family if

$$S_e := \left\{ p(x; \theta) \;\middle|\; p(x; \theta) = \exp\left[ \sum_{i=1}^{n} \theta^i F_i(x) - \psi(\theta) \right],\ \theta \in \Theta \subset \mathbf{R}^n \right\},$$

under a choice of suitable dominating measure, where F_1(x), ..., F_n(x) are functions
on the sample space Ω, θ = (θ^1, ..., θ^n) is a parameter, and ψ(θ) is a function of θ for
normalization. The following theorem is well known in information geometry [1].

Theorem 1 (cf. [1]) For an exponential family S_e, the following hold:

1. (S_e, ∇^(e), g^F) and (S_e, ∇^(m), g^F) are mutually dual Hessian manifolds, that is,
   (S_e, g^F, ∇^(e), ∇^(m)) is a dually flat space.
2. {θ^i} is a ∇^(e)-affine coordinate system on S_e.
3. For the Hessian structure (∇^(e), g^F) on S_e, ψ(θ) is the potential of g^F and C^F
   with respect to {θ^i}:

$$g^F_{ij}(\theta) = \partial_i \partial_j \psi(\theta), \qquad (\partial_i = \partial/\partial\theta^i),$$

$$C^F_{ijk}(\theta) = \partial_i \partial_j \partial_k \psi(\theta).$$

4. Set the expectation of F_i(x) by η_i := E_p[F_i(x)]. Then {η_i} is the dual affine
   coordinate system of {θ^i} with respect to g^F.
5. Set φ(η) := E_p[log p(x; θ)]. Then φ(η) is the potential of g^F with respect to {η_i}.

Since (S_e, ∇^(e), g^F) is a Hessian manifold, the formulas in Proposition 2 hold.


For a statistical model S, we define a Kullback-Leibler divergence (or a relative
entropy) by

$$D_{KL}(p, r) := \int_\Omega p(x) \log \frac{p(x)}{r(x)}\,dx = E_p[\log p(x) - \log r(x)], \qquad (p(x), r(x) \in S).$$

The Kullback-Leibler divergence D_KL on an exponential family S_e coincides with
the canonical divergence D on (S_e, ∇^(m), g^F).
We define an R^n-valued function s(x; ξ) = (s^1(x; ξ), ..., s^n(x; ξ))^T by

$$s^i(x; \xi) := \frac{\partial}{\partial \xi^i} \log p(x; \xi).$$

We call s(x; ξ) the score function of p(x; ξ) with respect to ξ. In information geometry,
s^i(x; ξ) is called the e-(exponential) representation of ∂/∂ξ^i, and ∂/∂ξ^i p(x; ξ) is
called the m-(mixture) representation. The duality of e- and m-representations is
important. In fact, Eq. (3.8) implies that the Fisher metric g^F is nothing but an L^2
inner product of e- and m-representations.
The Kullback-Leibler divergence is constructed as follows. We define a cross
entropy d_KL(p, r) by

$$d_{KL}(p, r) := -E_p[\log r(x)].$$

A cross entropy d_KL(p, r) gives a bias of information −log r(x) with respect to p(x).
A cross entropy is also called a yoke on S [4]. Intuitively, a yoke measures a dissimilarity
of two probability density functions on S. We should also note that the cross
entropy is obtained by taking the expectation with respect to p(x) of the integrated
score function at r(x). Then we have the Kullback-Leibler divergence by

$$D_{KL}(p, r) = -d_{KL}(p, p) + d_{KL}(p, r) = E_p[\log p(x) - \log r(x)].$$

The Kullback-Leibler divergence D_KL is a normalized yoke on S, which satisfies
D_KL(p, p) = 0. This argument suggests how to construct divergence functions. Once
a function like the cross entropy is defined, we can construct divergence functions in
the same way.

3.4 The Deformed Exponential Family

In this section, we review the deformed exponential family. For more details, see
[3, 22, 23] and [26]. Geometry of deformed exponential families relates to so-called
U-geometry [21].
Let χ be a strictly increasing function from (0, ∞) to (0, ∞). We define a deformed
logarithm function (or a χ-logarithm function) by

$$\log_\chi(s) := \int_1^s \frac{1}{\chi(t)}\,dt.$$

We remark that log_χ(s) is strictly increasing and satisfies log_χ(1) = 0. The domain
and the target of log_χ(s) depend on the function χ(t). Set U = {s ∈ (0, ∞) | |log_χ(s)| < ∞}
and V = {log_χ(s) | s ∈ U}. Then log_χ(s) is a function
from U to V. We also remark that the deformed logarithm is usually called the
φ-logarithm [23]. However, we use φ as the dual potential on a Hessian manifold.
A deformed exponential function (or a χ-exponential function) is defined as the
inverse of the deformed logarithm function log_χ(s):

$$\exp_\chi(t) := 1 + \int_0^t \lambda(s)\,ds,$$

where λ(s) is defined by the relation λ(log_χ(s)) := χ(s).


When χ(s) is a power function χ(s) = s^q (q > 0, q ≠ 1), the deformed logarithm
and the deformed exponential are given by

$$\log_q(s) := \frac{s^{1-q} - 1}{1 - q}, \qquad (s > 0),$$

$$\exp_q(t) := (1 + (1 - q)t)^{\frac{1}{1-q}}, \qquad (1 + (1 - q)t > 0).$$

The function log_q(s) is called the q-logarithm and exp_q(t) the q-exponential. Taking
the limit q → 1, the standard logarithm and the standard exponential are recovered,
respectively.
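
The two closed forms are easy to experiment with. The following minimal sketch implements the q-logarithm and q-exponential exactly as written above; the function names and the explicit q = 1 branch (returning the ordinary logarithm and exponential) are our own conventions for illustration.

```python
import math

def log_q(s, q):
    # q-logarithm: (s^(1-q) - 1) / (1 - q) for s > 0; ordinary log when q == 1.
    if s <= 0.0:
        raise ValueError("log_q requires s > 0")
    if q == 1.0:
        return math.log(s)
    return (s ** (1.0 - q) - 1.0) / (1.0 - q)

def exp_q(t, q):
    # q-exponential: (1 + (1-q) t)^(1/(1-q)), defined where 1 + (1-q) t > 0.
    if q == 1.0:
        return math.exp(t)
    base = 1.0 + (1.0 - q) * t
    if base <= 0.0:
        raise ValueError("exp_q requires 1 + (1 - q) t > 0")
    return base ** (1.0 / (1.0 - q))

if __name__ == "__main__":
    q = 1.5
    print(exp_q(log_q(0.7, q), q))                 # recovers 0.7 up to rounding
    print(log_q(2.0, 1.000001), math.log(2.0))     # close to each other as q -> 1
```

The domain check in `exp_q` mirrors the side condition 1 + (1 − q)t > 0 stated with the formula.
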
A statistical model S_χ is said to be a deformed exponential family (or a χ-exponential
family) if

$$S_\chi := \left\{ p(x; \theta) \;\middle|\; p(x; \theta) = \exp_\chi\left[ \sum_{i=1}^{n} \theta^i F_i(x) - \psi(\theta) \right],\ \theta \in \Theta \subset \mathbf{R}^n \right\},$$

under a choice of suitable dominating measure, where F_1(x), ..., F_n(x) are functions
on the sample space Ω, θ = (θ^1, ..., θ^n) is a parameter, and ψ(θ) is the function of θ
for normalization. We assume that S_χ is a statistical model in the sense of [1]. That is,
p(x; θ) has support entirely on Ω, there exists a one-to-one correspondence between
the parameter θ and the probability distribution p(x; θ), and differentiation and
integration are interchangeable. In addition, the functions {F_i(x)}, ψ(θ) and the
parameters {θ^i} must satisfy the anti-exponential condition. For example, in the
q-exponential case, these functions satisfy

$$\sum_{i=1}^{n} \theta^i F_i(x) - \psi(\theta) < \frac{1}{q - 1}.$$

Then we can regard S_χ as a manifold with local coordinate system {θ^i}. We also
assume that the function ψ is strictly convex since we consider Hessian metrics on
S_χ later. A deformed exponential family has several different definitions. See [30]
and [34], for example.
For a deformed exponential probability density p(x; θ) ∈ S_χ, we define the escort
distribution P_χ(x; θ) of p(x; θ) by

$$P_\chi(x; \theta) := \frac{1}{Z_\chi(\theta)}\, \chi\{p(x; \theta)\},$$

where Z_χ(θ) is the normalization defined by

$$Z_\chi(\theta) := \int_\Omega \chi\{p(x; \theta)\}\,dx.$$

The χ-expectation E_{χ,p}[f] of f(x) with respect to P_χ(x; θ) is defined by

$$E_{\chi,p}[f] := \int_\Omega f(x) P_\chi(x; \theta)\,dx = \frac{1}{Z_\chi(\theta)} \int_\Omega f(x)\, \chi\{p(x; \theta)\}\,dx.$$

When χ is a power function χ(s) = s^q (q > 0, q ≠ 1), we denote the escort
distribution of p(x; θ) by P_q(x; θ), and the χ-expectation with respect to p(x; θ) by
E_{q,p}[·].
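
For a distribution on a finite set the escort distribution and the q-expectation reduce to finite sums. The sketch below assumes χ(s) = s^q and a probability vector given as a Python list; the function names are hypothetical and the integral over Ω is replaced by a sum.

```python
def escort(p, q):
    # Escort distribution P_q(x) = p(x)^q / Z_q with Z_q = sum_x p(x)^q.
    weights = [pi ** q for pi in p]
    z = sum(weights)
    return [w / z for w in weights], z

def q_expectation(f, p, q):
    # q-expectation E_{q,p}[f] = sum_x f(x) P_q(x).
    P, _ = escort(p, q)
    return sum(fi * Pi for fi, Pi in zip(f, P))

if __name__ == "__main__":
    p = [0.5, 0.3, 0.2]
    P, z = escort(p, 2.0)
    print(P, z)                                  # Z_q = 0.25 + 0.09 + 0.04
    print(q_expectation([0.0, 1.0, 2.0], p, 2.0))
```

For q > 1 the escort distribution puts more weight on the high-probability points than p does, and for 0 < q < 1 it flattens p.
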

Example 1 (discrete distributions [3]) The set of discrete distributions S_n is a
deformed exponential family for an arbitrary χ. Suppose that Ω is a finite set:
Ω = {x_0, x_1, ..., x_n}. Then the statistical model S_n is given by

$$S_n := \left\{ p(x; \eta) \;\middle|\; \eta_i > 0,\ p(x; \eta) = \sum_{i=0}^{n} \eta_i \delta_i(x),\ \sum_{i=0}^{n} \eta_i = 1 \right\},$$

where η_0 := 1 − Σ_{i=1}^{n} η_i and

$$\delta_i(x) := \begin{cases} 1 & (x = x_i), \\ 0 & (x \neq x_i). \end{cases}$$

Set θ^i = log_χ p(x_i) − log_χ p(x_0) = log_χ η_i − log_χ η_0, F_i(x) = δ_i(x) and ψ(θ) =
−log_χ η_0. Then the χ-logarithm of p(x) ∈ S_n is written as

$$\log_\chi p(x) = \sum_{i=1}^{n} \left( \log_\chi \eta_i - \log_\chi \eta_0 \right) \delta_i(x) + \log_\chi(\eta_0) = \sum_{i=1}^{n} \theta^i F_i(x) - \psi(\theta).$$

This implies that S_n is a deformed exponential family.

Example 2 (q-normal distributions [20]) A q-normal distribution is the probability
distribution defined by the following formula:

$$p_q(x; \mu, \sigma) := \frac{1}{Z_q(\sigma)} \left[ 1 - \frac{1-q}{3-q} \frac{(x - \mu)^2}{\sigma^2} \right]_{+}^{\frac{1}{1-q}},$$

where [∗]_+ := max{0, ∗}, {μ, σ} are parameters with −∞ < μ < ∞, 0 < σ < ∞, and
Z_q(σ) is the normalization defined by

$$Z_q(\sigma) := \begin{cases} \dfrac{\sqrt{3-q}}{\sqrt{1-q}}\, B\!\left( \dfrac{2-q}{1-q}, \dfrac{1}{2} \right) \sigma, & (-\infty < q < 1), \\[2ex] \dfrac{\sqrt{3-q}}{\sqrt{q-1}}\, B\!\left( \dfrac{3-q}{2(q-1)}, \dfrac{1}{2} \right) \sigma, & (1 \le q < 3). \end{cases}$$

Here, B(∗, ∗) is the beta function. We restrict ourselves to the case q ≥ 1.
Then the probability distribution p_q(x; μ, σ) has its support entirely on R and the set
of q-normal distributions S_q is a statistical model. Set

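$$\theta^1 := \frac{2}{3-q} \{Z_q(\sigma)\}^{q-1} \frac{\mu}{\sigma^2}, \qquad \theta^2 := -\frac{1}{3-q} \{Z_q(\sigma)\}^{q-1} \frac{1}{\sigma^2},$$

$$\psi(\theta) := -\frac{(\theta^1)^2}{4\theta^2} - \frac{\{Z_q(\sigma)\}^{q-1} - 1}{1-q},$$

then we have

$$\begin{aligned}
\log_q p_q(x; \theta) &= \frac{1}{1-q} \left( \{p_q(x; \theta)\}^{1-q} - 1 \right) \\
&= \frac{1}{1-q} \left\{ \frac{1}{\{Z_q(\sigma)\}^{1-q}} \left( 1 - \frac{1-q}{3-q} \frac{(x-\mu)^2}{\sigma^2} \right) - 1 \right\} \\
&= \frac{2\mu \{Z_q(\sigma)\}^{q-1}}{(3-q)\sigma^2}\, x - \frac{\{Z_q(\sigma)\}^{q-1}}{(3-q)\sigma^2}\, x^2 - \frac{\{Z_q(\sigma)\}^{q-1}}{3-q} \cdot \frac{\mu^2}{\sigma^2} + \frac{\{Z_q(\sigma)\}^{q-1} - 1}{1-q} \\
&= \theta^1 x + \theta^2 x^2 - \psi(\theta).
\end{aligned}$$

This implies that the set of q-normal distributions S_q is a q-exponential family.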


For a q-normal distribution p_q(x; μ, σ), the q-expectation μ_q and the q-variance σ_q^2 are
given by

$$\mu_q = E_{q,p}[x] = \mu, \qquad \sigma_q^2 = E_{q,p}\left[ (x - \mu)^2 \right] = \sigma^2.$$

We remark that a q-normal distribution is nothing but a three-parameter version of
Student's t-distribution when q ≥ 1. In fact, if q = 1, then the q-normal distribution
is the normal distribution. If q = 2, then the distribution is the Cauchy distribution.
We also remark that mathematical properties of q-normal distributions have been
obtained by several authors. See [29, 31], for example.

3.5 Geometry of Deformed Exponential Families Derived from the Standard Expectation

In this section, we consider geometry of deformed exponential families by generalizing
the e-representation with the deformed logarithm function. For more details,
see [21, 26].
Let S_χ be a deformed exponential family. We define an R^n-valued function
s^χ(x; θ) = ((s^χ)^1(x; θ), ..., (s^χ)^n(x; θ))^T by

$$(s^\chi)^i(x; \theta) := \frac{\partial}{\partial \theta^i} \log_\chi p(x; \theta), \qquad (i = 1, \dots, n). \tag{3.12}$$

We call s^χ(x; θ) the χ-score function of p(x; θ). Using the χ-score function, we
define a (0, 2)-tensor field g^M on S_χ by

$$g^M_{ij}(\theta) := \int_\Omega \partial_i p(x; \theta)\, \partial_j \log_\chi p(x; \theta)\,dx, \qquad \left( \partial_i = \frac{\partial}{\partial \theta^i} \right). \tag{3.13}$$

Lemma 1 The tensor field g^M on S_χ is positive semidefinite.

Proof From the definitions of g^M and log_χ, the tensor field g^M is written as

$$g^M_{ij}(\theta) = \int_\Omega \chi(p(x; \theta)) \left( F_i(x) - \partial_i \psi(\theta) \right) \left( F_j(x) - \partial_j \psi(\theta) \right) dx. \tag{3.14}$$

Since χ takes positive values, g^M is positive semidefinite. □


From now on, we assume that g^M is positive definite. Hence g^M is a Riemannian
metric on S_χ. This assumption is the same as in the case of the Fisher metric. The
Riemannian metric g^M is a generalization of the Fisher metric in terms of the
representation (3.8). We can consider other types of generalizations of the Fisher
metric as follows:

$$g^E_{ij}(\theta) := \int_\Omega \left( \partial_i \log_\chi p(x; \theta) \right) \left( \partial_j \log_\chi p(x; \theta) \right) P_\chi(x; \theta)\,dx = E_{\chi,p}[\partial_i l_\chi(\theta)\, \partial_j l_\chi(\theta)],$$

$$g^N_{ij}(\theta) := \int_\Omega \frac{1}{P_\chi(x; \theta)} \left( \partial_i p(x; \theta) \right) \left( \partial_j p(x; \theta) \right) dx,$$

where l_χ(θ) = log_χ p(x; θ). Obviously, g^E and g^N are generalizations of the Fisher
metric with respect to the representations (3.7) and (3.9), respectively.
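Proposition 4 Let S_χ be a deformed exponential family. Then the Riemannian metrics
g^E, g^M and g^N are mutually conformally equivalent. In particular, the following
formulas hold:

$$Z_\chi(\theta)\, g^E(\theta) = g^M(\theta) = \frac{1}{Z_\chi(\theta)}\, g^N(\theta),$$

where Z_χ(θ) is the normalization of the escort distribution P_χ(x; θ).

Proof For a deformed exponential family S_χ, the differentials of probability density
functions are given as follows:

$$\frac{\partial}{\partial \theta^i} p(x; \theta) = \chi(p(x; \theta)) \left( F_i(x) - \frac{\partial}{\partial \theta^i} \psi(\theta) \right),$$

$$\frac{\partial}{\partial \theta^i} \log_\chi p(x; \theta) = F_i(x) - \frac{\partial}{\partial \theta^i} \psi(\theta).$$

From the above formulas and the definitions of the Riemannian metrics g^E and g^N, we
have

$$g^E_{ij}(\theta) = \frac{1}{Z_\chi(\theta)} \int_\Omega \chi(p(x; \theta)) \left( F_i(x) - \partial_i \psi(\theta) \right) \left( F_j(x) - \partial_j \psi(\theta) \right) dx,$$

$$g^N_{ij}(\theta) = Z_\chi(\theta) \int_\Omega \chi(p(x; \theta)) \left( F_i(x) - \partial_i \psi(\theta) \right) \left( F_j(x) - \partial_j \psi(\theta) \right) dx.$$

These equations and Eq. (3.14) imply that the Riemannian metrics g^E, g^M and g^N are
mutually conformally equivalent. □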
Among the three possibilities of generalizations of the Fisher metric, g^M is especially
associated with a Hessian structure on S_χ, as we will see below. Although the
meaning of g^E is unknown, g^N gives a kind of Cramér-Rao lower bound in statistical
inferences. (See [22, 23].)
By differentiating Eq. (3.13), we can define mutually dual affine connections
∇^{M(e)} and ∇^{M(m)} on S_χ by

$$\Gamma^{M(e)}_{ij,k}(\theta) := \int_\Omega \partial_k p(x; \theta)\, \partial_i \partial_j \log_\chi p(x; \theta)\,dx,$$

$$\Gamma^{M(m)}_{ij,k}(\theta) := \int_\Omega \partial_i \partial_j p(x; \theta)\, \partial_k \log_\chi p(x; \theta)\,dx.$$

From the definitions of the deformed exponential family and the deformed logarithm
function, Γ^{M(e)}_{ij,k} vanishes identically. Hence the connection ∇^{M(e)} is flat,
and (∇^{M(e)}, g^M) is a Hessian structure on S_χ. Denote by C^M the cubic form of
(S_χ, ∇^{M(e)}, g^M), that is,

$$C^M_{ijk} = \Gamma^{M(m)}_{ij,k} - \Gamma^{M(e)}_{ij,k} = \Gamma^{M(m)}_{ij,k}.$$

For t > 0, set a function V_χ(t) by

$$V_\chi(t) := \int_1^t \log_\chi(s)\,ds.$$

We assume that V_χ(0) = lim_{t→+0} V_χ(t) is finite. Then the generalized entropy
functional I_χ and the generalized Massieu potential Ψ are defined by

$$I_\chi(p_\theta) := -\int_\Omega \left\{ V_\chi(p(x; \theta)) + (p(x; \theta) - 1) V_\chi(0) \right\} dx,$$

$$\Psi(\theta) := \int_\Omega p(x; \theta) \log_\chi p(x; \theta)\,dx + I_\chi(p_\theta) + \psi(\theta),$$

respectively, where ψ is the normalization of the deformed exponential family.


Theorem 2 (cf. [21, 26]) For a deformed exponential family S_χ, the following hold:
1. (S_χ, ∇^{M(e)}, g^M) and (S_χ, ∇^{M(m)}, g^M) are mutually dual Hessian manifolds, that
   is, (S_χ, g^M, ∇^{M(e)}, ∇^{M(m)}) is a dually flat space.
2. {θ^i} is a ∇^{M(e)}-affine coordinate system on S_χ.
3. Ψ(θ) is the potential of g^M and C^M with respect to {θ^i}, that is,

$$g^M_{ij}(\theta) = \partial_i \partial_j \Psi(\theta), \qquad C^M_{ijk}(\theta) = \partial_i \partial_j \partial_k \Psi(\theta).$$

4. Set the expectation of F_i(x) by η_i := E_p[F_i(x)]. Then {η_i} is a ∇^{M(m)}-affine
   coordinate system on S_χ and the dual of {θ^i} with respect to g^M.
5. Set Φ(η) := −I_χ(p_θ). Then Φ(η) is the potential of g^M with respect to {η_i}.

Let us construct a divergence function which induces the Hessian manifold
(S_χ, ∇^{M(e)}, g^M). We define the bias corrected χ-score function u^χ_p(x; θ) of p(x; θ)
by

$$(u^\chi_p)^i(x; \theta) := \frac{\partial}{\partial \theta^i} \log_\chi p(x; \theta) - E_p\left[ \frac{\partial}{\partial \theta^i} \log_\chi p(x; \theta) \right].$$

Set a function U_χ(s) by

$$U_\chi(s) := \int_0^s \exp_\chi(t)\,dt.$$

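Then we have

$$\begin{aligned}
V_\chi(s) &= s \log_\chi(s) - \int_1^s t \left( \frac{d}{dt} \log_\chi(t) \right) dt \\
&= s \log_\chi(s) - \int_0^{\log_\chi(s)} \exp_\chi(u)\,du \\
&= s \log_\chi(s) - U_\chi(\log_\chi(s)).
\end{aligned}$$

Since ∂/∂θ^i V_χ(p(x; θ)) = (∂/∂θ^i p(x; θ)) log_χ p(x; θ), we have

$$p(x; \theta) \left( \frac{\partial}{\partial \theta^i} \log_\chi p(x; \theta) \right) = \frac{\partial}{\partial \theta^i} U_\chi(\log_\chi p(x; \theta)).$$

Hence, by integrating the bias corrected χ-score function at r(x; θ) ∈ S_χ with respect
to θ, and by taking the standard expectation with respect to p(x; θ), we define a
χ-cross entropy of Bregman type by

$$d^M_\chi(p, r) = -\int_\Omega p(x) \log_\chi r(x)\,dx + \int_\Omega U_\chi(\log_\chi r(x))\,dx.$$

Then we obtain the χ-divergence (or U-divergence) by

$$D_\chi(p, r) = -d^M_\chi(p, p) + d^M_\chi(p, r) = \int_\Omega \left\{ U_\chi(\log_\chi r(x)) - U_\chi(\log_\chi p(x)) - p(x) \left( \log_\chi r(x) - \log_\chi p(x) \right) \right\} dx.$$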
In the q-exponential case, the bias corrected q-score function is given by

$$\begin{aligned}
u^i_q(x; \theta) &= \frac{\partial}{\partial \theta^i} \log_q p(x; \theta) - E_p\left[ \frac{\partial}{\partial \theta^i} \log_q p(x; \theta) \right] \\
&= \frac{\partial}{\partial \theta^i} \left\{ \frac{1}{1-q}\, p(x; \theta)^{1-q} - \frac{1}{2-q} \int_\Omega p(x; \theta)^{2-q}\,dx \right\} \\
&= p(x; \theta)^{1-q} s^i(x; \theta) - E_p[p(x; \theta)^{1-q} s^i(x; \theta)].
\end{aligned}$$

This score function is nothing but a weighted score function in robust statistics. The
χ-divergence constructed from the bias corrected q-score function coincides with
the β-divergence (β = 1 − q):

$$\begin{aligned}
D_{1-q}(p, r) &= -d_{1-q}(p, p) + d_{1-q}(p, r) \\
&= \frac{1}{(1-q)(2-q)} \int_\Omega p(x)^{2-q}\,dx - \frac{1}{1-q} \int_\Omega p(x)\, r(x)^{1-q}\,dx + \frac{1}{2-q} \int_\Omega r(x)^{2-q}\,dx.
\end{aligned}$$
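
The coincidence of the U-divergence with the β-divergence in the q-exponential case can be checked numerically on a finite sample space. The sketch below is illustrative only: it uses the closed form U_q(t) = ((1 + (1 − q)t)^{(2−q)/(1−q)} − 1)/(2 − q), which we obtained by integrating exp_q and which is valid for q ≠ 1, 2, and it replaces integrals by sums. All function names are hypothetical.

```python
def log_q(s, q):
    # q-logarithm for q != 1.
    return (s ** (1.0 - q) - 1.0) / (1.0 - q)

def U_q(t, q):
    # U_q(t) = integral_0^t exp_q(u) du in closed form (assumes q != 1 and q != 2).
    return ((1.0 + (1.0 - q) * t) ** ((2.0 - q) / (1.0 - q)) - 1.0) / (2.0 - q)

def u_divergence(p, r, q):
    # sum_x { U(log_q r) - U(log_q p) - p (log_q r - log_q p) }
    total = 0.0
    for pi, ri in zip(p, r):
        lp, lr = log_q(pi, q), log_q(ri, q)
        total += U_q(lr, q) - U_q(lp, q) - pi * (lr - lp)
    return total

def beta_divergence(p, r, q):
    # Closed form with beta = 1 - q, term by term as in the displayed formula.
    a = sum(pi ** (2.0 - q) for pi in p) / ((1.0 - q) * (2.0 - q))
    b = sum(pi * ri ** (1.0 - q) for pi, ri in zip(p, r)) / (1.0 - q)
    c = sum(ri ** (2.0 - q) for ri in r) / (2.0 - q)
    return a - b + c

if __name__ == "__main__":
    p, r = [0.5, 0.3, 0.2], [0.2, 0.5, 0.3]
    for q in (0.5, 1.5):
        print(q, u_divergence(p, r, q), beta_divergence(p, r, q))
```

Both functions should return the same non-negative number for each q, and zero when p = r.
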

3.6 Geometry of Deformed Exponential Families Derived from the χ-Expectation

Since S_χ is linearizable by the deformed logarithm function, we can naturally define
geometric structures from the potential function ψ.
A χ-Fisher metric g^χ and a χ-cubic form C^χ are defined by

$$g^\chi_{ij}(\theta) := \partial_i \partial_j \psi(\theta), \qquad C^\chi_{ijk}(\theta) := \partial_i \partial_j \partial_k \psi(\theta),$$

respectively [3]. In the q-exponential case, we denote the χ-Fisher metric by g^q, and
the χ-cubic form by C^q. We call g^q and C^q a q-Fisher metric and a q-cubic form,
respectively.
Let ∇^{χ(0)} be the Levi-Civita connection with respect to the χ-Fisher metric g^χ.
Then a χ-exponential connection ∇^{χ(e)} and a χ-mixture connection ∇^{χ(m)} are defined
by

$$g^\chi(\nabla^{\chi(e)}_X Y, Z) := g^\chi(\nabla^{\chi(0)}_X Y, Z) - \frac{1}{2} C^\chi(X, Y, Z),$$

$$g^\chi(\nabla^{\chi(m)}_X Y, Z) := g^\chi(\nabla^{\chi(0)}_X Y, Z) + \frac{1}{2} C^\chi(X, Y, Z),$$

respectively. The following theorem is known from [3].


Theorem 3 (cf. [3]) For a deformed exponential family S_χ, the following hold:
1. (S_χ, ∇^{χ(e)}, g^χ) and (S_χ, ∇^{χ(m)}, g^χ) are mutually dual Hessian manifolds, that
   is, (S_χ, g^χ, ∇^{χ(e)}, ∇^{χ(m)}) is a dually flat space.
2. {θ^i} is a ∇^{χ(e)}-affine coordinate system on S_χ.
3. ψ(θ) is the potential of g^χ and C^χ with respect to {θ^i}.
4. Set the χ-expectation of F_i(x) by η_i := E_{χ,p}[F_i(x)]. Then {η_i} is a ∇^{χ(m)}-affine
   coordinate system on S_χ and the dual of {θ^i} with respect to g^χ.
5. Set φ(η) := E_{χ,p}[log_χ p(x; θ)]. Then φ(η) is the potential of g^χ with respect to
   {η_i}.

Proof Statements 1, 2 and 3 are easily obtained from the definitions of the χ-Fisher
metric and the χ-cubic form. From Eq. (3.3) and η_i = E_{χ,p}[F_i(x)], Statements 4 and 5
follow from the fact that

$$E_{\chi,p}[\log_\chi p(x; \theta)] = E_{\chi,p}\left[ \sum_{i=1}^{n} \theta^i F_i(x) - \psi(\theta) \right] = \sum_{i=1}^{n} \theta^i \eta_i - \psi(\theta). \qquad \square$$

Suppose that s^χ(x; θ) is the χ-score function defined by (3.12). The χ-score is unbiased
with respect to the χ-expectation, that is, E_{χ,p}[(s^χ)^i(x; θ)] = 0. Hence we regard
s^χ(x; θ) as a generalization of unbiased estimating functions.
By integrating a χ-score function, we define the χ-cross entropy by

$$d^\chi(p, r) := -E_{\chi,p}[\log_\chi r(x)] = -\int_\Omega P_\chi(x) \log_\chi r(x)\,dx.$$

Then we obtain the generalized relative entropy D^χ(p, r) by

$$D^\chi(p, r) := -d^\chi(p, p) + d^\chi(p, r) = E_{\chi,p}[\log_\chi p(x) - \log_\chi r(x)]. \tag{3.15}$$

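The generalized relative entropy D^χ(p, r) coincides with the canonical divergence
D(r, p) for (S_χ, ∇^{χ(e)}, g^χ). In fact, from (3.15), we can check that

$$\begin{aligned}
D^\chi(p(\theta), p(\theta')) &= E_{\chi,p}\left[ \left( \sum_{i=1}^{n} \theta^i F_i(x) - \psi(\theta) \right) - \left( \sum_{i=1}^{n} (\theta')^i F_i(x) - \psi(\theta') \right) \right] \\
&= \psi(\theta') + \left( \sum_{i=1}^{n} \theta^i \eta_i - \psi(\theta) \right) - \sum_{i=1}^{n} (\theta')^i \eta_i = D(p(\theta'), p(\theta)).
\end{aligned}$$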

Let us consider the q-exponential case. We assume that a q-exponential family S_q
admits an invariant statistical manifold structure (S_q, ∇^(α), g^F).

Theorem 4 ([20]) For a q-exponential family S_q, the invariant statistical manifold
(S_q, ∇^(2q−1), g^F) and the Hessian manifold (S_q, ∇^{q(e)}, g^q) are 1-conformally
equivalent. In this case, the invariant statistical manifold (S_q, ∇^(2q−1), g^F) is
1-conformally flat.

Divergence functions for (S_q, ∇^{q(e)}, g^q) and (S_q, ∇^(2q−1), g^F) are given as follows.
The α-divergence D^(α)(p, r) with α = 1 − 2q is defined by

$$D^{(1-2q)}(p, r) := \frac{1}{q(1-q)} \left\{ 1 - \int_\Omega p(x)^q\, r(x)^{1-q}\,dx \right\}.$$

On the other hand, the normalized Tsallis relative entropy D^T_q(p, r) is defined by

$$D^T_q(p, r) := \int_\Omega P_q(x) \left( \log_q p(x) - \log_q r(x) \right) dx = E_{q,p}[\log_q p(x) - \log_q r(x)].$$

We remark that the invariant statistical manifold (S_q, ∇^(1−2q), g^F) is induced from
the α-divergence with α = 1 − 2q, and that the Hessian manifold (S_q, ∇^{q(e)}, g^q)
is induced from the dual of the normalized Tsallis relative entropy. In fact, for a
q-exponential family S_q, divergence functions have the following relations:

$$\begin{aligned}
D(r, p) = D^T_q(p, r) &= \int_\Omega \frac{p(x)^q}{Z_q(p)} \left( \log_q p(x) - \log_q r(x) \right) dx \\
&= \frac{1}{Z_q(p)} \int_\Omega \left( \frac{p(x) - p(x)^q}{1-q} - \frac{p(x)^q r(x)^{1-q} - p(x)^q}{1-q} \right) dx \\
&= \frac{1}{(1-q) Z_q(p)} \left\{ 1 - \int_\Omega p(x)^q\, r(x)^{1-q}\,dx \right\} \\
&= \frac{q}{Z_q(p)}\, D^{(1-2q)}(p, r),
\end{aligned}$$

where D is the canonical divergence of the Hessian manifold (S_q, ∇^{q(e)}, g^q).
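
The chain of equalities between the normalized Tsallis relative entropy and the α-divergence uses only ∫ p dx = 1, so it can be checked on any pair of discrete distributions. The following sketch does this with sums in place of integrals; the function names are hypothetical and the factor q/Z_q(p) is the one that appears in the derivation above.

```python
def log_q(s, q):
    # q-logarithm for q != 1.
    return (s ** (1.0 - q) - 1.0) / (1.0 - q)

def escort_normalizer(p, q):
    # Z_q(p) = sum_x p(x)^q
    return sum(pi ** q for pi in p)

def tsallis_relative_entropy(p, r, q):
    # Normalized Tsallis relative entropy: E_{q,p}[log_q p - log_q r].
    z = escort_normalizer(p, q)
    return sum((pi ** q / z) * (log_q(pi, q) - log_q(ri, q))
               for pi, ri in zip(p, r))

def alpha_divergence(p, r, q):
    # alpha-divergence with alpha = 1 - 2q: (1 - sum p^q r^(1-q)) / (q (1 - q)).
    overlap = sum(pi ** q * ri ** (1.0 - q) for pi, ri in zip(p, r))
    return (1.0 - overlap) / (q * (1.0 - q))

if __name__ == "__main__":
    p, r = [0.5, 0.3, 0.2], [0.2, 0.5, 0.3]
    q = 1.5
    lhs = tsallis_relative_entropy(p, r, q)
    rhs = q / escort_normalizer(p, q) * alpha_divergence(p, r, q)
    print(lhs, rhs)
```

The two printed values should coincide up to rounding, for q on either side of 1.
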



3.7 Maximum q-Likelihood Estimators

In this section, we generalize the maximum likelihood method from the viewpoint of
generalized independence. To avoid complicated arguments, we restrict ourselves to
the q-exponential case. However, the method can be generalized to the χ-exponential
case (cf. [8, 9]).
Let X and Y be random variables which follow probability distributions p_1(x) and
p_2(y), respectively. We say that two random variables X and Y are independent if the
joint probability p(x, y) is decomposed into a product of the marginal distributions
p_1(x) and p_2(y):

$$p(x, y) = p_1(x)\, p_2(y).$$

When p_1(x) > 0 and p_2(y) > 0, the independence can be written with an exponential
function and a logarithm function as

$$p(x, y) = \exp\left[ \log p_1(x) + \log p_2(y) \right].$$

We generalize the notion of independence using the q-exponential and q-logarithm.
Suppose that x > 0, y > 0 and x^{1−q} + y^{1−q} − 1 > 0 (q > 0). We say that x ⊗_q y is
a q-product [6] of x and y if

$$x \otimes_q y := \left[ x^{1-q} + y^{1-q} - 1 \right]^{\frac{1}{1-q}} = \exp_q\left[ \log_q x + \log_q y \right].$$

In this case, the following law of exponents holds:

$$\exp_q x \otimes_q \exp_q y = \exp_q(x + y),$$

in other words,

$$\log_q(x \otimes_q y) = \log_q x + \log_q y.$$
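
The law of exponents for the q-product is straightforward to verify numerically. The sketch below implements the q-product from its definition; the function names and the q = 1 branch (ordinary product) are our own conventions, and the q-logarithm is included so the block is self-contained.

```python
import math

def log_q(s, q):
    # q-logarithm; ordinary log when q == 1.
    if q == 1.0:
        return math.log(s)
    return (s ** (1.0 - q) - 1.0) / (1.0 - q)

def q_product(x, y, q):
    # x (x)_q y = [x^(1-q) + y^(1-q) - 1]^(1/(1-q)); ordinary product when q == 1.
    if q == 1.0:
        return x * y
    base = x ** (1.0 - q) + y ** (1.0 - q) - 1.0
    if base <= 0.0:
        raise ValueError("q-product requires x^(1-q) + y^(1-q) - 1 > 0")
    return base ** (1.0 / (1.0 - q))

if __name__ == "__main__":
    x, y, q = 0.6, 0.8, 0.5
    print(log_q(q_product(x, y, q), q), log_q(x, q) + log_q(y, q))
    print(q_product(x, y, 1.0 - 1e-6), x * y)    # close to the usual product
```

The explicit domain check reflects the assumption x^{1−q} + y^{1−q} − 1 > 0 made in the definition; for q > 1 and small x, y this condition can fail.
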

Let X_i be a random variable on 𝒳_i which follows p_i(x) (i = 1, 2, ..., N). We say that
X_1, X_2, ..., X_N are q-independent with m-normalization (mixture normalization) if

$$p(x_1, x_2, \dots, x_N) = \frac{p_1(x_1) \otimes_q p_2(x_2) \otimes_q \cdots \otimes_q p_N(x_N)}{Z_{p_1, p_2, \dots, p_N}},$$

where p(x_1, x_2, ..., x_N) is the joint probability density of X_1, X_2, ..., X_N and
Z_{p_1, p_2, ..., p_N} is the normalization of p_1(x_1) ⊗_q p_2(x_2) ⊗_q ··· ⊗_q p_N(x_N) defined by

$$Z_{p_1, p_2, \dots, p_N} := \int \cdots \int_{\mathcal{X}_1 \cdots \mathcal{X}_N} p_1(x_1) \otimes_q p_2(x_2) \otimes_q \cdots \otimes_q p_N(x_N)\,dx_1 \cdots dx_N.$$

Let S_q = {p(x; ξ) | ξ ∈ Ξ} be a q-exponential family, and let {x_1, ..., x_N} be
N observations from p(x; ξ) ∈ S_q. We define a q-likelihood function L_q(ξ) by

$$L_q(\xi) = p(x_1; \xi) \otimes_q p(x_2; \xi) \otimes_q \cdots \otimes_q p(x_N; \xi).$$

Equivalently, a q-log-likelihood function is given by

$$\log_q L_q(\xi) = \sum_{i=1}^{N} \log_q p(x_i; \xi).$$

In the case q → 1, L_q is the standard likelihood function on Ξ.
The maximum q-likelihood estimator ξ̂ is the maximizer of the q-likelihood function,
which is defined by

$$\hat{\xi} := \operatorname*{argmax}_{\xi \in \Xi} L_q(\xi) \quad \left( = \operatorname*{argmax}_{\xi \in \Xi} \log_q L_q(\xi) \right).$$

Let us consider geometry of maximum q-likelihood estimators. Let S_q be a
q-exponential family. Suppose that {x_1, ..., x_N} are N observations generated from
p(x; θ) ∈ S_q.
The q-log-likelihood function is calculated as

$$\log_q L_q(\theta) = \sum_{j=1}^{N} \log_q p(x_j; \theta) = \sum_{j=1}^{N} \left\{ \sum_{i=1}^{n} \theta^i F_i(x_j) - \psi(\theta) \right\} = \sum_{i=1}^{n} \theta^i \sum_{j=1}^{N} F_i(x_j) - N\psi(\theta).$$

The q-log-likelihood equation is

$$\partial_i \log_q L_q(\theta) = \sum_{j=1}^{N} F_i(x_j) - N \partial_i \psi(\theta) = 0.$$

Thus, the maximum q-likelihood estimator for η is given by

$$\hat{\eta}_i = \frac{1}{N} \sum_{j=1}^{N} F_i(x_j).$$
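
For the discrete family of Example 1 this estimator can be made explicit. The sketch below rests on our own reading, combining Example 1 (F_i = δ_i) with the definition η_i = E_{q,p}[F_i] of Theorem 3: η̂_i is then the empirical frequency of x_i, so the escort distribution of the estimate equals the empirical distribution and the estimate itself is proportional to the frequencies raised to the power 1/q. The function names are hypothetical, and all counts are assumed positive. A brute-force grid search over two-point distributions is included as a sanity check.

```python
def log_q(s, q):
    # q-logarithm for q != 1.
    return (s ** (1.0 - q) - 1.0) / (1.0 - q)

def q_log_likelihood(p, counts, q):
    # sum_j log_q p(x_j), grouped by how often each point was observed.
    return sum(c * log_q(pi, q) for c, pi in zip(counts, p))

def q_mle_discrete(counts, q):
    # Assumed closed form: escort of the estimate = empirical frequencies,
    # hence p_hat(x_i) is proportional to (count_i / N)^(1/q).
    total = float(sum(counts))
    weights = [(c / total) ** (1.0 / q) for c in counts]
    z = sum(weights)
    return [w / z for w in weights]

def grid_search_two_point(counts, q, steps=1000):
    # Brute-force maximizer of the q-log-likelihood over p = (p0, 1 - p0).
    best_p0, best_val = None, None
    for k in range(1, steps):
        p0 = k / steps
        val = q_log_likelihood([p0, 1.0 - p0], counts, q)
        if best_val is None or val > best_val:
            best_p0, best_val = p0, val
    return best_p0

if __name__ == "__main__":
    counts, q = [7, 3], 1.5
    print(q_mle_discrete(counts, q), grid_search_two_point(counts, q))
```

When q = 1 the closed form reduces to the ordinary maximum likelihood estimate, the empirical frequencies themselves.
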

On the other hand, the canonical divergence for (S_q, ∇^{q(e)}, g^q) can be calculated
as

$$D^T_q(p(\hat{\eta}), p(\theta)) = D(p(\theta), p(\hat{\eta})) = \psi(\theta) + \phi(\hat{\eta}) - \sum_{i=1}^{n} \theta^i \hat{\eta}_i = \phi(\hat{\eta}) - \frac{1}{N} \log_q L_q(\theta).$$

This implies that the q-likelihood attains the maximum if and only if the normalized
Tsallis relative entropy attains the minimum.
Let M be a curved q-exponential family in S_q, that is, M is a submanifold of S_q and
is a statistical model itself. Suppose that {x_1, ..., x_N} are N observations generated
from p(x; u) = p(x; θ(u)) ∈ M. The above arguments imply that the maximum
q-likelihood estimator for M is given by the orthogonal projection of data with respect
to the normalized Tsallis relative entropy.
We remark that the maximum q-likelihood estimator can be generalized by
U-geometry. (See [8, 9] by Fujimoto and Murata.) However, their approach and
ours are slightly different. They applied the χ-divergence (U-divergence) projection
for parameter estimation, whereas we applied the generalized relative entropy. As
we discussed in this paper, the Hessian structures induced from those divergences
are different.

3.8 Conclusion

In this paper, we considered two Hessian structures from the viewpoints of the standard
expectation and the χ-expectation. Although the former and the latter are known as
U-geometry ([21, 26]) and χ-geometry ([3]), respectively, a comparison shows that they
are different Hessian structures on the same deformed exponential family.
We note that, from the viewpoint of estimating functions, the former is the geometry
of bias-corrected χ-score functions with the standard expectation, whereas the latter
is the geometry of unbiased χ-score functions with the χ-expectation.
As an application to statistics, we considered a generalization of the maximum
likelihood method for q-exponential families. We used the normalized Tsallis relative
entropy for orthogonal projection, whereas the previous results used χ-divergences
of Bregman type.

Acknowledgments The authors would like to express their sincere gratitude to the anonymous
reviewers for constructive comments for preparation of this paper. The first named author is partially
supported by JSPS KAKENHI Grant Number 23740047.

References

1. Amari, S., Nagaoka, H.: Method of Information Geometry. American Mathematical Society,
Providence, Oxford University Press, Oxford (2000)
2. Amari, S., Ohara, A.: Geometry of q-exponential family of probability distributions. Entropy
13, 1170–1185 (2011)
3. Amari, S., Ohara, A., Matsuzoe, H.: Geometry of deformed exponential families: invariant,
dually-flat and conformal geometry. Phys. A. 391, 4308–4319 (2012)
4. Barndorff-Nielsen, O.E., Jupp, P.E.: Statistics, yokes and symplectic geometry. Ann. Facul.
Sci. Toulouse 6, 389–427 (1997)
5. Basu, A., Harris, I.R., Hjort, N.L., Jones, M.C.: Robust and efficient estimation by minimising
a density power divergence. Biometrika 85, 549–559 (1998)
6. Borges, E.P.: A possible deformed algebra and calculus inspired in nonextensive
thermostatistics. Phys. A 340, 95–101 (2004)
7. Eguchi, S.: Geometry of minimum contrast. Hiroshima Math. J. 22, 631–647 (1992)
8. Fujimoto, Y., Murata, N.: A generalization of independence in naive bayes model. Lect. Notes
Comp. Sci. 6283, 153–161 (2010)
9. Fujimoto Y., Murata N.: A generalisation of independence in statistical models for categorical
distribution. Int. J. Data Min. Model. Manage. 2(4), 172–187 (2012)
10. Ivanov, S.: On dual-projectively flat affine connections. J. Geom. 53, 89–99 (1995)
11. Kumon, M., Takemura, A., Takeuchi, K.: Conformal geometry of statistical manifold with
application to sequential estimation. Sequential Anal. 30, 308–337 (2011)
12. Kurose, T.: On the divergences of 1-conformally flat statistical manifolds. Tôhoku Math. J. 46,
427–433 (1994)
13. Kurose, T.: Conformal-projective geometry of statistical manifolds. Interdiscip. Inform. Sci.
8, 89–100 (2002)
14. Lauritzen, S. L.: Statistical Manifolds, Differential Geometry in Statistical Inferences, IMS
Lecture Notes Monograph Series, vol. 10, pp. 96–163. Hayward, California (1987)
15. Matsuzoe, H.: Geometry of contrast functions and conformal geometry. Hiroshima Math. J.
29, 175–191 (1999)
16. Matsuzoe, H.: Geometry of statistical manifolds and its generalization. In: Proceedings of the
8th International Workshop on Complex Structures and Vector Fields, pp. 244–251. World
Scientific, Singapore (2007)
17. Matsuzoe, H.: Computational geometry from the viewpoint of affine differential geometry.
Lect. Notes Comp. Sci. 5416, 103–113 (2009)
18. Matsuzoe, H.: Statistical manifolds and geometry of estimating functions, pp. 187–202. Recent
Progress in Differential Geometry and Its Related Fields World Scientific, Singapore (2013)
19. Matsuzoe, H., Henmi, M.: Hessian structures on deformed exponential families. Lect. Notes
Comp. Sci. 8085, 275–282 (2013)
20. Matsuzoe, H., Ohara, A.: Geometry for q-exponential families. In: Recent progress in differ-
ential geometry and its related fields, pp. 55–71. World Scientific, Singapore (2011)
21. Murata, N., Takenouchi, T., Kanamori, T., Eguchi, S.: Information geometry of u-boost and
bregman divergence. Neural Comput. 16, 1437–1481 (2004)
22. Naudts, J.: Estimators, escort probabilities, and φ-exponential families in statistical physics. J.
Ineq. Pure Appl. Math. 5, 102 (2004)
23. Naudts, J.: Generalised Thermostatistics, Springer, New York (2011)
24. Ohara, A.: Geometric study for the legendre duality of generalized entropies and its application
to the porous medium equation. Euro. Phys. J. B. 70, 15–28 (2009)
25. Ohara, A., Matsuzoe H., Amari S.: Conformal geometry of escort probability and its applica-
tions. Mod. Phys. Lett. B. 10, 26:1250063 (2012)
26. Ohara A., Wada, T.: Information geometry of q-Gaussian densities and behaviors of solutions
to related diffusion equations. J. Phys. A: Math. Theor. 43, 035002 (2010)
27. Okamoto, I., Amari, S., Takeuchi, K.: Asymptotic theory of sequential estimation procedures
for curved exponential families. Ann. Stat. 19, 961–961 (1991)

28. Shima, H.: The Geometry of Hessian Structures, World Scientific, Singapore (2007)
29. Suyari, H., Tsukada, M.: Law of error in tsallis statistics. IEEE Trans. Inform. Theory 51,
753–757 (2005)
30. Takatsu, A.: Behaviors of ϕ-exponential distributions in wasserstein geometry and an evolution
equation. SIAM J. Math. Anal. 45, 2546–2546 (2013)
31. Tanaka, M.: Meaning of an escort distribution and τ -transformation. J. Phys.: Conf. Ser. 201,
012007 (2010)
32. Tsallis, C.: Possible generalization of boltzmann—gibbs statistics. J. Stat. Phys. 52, 479–487
(1988)
33. Tsallis, C.: Introduction to Nonextensive Statistical Mechanics: Approaching a Complex World.
Springer, New York (2009)
34. Vigelis, R.F., Cavalcante, C.C.: On φ-families of probability distributions. J. Theor. Probab.
21, 1–25 (2011)
Chapter 4
Harmonic Maps Relative to α-Connections

Keiko Uohashi

Abstract In this paper, we study harmonic maps relative to α-connections, but not
necessarily relative to Levi-Civita connections, on Hessian domains. For this purpose,
we review the standard harmonic map and affine harmonic maps, and describe the
conditions for harmonicity of maps between level surfaces of a Hessian domain in
terms of the parameter α and the dimension n. To illustrate the theory, we describe
harmonic maps between the level surfaces of convex cones.

4.1 Introduction

Harmonic maps are important objects in certain branches of geometry and physics.
Geodesics on Riemannian manifolds and holomorphic maps between Kähler manifolds
are typical examples of harmonic maps. In addition, a harmonic map has a
variational characterization by the energy of smooth maps between Riemannian
manifolds, and several existence theorems for harmonic maps are already known. On
the other hand, the notion of a Hermitian harmonic map from a Hermitian manifold
to a Riemannian manifold was introduced and investigated in [4, 8, 10]. Such a map is not
necessarily a harmonic map if the domain Hermitian manifold is non-Kähler. Similar
results have been pointed out for affine harmonic maps, which are analogous to Hermitian
harmonic maps [7].
Statistical manifolds have mainly been studied in terms of their affine geometry,
information geometry, and statistical mechanics [1]. For example, Shima established
conditions for harmonicity of gradient mappings of level surfaces on a Hessian
domain, which is a typical example of a dually flat statistical manifold [14]. Level
surfaces on a Hessian domain are known as 1- and (−1)-conformally flat statistical

K. Uohashi (B)
Department of Mechanical Engineering and Intelligent Systems, Faculty of Engineering,
Tohoku Gakuin University, 1-13-1 Chuo, Tagajo, Miyagi 985-8537, Japan
e-mail: [email protected]

F. Nielsen (ed.), Geometric Theory of Information, 81


Signals and Communication Technology, DOI: 10.1007/978-3-319-05317-2_4,
© Springer International Publishing Switzerland 2014

manifolds for primal and dual connections, respectively [17, 19]. The gradient
mappings are then considered to be harmonic maps relative to the dual connection,
i.e., the (−1)-connection [13].
In this paper, we review the notions of harmonic maps, affine harmonic maps
and α-affine harmonic maps, and investigate different kinds of harmonic maps relative
to α-connections. In Sect. 4.2, we give definitions of an affine harmonic map, a
harmonic map and the standard Laplacian. In Sect. 4.3, we explain the generalized
Laplacian, which defines a harmonic map relative to an affine connection. In Sect. 4.4,
we present the Laplacian of a gradient mapping on a Hessian domain as an example
of the generalized Laplacian. Moreover, we compare the harmonic map defined by
Shima with the affine harmonic map defined in Sect. 4.2. In Sect. 4.5, α-connections
of statistical manifolds are explained. In Sect. 4.6, we define α-affine harmonic maps,
which generalize both affine harmonic maps and the harmonic maps defined by
Shima. In Sect. 4.7, we describe the α-conformal equivalence of statistical manifolds
and a harmonic map relative to two α-connections. In Sect. 4.8, we review the
α-conformal equivalence of level surfaces of a Hessian domain. In Sect. 4.9, we study
harmonic maps of level surfaces relative to two α-connections as examples of the
harmonic maps in Sect. 4.7, and provide examples on level surfaces of regular
convex cones.
Shima [13] investigated harmonic maps of n-dimensional level surfaces into
an (n + 1)-dimensional dual affine space, rather than onto other level surfaces.
Although Nomizu and Sasaki calculated the Laplacian of centro-affine immersions
into an affine space, which generate projectively flat statistical manifolds
(i.e., (−1)-conformally flat statistical manifolds), they did not discuss harmonic
maps between two centro-affine hypersurfaces [12]. We therefore study harmonic maps
between hypersurfaces of the same dimension relative to general α-connections
that need not satisfy α = −1 or 0 (where the 0-connection is the Levi-Civita
connection). In particular, we demonstrate the existence of non-trivial harmonic maps
between level surfaces of a Hessian domain under a condition on the parameter α
and the dimension n.

4.2 Affine Harmonic Maps and Harmonic Maps

First, we recall definitions of an affine harmonic map and a harmonic map.


Let $M$ be an $m$-dimensional affine manifold and $\{x^1, \dots, x^m\}$ a local affine coordinate
system of $M$. If there exists a symmetric tensor field of degree 2

\[
g = g_{ij}\, dx^i\, dx^j
\]

on $M$ satisfying locally

\[
g_{ij} = \frac{\partial^2 \phi}{\partial x^i \partial x^j} \tag{4.1}
\]

for a convex function $\phi$, then $M$ is said to be a Kähler affine manifold [2, 7]. The matrix $[g_{ij}]$ is
positive definite and defines a Riemannian metric. Then for the Kähler affine manifold
$M$, $(M, D, g)$ is a Hessian manifold, where $D$ is the canonical flat affine connection for
$\{x^1, \dots, x^m\}$. Details of Hessian manifolds and Hessian domains are given in
later sections of this paper.
The Kähler affine structure (4.1) defines an affinely invariant operator $L$ by

\[
L = \sum_{i,j=1}^{m} g^{ij} \frac{\partial^2}{\partial x^i \partial x^j}. \tag{4.2}
\]

A smooth function $f : M \to \mathbb{R}$ is said to be affine harmonic if

\[
Lf = 0.
\]

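For a Kähler affine manifold $(M, g)$ and a Riemannian manifold $(N, h)$, a smooth
map $\varphi : M \to N$ is said to be affine harmonic if

\[
\sum_{i,j=1}^{m} g^{ij} \left( \frac{\partial^2 \varphi^\gamma}{\partial x^i \partial x^j}
+ \sum_{\delta,\beta=1}^{n} \hat{\Gamma}^{\gamma}_{\delta\beta}
\frac{\partial \varphi^\delta}{\partial x^i} \frac{\partial \varphi^\beta}{\partial x^j} \right) = 0,
\qquad \gamma = 1, \dots, n, \tag{4.3}
\]

where $\hat{\Gamma}$ is the Christoffel symbol of the Levi-Civita connection for a Riemannian
metric $h$, and $n = \dim N$.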
Let us compare an affine harmonic map with a harmonic map. For this purpose, we
first give the definition of a harmonic function. For a Riemannian manifold $(M, g)$,
a smooth function $f : M \to \mathbb{R}$ is said to be a harmonic function if

\[
\Delta f = 0,
\]

where $\Delta$ is the standard Laplacian, i.e.,

\begin{align*}
\Delta f = \operatorname{div} \operatorname{grad} f
&= \frac{1}{\sqrt{g}} \sum_{i,j=1}^{m} \frac{\partial}{\partial x^i}
\left( \sqrt{g}\, g^{ij} \frac{\partial f}{\partial x^j} \right) \tag{4.4} \\
&= \sum_{i,j=1}^{m} g^{ij} \left\{ \frac{\partial^2 f}{\partial x^i \partial x^j}
- \sum_{k=1}^{m} \Gamma^{k}_{ij} \frac{\partial f}{\partial x^k} \right\} \tag{4.5} \\
&= \sum_{i=1}^{m} \left\{ e_i(e_i f) - (\nabla^{LC}_{e_i} e_i) f \right\},
\end{align*}

$g = \det[g_{ij}]$, $\{e_1, \dots, e_m\}$ is a local orthonormal frame on a neighborhood of $x \in M$,
and $\nabla^{LC}$, $\Gamma$ are the Levi-Civita connection and the Christoffel symbol of $\nabla^{LC}$,
respectively. Note that the sign of definition (4.4) is opposite to the sign of the
Laplacian in [3, 21].
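For Riemannian manifolds $(M, g)$, $(N, h)$, a smooth map $\varphi : M \to N$ is said to
be a harmonic map if

\[
\tau(\varphi) \equiv 0 \quad \text{(the Euler-Lagrange equation)},
\]

where $\tau(\varphi) \in \Gamma(\varphi^{-1} TN)$ is the standard tension field of $\varphi$ defined by

\begin{align*}
\tau(\varphi)(x) &= \sum_{i=1}^{m} \left( \tilde{\nabla}^{LC}_{e_i} \varphi_* e_i
- \varphi_* \nabla^{LC}_{e_i} e_i \right)(x), \qquad x \in M, \tag{4.6} \\
\tilde{\nabla}^{LC}_{e_i} \varphi_* e_i &= \hat{\nabla}^{LC}_{\varphi_* e_i} \varphi_* e_i
\quad \text{(the pull-back connection)},
\end{align*}

and $\nabla^{LC}$, $\hat{\nabla}^{LC}$ are the Levi-Civita connections for $g$, $h$, respectively. For local
coordinate systems $\{x^1, \dots, x^m\}$ and $\{y^1, \dots, y^n\}$ on $M$ and $N$, the $\gamma$-th component of
$\tau(\varphi)$ at $x \in M$ is described by

\begin{align*}
\tau(\varphi)^\gamma(x) &= \sum_{i,j=1}^{m} g^{ij} \left\{
\frac{\partial^2 \varphi^\gamma}{\partial x^i \partial x^j}
- \sum_{k=1}^{m} \Gamma^{k}_{ij}(x) \frac{\partial \varphi^\gamma}{\partial x^k}
+ \sum_{\delta,\beta=1}^{n} \hat{\Gamma}^{\gamma}_{\delta\beta}(\varphi(x))
\frac{\partial \varphi^\delta}{\partial x^i} \frac{\partial \varphi^\beta}{\partial x^j} \right\} \tag{4.7} \\
&= \Delta \varphi^\gamma + \sum_{i,j=1}^{m} \sum_{\delta,\beta=1}^{n} g^{ij}\,
\hat{\Gamma}^{\gamma}_{\delta\beta}(\varphi(x))
\frac{\partial \varphi^\delta}{\partial x^i} \frac{\partial \varphi^\beta}{\partial x^j},
\end{align*}

\[
\varphi^\delta = y^\delta \circ \varphi, \qquad \gamma = 1, \dots, n,
\]

where

\[
\tau(\varphi)(x) = \sum_{\gamma=1}^{n} \tau(\varphi)^\gamma(x) \frac{\partial}{\partial y^\gamma},
\]

and $\Gamma^{k}_{ij}$, $\hat{\Gamma}^{\gamma}_{\delta\beta}$ are the Christoffel symbols of $\nabla^{LC}$, $\hat{\nabla}^{LC}$, respectively. The original
definition of a harmonic map is described in [3, 21], and so on.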
Remark 1 Expression (4.5) is not equal to definition (4.2). Hence an affine harmonic
function is not necessarily a harmonic function.

Remark 2 Expression (4.7) is not equal to definition (4.3). Hence an affine harmonic
map is not necessarily a harmonic map.
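
To make Remark 1 concrete, the following sketch compares the operator L of (4.2) with the Laplacian Δ of (4.5) in one dimension. The potential ϕ(x) = −log x on x > 0 (so that g = 1/x²), the finite-difference helpers, and all function names are assumptions chosen for illustration; none of them appear in the chapter.

```python
import math

# Illustrative assumption: M = (0, inf) with potential phi(x) = -log x,
# so g_11 = phi'' = 1/x^2, g^11 = x^2 and Gamma^1_11 = -1/x.

def d1(f, x, h=1e-3):
    """Central-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

def d2(f, x, h=1e-3):
    """Central-difference approximation of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)

def affine_L(f, x):
    """L f = g^{11} f'' as in (4.2)."""
    return x * x * d2(f, x)

def laplacian(f, x):
    """Delta f = g^{11} (f'' - Gamma^1_11 f') as in (4.5)."""
    christoffel = -1.0 / x
    return x * x * (d2(f, x) - christoffel * d1(f, x))

def identity(x):
    return x

x0 = 2.0
# f(x) = x: L f vanishes while Delta f = x does not.
print(affine_L(identity, x0), laplacian(identity, x0))
# f(x) = log x: Delta f vanishes while L f = -1 does not.
print(affine_L(math.log, x0), laplacian(math.log, x0))
```

In this example f(x) = x is affine harmonic without being harmonic, and f(x) = log x is harmonic without being affine harmonic; the two operators differ exactly by the Christoffel term of (4.5).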

4.3 Affine Harmonic Maps and Generalized Laplacians

In Sect. 4.2, the Laplacian is defined for a function on a Riemannian manifold. In
this section, we treat Laplacians for maps between Riemannian manifolds.
For Riemannian manifolds $(M, g)$ and $(N, h)$, a tension field of a smooth map
$\varphi : M \to N$ is defined by

\begin{align*}
\tau(\varphi) &= \sum_{i=1}^{m} \left( \hat{\nabla}_{e_i}(\varphi_* e_i)
- \varphi_*(\nabla^{LC}_{e_i} e_i) \right) \in \Gamma(\varphi^{-1} TN) \tag{4.8} \\
&= \sum_{i,j=1}^{m} g^{ij} \left\{ \hat{\nabla}_{\frac{\partial}{\partial x^i}}
\left( \varphi_* \frac{\partial}{\partial x^j} \right)
- \varphi_* \left( \nabla^{LC}_{\frac{\partial}{\partial x^i}} \frac{\partial}{\partial x^j} \right) \right\},
\end{align*}

where $\{e_1, \dots, e_m\}$ is a local orthonormal frame for $g$, $\{x^1, \dots, x^m\}$ is a local
coordinate system on $M$, $\nabla^{LC}$ is the Levi-Civita connection of $g$, and $\hat{\nabla}$ is a torsion-free
affine connection on $N$ [12]. The affine connection $\hat{\nabla}$ does not need to be the
Levi-Civita connection. We also denote by $\hat{\nabla}$ the pull-back connection of $\hat{\nabla}$ to $M$. Then
$\varphi$ is said to be a harmonic map relative to $(g, \hat{\nabla})$ if

\[
\tau(\varphi) = \sum_{i=1}^{m} \left( \hat{\nabla}_{e_i}(\varphi_* e_i)
- \varphi_*(\nabla^{LC}_{e_i} e_i) \right) \equiv 0.
\]

If the Riemannian manifold $N$ is a finite-dimensional real vector space $V$, the
tension field $\tau(\varphi)$ is said to be a Laplacian of a map $\varphi : M \to V$. The notation $\Delta$
for the standard Laplacian is then often used for the Laplacian of a map, as follows:

\[
\Delta \varphi = \Delta_{(g, \hat{\nabla})} \varphi = \tau(\varphi) : M \to V. \tag{4.9}
\]

For $V = \mathbb{R}$, $\Delta \varphi$ defined by Eqs. (4.8) and (4.9) coincides with the standard Laplacian
for a function defined by (4.4).
See [12] for an affine immersion and the Laplacian of a map, and see [13,
14] for the gradient mapping and the Laplacian on a Hessian domain.

4.4 Gradient Mappings and Affine Harmonic Maps

In this section, we investigate the Laplacian of a gradient mapping from the viewpoint
of the geometry of affine harmonic maps.
Let $D$ be the canonical flat affine connection on an $(n + 1)$-dimensional real affine
space $A^{n+1}$ and let $\{x^1, \dots, x^{n+1}\}$ be the canonical affine coordinate system on $A^{n+1}$,
i.e., $D\, dx^i = 0$. If the Hessian $Dd\phi = \sum_{i,j=1}^{n+1} (\partial^2 \phi / \partial x^i \partial x^j)\, dx^i dx^j$ of a function $\phi$ is
non-degenerate on a domain $\Omega$ in $A^{n+1}$, then $(\Omega, D, g = Dd\phi)$ is a Hessian domain
[14].
For the dual affine space $A^*_{n+1}$ and the dual affine coordinate system $\{x^*_1, \dots, x^*_{n+1}\}$
of $A^{n+1}$, the gradient mapping $\iota$ from a Hessian domain $(\Omega, D, g = Dd\phi)$ into
$(A^*_{n+1}, D^*)$ is defined by

\[
x^*_i \circ \iota = - \frac{\partial \phi}{\partial x^i}.
\]

The dually flat affine connection $D'$ on $\Omega$ is given by

\[
\iota_*(D'_X Y) = D^*_X \iota_*(Y) \quad \text{for } X, Y \in \Gamma(T\Omega), \tag{4.10}
\]

where $D^*_X \iota_*(Y)$ denotes the covariant derivative along $\iota$ induced by the canonical flat
affine connection $D^*$ on $A^*_{n+1}$.
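The Laplacian of $\iota$ with respect to $(g, D^*)$ is given by

\begin{align*}
\Delta_{(g, D^*)} \iota &= \sum_{i,j} g^{ij} \left\{ D^*_{\frac{\partial}{\partial x^i}}
\left( \iota_* \frac{\partial}{\partial x^j} \right)
- \iota_* \left( \nabla^{LC}_{\frac{\partial}{\partial x^i}} \frac{\partial}{\partial x^j} \right) \right\} \\
&= \iota_* \left\{ \sum_{i,j} g^{ij} (D' - \nabla^{LC})_{\frac{\partial}{\partial x^i}}
\frac{\partial}{\partial x^j} \right\} \tag{4.11} \\
&= \iota_* \left\{ \sum_{i,j} g^{ij} (\nabla^{LC} - D)_{\frac{\partial}{\partial x^i}}
\frac{\partial}{\partial x^j} \right\} \tag{4.12} \\
&= \iota_* \left( \sum_{i} (\alpha_K)^i \frac{\partial}{\partial x^i} \right),
\end{align*}

where $\nabla^{LC}$ is the Levi-Civita connection for $g$ and $\Gamma$ is the Christoffel symbol of
$\nabla^{LC}$, and where $\alpha_K$ is the Koszul form, i.e.,

\[
\alpha_K = d \log \left| \det[g_{ij}] \right|^{\frac{1}{2}}, \qquad
(\alpha_K)_i = \sum_{r} \Gamma^{r}_{ri}, \qquad
(\alpha_K)^i = \sum_{j} g^{ij} (\alpha_K)_j
\]

([13], p. 93 in [14]). The step from (4.11) to (4.12) follows from

\[
\frac{D + D'}{2} = \nabla^{LC}.
\]

Details of dual affine connections are described in later sections.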
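
The two descriptions of the Koszul form above, as the differential of log |det[g_ij]|^{1/2} and as the contraction of the Christoffel symbols, can be compared numerically. The sketch below does this for a two-dimensional Hessian metric; the potential ϕ(x, y) = exp(x + y) + x² + y² and every function name are assumptions made only for illustration. It uses the standard fact that, in affine coordinates, the Levi-Civita Christoffel symbols of a Hessian metric are half the third derivatives of ϕ with one index raised.

```python
import math

# Illustrative assumption: phi(x, y) = exp(x + y) + x^2 + y^2 on R^2.
# With a = exp(x + y): g = [[a + 2, a], [a, a + 2]] and every third
# partial derivative of phi equals a.

def metric(x, y):
    a = math.exp(x + y)
    return [[a + 2.0, a], [a, a + 2.0]]

def inverse_2x2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

def christoffel(x, y):
    """gamma[k][i][j] = Gamma^k_{ij} = (1/2) sum_l g^{kl} phi_{ijl}."""
    a = math.exp(x + y)
    g_inv = inverse_2x2(metric(x, y))
    return [[[0.5 * sum(g_inv[k][l] * a for l in range(2))
              for j in range(2)] for i in range(2)] for k in range(2)]

def koszul_from_christoffel(x, y):
    """Components sum_r Gamma^r_{ri}."""
    gamma = christoffel(x, y)
    return [sum(gamma[r][r][i] for r in range(2)) for i in range(2)]

def log_sqrt_det(x, y):
    g = metric(x, y)
    return 0.5 * math.log(g[0][0] * g[1][1] - g[0][1] * g[1][0])

def koszul_from_determinant(x, y, h=1e-5):
    """Components of d log |det g|^{1/2}, by central differences."""
    return [(log_sqrt_det(x + h, y) - log_sqrt_det(x - h, y)) / (2 * h),
            (log_sqrt_det(x, y + h) - log_sqrt_det(x, y - h)) / (2 * h)]

print(koszul_from_christoffel(0.3, -0.1))
print(koszul_from_determinant(0.3, -0.1))
```

For this potential both computations should agree with the closed form a / (2(a + 1)), where a = exp(x + y).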
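Let $\nabla^{LC*}$ be the Levi-Civita connection for the Hessian metric $g^* = D^* d\phi^*$ on the
dual domain $\Omega^* = \iota(\Omega)$, where $\phi^* = \sum_i x^i (\partial \phi / \partial x^i) - \phi$ is the Legendre transform
of $\phi$. Then the term (4.12) is described as follows:

\[
\Delta_{(g, D^*)} \iota = \sum_{i,j} g^{ij} \left\{ \nabla^{LC*}_{\frac{\partial}{\partial x^i}}
\left( \iota_* \frac{\partial}{\partial x^j} \right)
- \iota_* \left( D_{\frac{\partial}{\partial x^i}} \frac{\partial}{\partial x^j} \right) \right\}.
\]

Moreover, the $\gamma$-th component of $\Delta_{(g, D^*)} \iota$ is

\[
(\Delta_{(g, D^*)} \iota)^\gamma = \sum_{i,j} g^{ij} \left(
\frac{\partial^2 \iota^\gamma}{\partial x^i \partial x^j}
+ \sum_{\delta,\beta} \hat{\Gamma}^{\gamma}_{\delta\beta}
\frac{\partial \iota^\delta}{\partial x^i} \frac{\partial \iota^\beta}{\partial x^j} \right),
\qquad \gamma = 1, \dots, n + 1,
\]

where $\iota_i(x) = x^*_i \circ \iota(x)$. Therefore, if the gradient mapping $\iota$ is a harmonic map with
respect to $(g, D^*)$, i.e., if $\Delta_{(g, D^*)} \iota \equiv 0$, we have

\[
\sum_{i,j} g^{ij} \left( \frac{\partial^2 \iota^\gamma}{\partial x^i \partial x^j}
+ \sum_{\delta,\beta} \hat{\Gamma}^{\gamma}_{\delta\beta}
\frac{\partial \iota^\delta}{\partial x^i} \frac{\partial \iota^\beta}{\partial x^j} \right) = 0,
\qquad \gamma = 1, \dots, n + 1. \tag{4.13}
\]

Equation (4.13) is obtained by substituting $\iota$ for $\varphi$ in Eq. (4.3). Thus, considering a Hessian
domain $\Omega$ as a Kähler affine manifold, we have the next proposition.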

Proposition 1 The following are equivalent:

(i) the gradient mapping $\iota$ is a harmonic map with respect to $(g, D^*)$;
(ii) the gradient mapping $\iota : (\Omega, D) \to (A^*_{n+1}, \nabla^{LC*})$ is an affine harmonic map.

In [13, 14], Shima studied an affine harmonic map with the restriction of the
gradient mapping $\iota$ to a level surface of a convex function $\phi$.
In this chapter, the phrases “relative to” and “with respect to” are used
interchangeably.

4.5 α-Connections of Statistical Manifolds

We recall some definitions that are essential to the theory of statistical manifolds and
relate α-connections to Hessian domains.
Given a torsion-free affine connection $\nabla$ and a pseudo-Riemannian metric $h$ on a
manifold $N$, the triple $(N, \nabla, h)$ is said to be a statistical manifold if $\nabla h$ is symmetric.
If the curvature tensor $R$ of $\nabla$ vanishes, $(N, \nabla, h)$ is said to be flat.
Let $(N, \nabla, h)$ be a statistical manifold and let $\nabla'$ be an affine connection on $N$
such that

\[
X h(Y, Z) = h(\nabla_X Y, Z) + h(Y, \nabla'_X Z) \quad \text{for } X, Y \text{ and } Z \in \Gamma(TN),
\]

where $\Gamma(TN)$ is the set of smooth tangent vector fields on $N$. The affine connection
$\nabla'$ is torsion free and $\nabla' h$ is symmetric. Then $\nabla'$ is called the dual connection of
$\nabla$. The triple $(N, \nabla', h)$ is the dual statistical manifold of $(N, \nabla, h)$, and $(\nabla, \nabla', h)$
defines the dualistic structure on $N$. The curvature tensor of $\nabla'$ vanishes if and only if
the curvature tensor of $\nabla$ also vanishes. Under these conditions, $(\nabla, \nabla', h)$ becomes
a dually flat structure.
Let $N$ be a manifold with a dualistic structure $(\nabla, \nabla', h)$. For any $\alpha \in \mathbb{R}$, an affine
connection defined by

\[
\nabla^{(\alpha)} := \frac{1 + \alpha}{2} \nabla + \frac{1 - \alpha}{2} \nabla' \tag{4.14}
\]

is called an α-connection of $(N, \nabla, h)$. The triple $(N, \nabla^{(\alpha)}, h)$ is also a statistical
manifold, and $\nabla^{(-\alpha)}$ is the dual connection of $\nabla^{(\alpha)}$. The 1-connection $\nabla^{(1)}$, the
(−1)-connection $\nabla^{(-1)}$, and the 0-connection $\nabla^{(0)}$ correspond to $\nabla$, $\nabla'$, and the
Levi-Civita connection of $(N, h)$, respectively. An α-connection does not need to be
flat.
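
As a small worked instance of (4.14), the sketch below evaluates the Christoffel symbol of the α-connection on a one-dimensional Hessian domain and checks, in exact rational arithmetic, that the (−α)-connection is dual to the α-connection. The domain (0, ∞) with potential −log x and the function names are assumptions chosen for illustration only. In the affine coordinate the flat connection has Christoffel symbol 0 and its dual has the third derivative of the potential divided by g, which here is −2/x.

```python
from fractions import Fraction

# Illustrative assumption: Omega = (0, inf) with potential -log x in the
# affine coordinate x, so g = 1/x^2 and dg/dx = -2/x^3.

def metric(x):
    x = Fraction(x)
    return 1 / (x * x)

def metric_derivative(x):
    x = Fraction(x)
    return -2 / (x ** 3)

def gamma_alpha(alpha, x):
    """Christoffel symbol of the alpha-connection (4.14)."""
    alpha, x = Fraction(alpha), Fraction(x)
    gamma_primal = Fraction(0)   # flat connection D in its affine coordinate
    gamma_dual = -2 / x          # dual connection D'
    return (1 + alpha) / 2 * gamma_primal + (1 - alpha) / 2 * gamma_dual

def duality_defect(alpha, x):
    """dg/dx - g * (Gamma^(alpha) + Gamma^(-alpha)); zero means duality holds."""
    return metric_derivative(x) - metric(x) * (
        gamma_alpha(alpha, x) + gamma_alpha(-alpha, x))

x0 = Fraction(3, 2)
print(gamma_alpha(1, x0), gamma_alpha(0, x0), gamma_alpha(-1, x0))
print(duality_defect(Fraction(1, 3), x0))
```

In this example the 0-connection reproduces the Levi-Civita symbol −1/x, and the α-connection symbol is (1 − α) times the Levi-Civita one.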
A Hessian domain is a flat statistical manifold. Conversely, a local region of a
flat statistical manifold is a Hessian domain. For the dual connection $D'$ defined by
(4.10), $(\Omega, D', g)$ is the dual statistical manifold of $(\Omega, D, g)$ when the Hessian domain
$(\Omega, D, g)$ is regarded as a statistical manifold [1, 13, 14].

4.6 α-Affine Harmonic Maps

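In this section, we give a generalization of an affine harmonic map.
Considering a Hessian domain $(\Omega, D, g)$ as a statistical manifold, we have the
α-connection of $(\Omega, D, g)$ by

\[
D^{(\alpha)} = \frac{1 + \alpha}{2} D + \frac{1 - \alpha}{2} D'
\]

for each $\alpha \in \mathbb{R}$. Let $D^{(\alpha)*}$ be an α-connection of $(\Omega^*, D^*, g^*)$, which is the dual
statistical manifold of $(\Omega, D, g)$. Then the Laplacian of the gradient mapping $\iota$ with
respect to $(g, D^{(\alpha)*})$ is given by

\begin{align*}
\Delta_{(g, D^{(\alpha)*})} \iota &= \sum_{i,j} g^{ij} \left\{ D^{(\alpha)*}_{\frac{\partial}{\partial x^i}}
\left( \iota_* \frac{\partial}{\partial x^j} \right)
- \iota_* \left( \nabla^{LC}_{\frac{\partial}{\partial x^i}} \frac{\partial}{\partial x^j} \right) \right\} \\
&= \iota_* \left\{ \sum_{i,j} g^{ij} (D^{(-\alpha)} - \nabla^{LC})_{\frac{\partial}{\partial x^i}}
\frac{\partial}{\partial x^j} \right\} \\
&= \iota_* \left\{ \sum_{i,j} g^{ij} (\nabla^{LC} - D^{(\alpha)})_{\frac{\partial}{\partial x^i}}
\frac{\partial}{\partial x^j} \right\} \\
&= \sum_{i,j} g^{ij} \left\{ \nabla^{LC*}_{\frac{\partial}{\partial x^i}}
\left( \iota_* \frac{\partial}{\partial x^j} \right)
- \iota_* \left( D^{(\alpha)}_{\frac{\partial}{\partial x^i}} \frac{\partial}{\partial x^j} \right) \right\}.
\end{align*}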

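If $\Delta_{(g, D^{(\alpha)*})} \iota \equiv 0$, we have

\[
\sum_{i,j} g^{ij} \left\{ \frac{\partial^2 \iota^\gamma}{\partial x^i \partial x^j}
- (1 - \alpha) \sum_{k} \Gamma^{k}_{ij} \frac{\partial \iota^\gamma}{\partial x^k}
+ \sum_{\delta,\beta} \hat{\Gamma}^{\gamma}_{\delta\beta}
\frac{\partial \iota^\delta}{\partial x^i} \frac{\partial \iota^\beta}{\partial x^j} \right\} = 0,
\qquad \gamma = 1, \dots, n + 1.
\]

In general, we define the notion of α-affine harmonic maps as follows:

Definition 1 For a Kähler affine manifold $(M, g)$ and a Riemannian manifold $(N, h)$,
a map $\varphi : M \to N$ is said to be an α-affine harmonic map if

\[
\sum_{i,j} g^{ij} \left( \frac{\partial^2 \varphi^\gamma}{\partial x^i \partial x^j}
- (1 - \alpha) \sum_{k} \Gamma^{k}_{ij} \frac{\partial \varphi^\gamma}{\partial x^k}
+ \sum_{\delta,\beta} \hat{\Gamma}^{\gamma}_{\delta\beta}
\frac{\partial \varphi^\delta}{\partial x^i} \frac{\partial \varphi^\beta}{\partial x^j} \right) = 0,
\qquad \gamma = 1, \dots, \dim N. \tag{4.15}
\]

Then we obtain that the gradient mapping $\iota$ is a harmonic map with respect to
$(g, D^{(\alpha)*})$ if and only if the map $\iota : (\Omega, D^{(\alpha)}) \to (A^*_{n+1}, \nabla^{*})$ is an α-affine harmonic
map.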
Remark 3 For α = 1, a 1-affine harmonic map is an affine harmonic map.

Remark 4 For α = 0, a 0-affine harmonic map is a harmonic map in the standard
sense.

Finding applications of α-affine harmonic maps and investigating them further
remain open problems.

4.7 Harmonic Maps for α-Conformal Equivalence

In this section, we describe harmonic maps with respect to α-conformal equivalence
of statistical manifolds.
For a real number α, statistical manifolds $(N, \nabla, h)$ and $(N, \bar{\nabla}, \bar{h})$ are regarded
as α-conformally equivalent if there exists a function $\varphi$ on $N$ such that

\begin{align*}
\bar{h}(X, Y) &= e^{\varphi} h(X, Y), \tag{4.16} \\
h(\bar{\nabla}_X Y, Z) &= h(\nabla_X Y, Z) - \frac{1 + \alpha}{2}\, d\varphi(Z)\, h(X, Y) \\
&\quad + \frac{1 - \alpha}{2} \left\{ d\varphi(X)\, h(Y, Z) + d\varphi(Y)\, h(X, Z) \right\} \tag{4.17}
\end{align*}

for $X$, $Y$ and $Z \in \Gamma(TN)$. Two statistical manifolds $(N, \nabla, h)$ and $(N, \bar{\nabla}, \bar{h})$ are
α-conformally equivalent if and only if the dual statistical manifolds $(N, \nabla', h)$ and
$(N, \bar{\nabla}', \bar{h})$ are (−α)-conformally equivalent. A statistical manifold $(N, \nabla, h)$ is said
to be α-conformally flat if $(N, \nabla, h)$ is locally α-conformally equivalent to a flat
statistical manifold [19].
Let $(N, \nabla, h)$ and $(N, \bar{\nabla}, \bar{h})$ be α-conformally equivalent statistical manifolds of
dimension $n \geq 2$, and $\{x^1, \dots, x^n\}$ a local coordinate system on $N$. Suppose that $h$ and $\bar{h}$
are Riemannian metrics. We set $h_{ij} = h(\partial/\partial x^i, \partial/\partial x^j)$ and $[h^{ij}] = [h_{ij}]^{-1}$. Let
$\pi_{\mathrm{id}} : (N, \nabla, h) \to (N, \bar{\nabla}, \bar{h})$ be the identity map, i.e., $\pi_{\mathrm{id}}(x) = x$ for $x \in N$, and $\pi_{\mathrm{id}*}$
the differential of $\pi_{\mathrm{id}}$.
We define a harmonic map relative to $(h, \nabla, \bar{\nabla})$ as follows:

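Definition 2 ([16, 18]) If a tension field $\tau_{(h, \nabla, \bar{\nabla})}(\pi_{\mathrm{id}})$ vanishes on $N$, i.e.,

\[
\tau_{(h, \nabla, \bar{\nabla})}(\pi_{\mathrm{id}}) \equiv 0,
\]

the map $\pi_{\mathrm{id}} : (N, \nabla, h) \to (N, \bar{\nabla}, \bar{h})$ is said to be a harmonic map relative to
$(h, \nabla, \bar{\nabla})$, where the tension field is defined by

\begin{align*}
\tau_{(h, \nabla, \bar{\nabla})}(\pi_{\mathrm{id}}) &:= \sum_{i,j=1}^{n} h^{ij} \left\{
\bar{\nabla}_{\frac{\partial}{\partial x^i}} \left( \pi_{\mathrm{id}*} \left( \frac{\partial}{\partial x^j} \right) \right)
- \pi_{\mathrm{id}*} \left( \nabla_{\frac{\partial}{\partial x^i}} \frac{\partial}{\partial x^j} \right) \right\}
\in \Gamma(\pi_{\mathrm{id}}^{-1} TN) \\
&= \sum_{i,j=1}^{n} h^{ij} \left( \bar{\nabla}_{\frac{\partial}{\partial x^i}} \frac{\partial}{\partial x^j}
- \nabla_{\frac{\partial}{\partial x^i}} \frac{\partial}{\partial x^j} \right) \in \Gamma(TN). \tag{4.18}
\end{align*}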

Then the next theorem holds.

Theorem 1 ([16, 18]) For α-conformally equivalent statistical manifolds $(N, \nabla, h)$
and $(N, \bar{\nabla}, \bar{h})$ of $\dim N \geq 2$ satisfying Eqs. (4.16) and (4.17), if $\alpha = -(n - 2)/(n + 2)$
or $\varphi$ is a constant function on $N$, the identity map $\pi_{\mathrm{id}} : (N, \nabla, h) \to (N, \bar{\nabla}, \bar{h})$ is a
harmonic map relative to $(h, \nabla, \bar{\nabla})$.

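Proof By Eqs. (4.17) and (4.18), for $k \in \{1, \dots, n\}$ we have

\begin{align*}
h\left( \tau_{(h, \nabla, \bar{\nabla})}(\pi_{\mathrm{id}}), \frac{\partial}{\partial x^k} \right)
&= h\left( \sum_{i,j=1}^{n} h^{ij} \left( \bar{\nabla}_{\frac{\partial}{\partial x^i}} \frac{\partial}{\partial x^j}
- \nabla_{\frac{\partial}{\partial x^i}} \frac{\partial}{\partial x^j} \right), \frac{\partial}{\partial x^k} \right) \\
&= \sum_{i,j=1}^{n} h^{ij} \left\{ - \frac{1 + \alpha}{2}\, d\varphi\left( \frac{\partial}{\partial x^k} \right)
h\left( \frac{\partial}{\partial x^i}, \frac{\partial}{\partial x^j} \right) \right. \\
&\qquad \left. + \frac{1 - \alpha}{2} \left[ d\varphi\left( \frac{\partial}{\partial x^i} \right)
h\left( \frac{\partial}{\partial x^j}, \frac{\partial}{\partial x^k} \right)
+ d\varphi\left( \frac{\partial}{\partial x^j} \right)
h\left( \frac{\partial}{\partial x^i}, \frac{\partial}{\partial x^k} \right) \right] \right\} \\
&= \sum_{i,j=1}^{n} h^{ij} \left\{ - \frac{1 + \alpha}{2} \frac{\partial \varphi}{\partial x^k} h_{ij}
+ \frac{1 - \alpha}{2} \left( \frac{\partial \varphi}{\partial x^i} h_{jk}
+ \frac{\partial \varphi}{\partial x^j} h_{ik} \right) \right\} \\
&= - \frac{1 + \alpha}{2} \cdot n \cdot \frac{\partial \varphi}{\partial x^k}
+ \frac{1 - \alpha}{2} \left( \sum_{i=1}^{n} \frac{\partial \varphi}{\partial x^i} \delta_{ik}
+ \sum_{j=1}^{n} \frac{\partial \varphi}{\partial x^j} \delta_{jk} \right) \\
&= \left( - \frac{1 + \alpha}{2} \cdot n + \frac{1 - \alpha}{2} \cdot 2 \right) \frac{\partial \varphi}{\partial x^k} \\
&= - \frac{1}{2} \left\{ (n + 2) \alpha + (n - 2) \right\} \frac{\partial \varphi}{\partial x^k},
\end{align*}

where $\delta_{ij}$ is the Kronecker delta. Therefore, $\tau_{(h, \nabla, \bar{\nabla})}(\pi_{\mathrm{id}}) \equiv 0$ holds if and only if
$(n + 2)\alpha + (n - 2) = 0$ or $\partial \varphi / \partial x^k = 0$ for all $k \in \{1, \dots, n\}$ at each point in $N$.
Thus we obtain Theorem 1. □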
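
The contraction in the proof can be checked with exact rational arithmetic. The sketch below evaluates the trace of the right-hand side of (4.17) term by term and compares it with the closed form −½{(n + 2)α + (n − 2)} ∂φ/∂x^k. The restriction to a diagonal metric, the sample values, and all function names are assumptions introduced for illustration.

```python
from fractions import Fraction

def critical_alpha(n):
    """alpha = -(n - 2)/(n + 2), the value singled out in Theorem 1."""
    return Fraction(-(n - 2), n + 2)

def closed_form(alpha, n, dphi_k):
    """-(1/2) {(n + 2) alpha + (n - 2)} d(phi)/dx^k."""
    return -Fraction(1, 2) * ((n + 2) * Fraction(alpha) + (n - 2)) * dphi_k

def traced_difference(alpha, diag_h, dphi, k):
    """Trace of the (4.17) correction terms for a diagonal metric h.

    diag_h[i] is h_ii, dphi[i] is d(phi)/dx^i. Only i == j contributes,
    because h^{ij} vanishes off the diagonal.
    """
    alpha = Fraction(alpha)
    total = Fraction(0)
    for i in range(len(diag_h)):
        h_ii = Fraction(diag_h[i])
        h_ik = h_ii if i == k else Fraction(0)
        total += (1 / h_ii) * (-(1 + alpha) / 2 * dphi[k] * h_ii
                               + (1 - alpha) / 2 * (dphi[i] * h_ik + dphi[i] * h_ik))
    return total

diag_h = [2, 3, 5, 7]                  # hypothetical metric entries, n = 4
dphi = [1, -2, 3, Fraction(1, 2)]      # hypothetical differential of phi
for k in range(4):
    print(traced_difference(Fraction(1, 3), diag_h, dphi, k),
          traced_difference(critical_alpha(4), diag_h, dphi, k))
```

For a generic α the two expressions agree and are non-zero wherever ∂φ/∂x^k is non-zero, while at the critical value the trace vanishes for every k.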

4.8 α-Conformal Equivalence of Level Surfaces

We review our previous results on the α-conformal equivalence of level surfaces.
The next theorem holds for a 1-conformally flat statistical submanifold.
Theorem 2 ([19]) Let $M$ be a simply connected $n$-dimensional level surface of $\phi$
on an $(n + 1)$-dimensional Hessian domain $(\Omega, D, g = Dd\phi)$ with a Riemannian
metric $g$, and suppose that $n \geq 2$. If $(\Omega, D, g)$ is a flat statistical manifold, then
$(M, D^M, g^M)$ is a 1-conformally flat statistical submanifold of $(\Omega, D, g)$, where $D^M$
and $g^M$ are the connection and the Riemannian metric on $M$ induced by $D$ and $g$,
respectively.
See [15, 17–19] for realization problems related to α-conformal equivalence.
We now consider two simply connected level surfaces of dimension $n \geq 2$, $(M, D, g)$ and
$(\hat{M}, \hat{D}, \hat{g})$, which are 1-conformally flat statistical submanifolds of $(\Omega, D, g)$. Let $\lambda$
be a function on $M$ such that $e^{\lambda(p)} \iota(p) \in \hat{\iota}(\hat{M})$ for $p \in M$, where $\hat{\iota}$ is the restriction
of the gradient mapping $\iota$ to $\hat{M}$, and set $(e^{\lambda})(p) = e^{\lambda(p)}$. Note that the function $e^{\lambda}$
projects $M$ to $\hat{M}$ with respect to the dual affine coordinate system on $\Omega$.
We define a mapping $\pi : M \to \hat{M}$ by

\[
\hat{\iota} \circ \pi = e^{\lambda} \iota,
\]

where $\iota$ (as denoted above) is the restriction of the gradient mapping $\iota$ to $M$. Let $\bar{D}'$
be an affine connection on $M$ defined by

\[
\pi_*(\bar{D}'_X Y) = \hat{D}'_{\pi_*(X)} \pi_*(Y) \quad \text{for } X, Y \in \Gamma(TM),
\]

and $\bar{g}$ be a Riemannian metric on $M$ such that

\[
\bar{g}(X, Y) = e^{\lambda} g(X, Y) = \hat{g}(\pi_*(X), \pi_*(Y)).
\]

The following theorem has been proposed elsewhere (cf. [9, 11]).

Theorem 3 ([20]) For affine connections $D'$ and $\bar{D}'$ on $M$, the following are true:

(i) $D'$ and $\bar{D}'$ are projectively equivalent.
(ii) $(M, D', g)$ and $(M, \bar{D}', \bar{g})$ are (−1)-conformally equivalent.

Let $\bar{D}$ be an affine connection on $M$ defined by

\[
\pi_*(\bar{D}_X Y) = \hat{D}_{\pi_*(X)} \pi_*(Y) \quad \text{for } X, Y \in \Gamma(TM).
\]

From the duality of $\hat{D}$ and $\hat{D}'$, $\bar{D}$ is the dual connection of $\bar{D}'$ on $M$. Then the next
theorem holds (cf. [6, 9]).

Theorem 4 ([20]) For affine connections $D$ and $\bar{D}$ on $M$, we have that

(i) $D$ and $\bar{D}$ are dual-projectively equivalent.
(ii) $(M, D, g)$ and $(M, \bar{D}, \bar{g})$ are 1-conformally equivalent.

For α-connections $D^{(\alpha)}$ and $\bar{D}^{(\alpha)} = \bar{D}'^{(-\alpha)}$ defined similarly to (4.14), we obtain
the following corollary by Theorem 3, Theorem 4, and Eq. (4.17) with $\varphi = \lambda$ [15].

Corollary 1 For affine connections $D^{(\alpha)}$ and $\bar{D}^{(\alpha)}$ on $M$, $(M, D^{(\alpha)}, g)$ and
$(M, \bar{D}^{(\alpha)}, \bar{g})$ are α-conformally equivalent.

4.9 Harmonic Maps Relative to α-Connections on Level Surfaces

We denote $\hat{D}^{(\alpha)}_{\pi_*(X)} \pi_*(Y)$ by $\hat{D}^{(\alpha)}_X \pi_*(Y)$, considering it in the inverse-mapped section
$\Gamma(\pi^{-1} T\hat{M})$. Let $\{x^1, \dots, x^n\}$ be a local coordinate system on $M$. The notion of a
harmonic map between two level surfaces $(M, D^{(\alpha)}, g)$ and $(\hat{M}, \hat{D}^{(\alpha)}, \hat{g})$ is defined
as follows:

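Definition 3 ([16, 18]) If a tension field $\tau_{(g, D^{(\alpha)}, \hat{D}^{(\alpha)})}(\pi)$ vanishes on $M$, i.e.,

\[
\tau_{(g, D^{(\alpha)}, \hat{D}^{(\alpha)})}(\pi) \equiv 0,
\]

the map $\pi : (M, D^{(\alpha)}, g) \to (\hat{M}, \hat{D}^{(\alpha)}, \hat{g})$ is said to be a harmonic map relative to
$(g, D^{(\alpha)}, \hat{D}^{(\alpha)})$, where the tension field is defined by

\[
\tau_{(g, D^{(\alpha)}, \hat{D}^{(\alpha)})}(\pi) := \sum_{i,j=1}^{n} g^{ij} \left\{
\hat{D}^{(\alpha)}_{\frac{\partial}{\partial x^i}} \left( \pi_* \left( \frac{\partial}{\partial x^j} \right) \right)
- \pi_* \left( D^{(\alpha)}_{\frac{\partial}{\partial x^i}} \frac{\partial}{\partial x^j} \right) \right\}
\in \Gamma(\pi^{-1} T\hat{M}). \tag{4.19}
\]

We now specify the conditions for harmonicity of a map $\pi : M \to \hat{M}$ relative to
$(g, D^{(\alpha)}, \hat{D}^{(\alpha)})$.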

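Theorem 5 ([16, 18]) Let $(M, D^{(\alpha)}, g)$ and $(\hat{M}, \hat{D}^{(\alpha)}, \hat{g})$ be simply connected
$n$-dimensional level surfaces of an $(n + 1)$-dimensional Hessian domain $(\Omega, D, g)$
with $n \geq 2$. If $\alpha = -(n - 2)/(n + 2)$ or $\lambda$ is a constant function on $M$, a map
$\pi : (M, D^{(\alpha)}, g) \to (\hat{M}, \hat{D}^{(\alpha)}, \hat{g})$ is a harmonic map relative to $(g, D^{(\alpha)}, \hat{D}^{(\alpha)})$,
where

\[
\hat{\iota} \circ \pi = e^{\lambda} \iota, \qquad (e^{\lambda})(p) = e^{\lambda(p)}, \qquad
e^{\lambda(p)} \iota(p) \in \hat{\iota}(\hat{M}), \qquad p \in M,
\]

and $\iota$, $\hat{\iota}$ are the restrictions of the gradient mappings on $\Omega$ to $M$ and $\hat{M}$, respectively.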

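Proof The tension field of the map $\pi$ relative to $(g, D^{(\alpha)}, \hat{D}^{(\alpha)})$ is described by the
pull-back of $(\hat{M}, \hat{D}^{(\alpha)}, \hat{g})$, namely $(M, \bar{D}^{(\alpha)}, \bar{g})$, as follows:

\begin{align*}
\tau_{(g, D^{(\alpha)}, \hat{D}^{(\alpha)})}(\pi) &= \sum_{i,j=1}^{n} g^{ij} \left\{
\hat{D}^{(\alpha)}_{\frac{\partial}{\partial x^i}} \left( \pi_* \left( \frac{\partial}{\partial x^j} \right) \right)
- \pi_* \left( D^{(\alpha)}_{\frac{\partial}{\partial x^i}} \frac{\partial}{\partial x^j} \right) \right\} \\
&= \sum_{i,j=1}^{n} g^{ij} \left\{
\pi_* \left( \bar{D}^{(\alpha)}_{\frac{\partial}{\partial x^i}} \frac{\partial}{\partial x^j} \right)
- \pi_* \left( D^{(\alpha)}_{\frac{\partial}{\partial x^i}} \frac{\partial}{\partial x^j} \right) \right\} \\
&= \pi_* \left( \sum_{i,j=1}^{n} g^{ij} \left(
\bar{D}^{(\alpha)}_{\frac{\partial}{\partial x^i}} \frac{\partial}{\partial x^j}
- D^{(\alpha)}_{\frac{\partial}{\partial x^i}} \frac{\partial}{\partial x^j} \right) \right).
\end{align*}

Identifying $T_{\pi(x)} \hat{M}$ with $T_x M$ and considering the definition of $\pi$, we obtain

\[
\tau_{(g, D^{(\alpha)}, \hat{D}^{(\alpha)})}(\pi) = e^{\lambda} \sum_{i,j=1}^{n} g^{ij} \left(
\bar{D}^{(\alpha)}_{\frac{\partial}{\partial x^i}} \frac{\partial}{\partial x^j}
- D^{(\alpha)}_{\frac{\partial}{\partial x^i}} \frac{\partial}{\partial x^j} \right).
\]

By Corollary 1, $(M, D^{(\alpha)}, g)$ and $(M, \bar{D}^{(\alpha)}, \bar{g})$ are α-conformally equivalent, so that
Eq. (4.17) holds with $\varphi = \lambda$, $h = g$, $\nabla = D^{(\alpha)}$, and $\bar{\nabla} = \bar{D}^{(\alpha)}$ for $X$, $Y$ and
$Z \in \Gamma(TM)$. Thus, for all $k \in \{1, \dots, n\}$,

\begin{align*}
g\left( \tau_{(g, D^{(\alpha)}, \hat{D}^{(\alpha)})}(\pi), \frac{\partial}{\partial x^k} \right)
&= g\left( e^{\lambda} \sum_{i,j=1}^{n} g^{ij} \left(
\bar{D}^{(\alpha)}_{\frac{\partial}{\partial x^i}} \frac{\partial}{\partial x^j}
- D^{(\alpha)}_{\frac{\partial}{\partial x^i}} \frac{\partial}{\partial x^j} \right), \frac{\partial}{\partial x^k} \right) \\
&= e^{\lambda} \sum_{i,j=1}^{n} g^{ij} \left\{ - \frac{1 + \alpha}{2}\, d\lambda\left( \frac{\partial}{\partial x^k} \right)
g\left( \frac{\partial}{\partial x^i}, \frac{\partial}{\partial x^j} \right) \right. \\
&\qquad \left. + \frac{1 - \alpha}{2} \left[ d\lambda\left( \frac{\partial}{\partial x^i} \right)
g\left( \frac{\partial}{\partial x^j}, \frac{\partial}{\partial x^k} \right)
+ d\lambda\left( \frac{\partial}{\partial x^j} \right)
g\left( \frac{\partial}{\partial x^i}, \frac{\partial}{\partial x^k} \right) \right] \right\} \\
&= e^{\lambda} \sum_{i,j=1}^{n} g^{ij} \left\{ - \frac{1 + \alpha}{2} \frac{\partial \lambda}{\partial x^k} g_{ij}
+ \frac{1 - \alpha}{2} \left( \frac{\partial \lambda}{\partial x^i} g_{jk}
+ \frac{\partial \lambda}{\partial x^j} g_{ik} \right) \right\} \\
&= e^{\lambda} \left\{ - \frac{1 + \alpha}{2} \cdot n \cdot \frac{\partial \lambda}{\partial x^k}
+ \frac{1 - \alpha}{2} \left( \sum_{i=1}^{n} \frac{\partial \lambda}{\partial x^i} \delta_{ik}
+ \sum_{j=1}^{n} \frac{\partial \lambda}{\partial x^j} \delta_{jk} \right) \right\} \\
&= \left( - \frac{1 + \alpha}{2} \cdot n + \frac{1 - \alpha}{2} \cdot 2 \right) e^{\lambda} \frac{\partial \lambda}{\partial x^k} \\
&= - \frac{1}{2} \left\{ (n + 2) \alpha + (n - 2) \right\} e^{\lambda} \frac{\partial \lambda}{\partial x^k}.
\end{align*}

Therefore, $\tau_{(g, D^{(\alpha)}, \hat{D}^{(\alpha)})}(\pi) \equiv 0$ holds if and only if $(n + 2)\alpha + (n - 2) = 0$ or
$\partial \lambda / \partial x^k = 0$ for all $k \in \{1, \dots, n\}$ at each point in $M$. Thus we obtain Theorem 5. □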

Remark 5 If n = 2, harmonic maps π with non-constant functions λ exist if and
only if α = 0.

Remark 6 If n ≥ 3, and a map π is a harmonic map with a non-constant function
λ, then −1 < α < 0.

Remark 7 For α ≤ −1 and α > 0, harmonic maps π with non-constant functions λ
do not exist.
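
Remarks 5 to 7 can all be read off from one rewriting of the critical value in Theorem 5. The short computation below is added for illustration; the symbol α_n is introduced here and is not used elsewhere in the chapter.

```latex
\[
\alpha_n := -\frac{n-2}{n+2} = -1 + \frac{4}{n+2},
\qquad
\alpha_2 = 0, \quad \alpha_3 = -\tfrac{1}{5}, \quad \alpha_4 = -\tfrac{1}{3},
\qquad
\lim_{n \to \infty} \alpha_n = -1 .
\]
```

Since 4/(n + 2) is positive and strictly decreasing in n, the critical value satisfies −1 < α_n ≤ 0 for every n ≥ 2, with equality α_n = 0 only for n = 2.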

Definition 3 and Theorem 5 are special cases of harmonic maps between
α-conformally equivalent statistical manifolds discussed in our previous study [16].
We now provide specific examples of harmonic maps between level surfaces
relative to α-connections.

Example 1 (Regular convex cone) Let $\Omega$ and $\psi$ be a regular convex cone and its
characteristic function, respectively. On the Hessian domain $(\Omega, D, g = Dd \log \psi)$,
$d \log \psi$ is invariant under a 1-parameter group of dilations at the vertex $p$ of $\Omega$, i.e.,
$x \mapsto e^{t}(x - p) + p$, $t \in \mathbb{R}$ [5, 14]. Then, under these dilations, each map between
level surfaces of $\log \psi$ is also a dilated map in the dual coordinate system. Hence,
each dilated map between level surfaces of $\log \psi$ in the primal coordinate system is
a harmonic map relative to an α-connection for any $\alpha \in \mathbb{R}$.
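
A minimal numerical illustration of Example 1 is sketched below for one specific cone. It assumes that Ω is the positive orthant of R^n with vertex p = 0, whose characteristic function is ψ(x) = 1/(x_1 ⋯ x_n) up to a constant factor; this choice, the sample point, and the function names are assumptions made for illustration and are not taken from the chapter.

```python
import math

# Illustrative assumption: Omega is the positive orthant with vertex p = 0
# and log psi(x) = -sum_i log x_i.

def log_psi(x):
    return -sum(math.log(xi) for xi in x)

def gradient_mapping(x):
    """iota(x), with components -d(log psi)/dx^i = 1/x_i."""
    return [1.0 / xi for xi in x]

def dilate(x, t):
    """The dilation x -> e^t x at the vertex p = 0."""
    return [math.exp(t) * xi for xi in x]

x = [0.5, 2.0, 3.0]
t = 0.7
y = dilate(x, t)

# log psi shifts by the constant -n t, so level surfaces map to level surfaces.
shift = log_psi(y) - log_psi(x)

# In dual coordinates the same map is the dilation by e^{-t}.
ratios = [iy / ix for iy, ix in zip(gradient_mapping(y), gradient_mapping(x))]
print(shift, ratios, math.exp(-t))
```

Here the factor e^λ relating the two gradient mappings is the constant e^{−t}, so λ is constant on the level surface and Theorem 5 applies for every α.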

Example 2 (Symmetric cone) Let $\Omega$ and $\psi = \mathrm{Det}$ be a symmetric cone and its
characteristic function, respectively, where $\mathrm{Det}$ is the determinant of the Jordan
algebra that generates the symmetric cone. Then, similar to Example 1, each dilated
map at the origin between level surfaces of $\log \psi$ on the Hessian domain $(\Omega, D, g =
Dd \log \psi)$ is a harmonic map relative to an α-connection for any $\alpha \in \mathbb{R}$.
It is an important problem to find applications of non-trivial harmonic maps
relative to α-connections.

Acknowledgments The author thanks the referees for their helpful comments.

References

1. Amari, S., Nagaoka, H.: Methods of Information Geometry. American Mathematical Society,
Providence, Oxford University Press, Oxford (2000)
2. Cheng, S.Y., Yau, S.T.: The real Monge-Ampère equation and affine flat structures. In: Chern,
S.S., Wu, W.T. (eds.) Differential Geometry and Differential Equations, Proceedings of the
1980 Beijing Symposium, Beijing, pp. 339–370 (1982)
3. Eells, J., Lemaire, L.: Selected Topics in Harmonic Maps. American Mathematical Society,
Providence (1983)
4. Grunau, H.C., Kühnel, M.: On the existence of Hermitian-harmonic maps from complete
Hermitian to complete Riemannian manifolds. Math. Z. 249, 297–327 (2005)
5. Hao, J.H., Shima, H.: Level surfaces of non-degenerate functions in R^{n+1}. Geom. Dedicata 50,
193–204 (1994)
6. Ivanov, S.: On dual-projectively flat affine connections. J. Geom. 53, 89–99 (1995)
7. Jost, J., Şimşir, F.M.: Affine harmonic maps. Analysis 29, 185–197 (2009)
8. Jost, J., Yau, S.T.: A nonlinear elliptic system for maps from Hermitian to Riemannian manifolds
and rigidity theorems in Hermitian geometry. Acta Math. 170, 221–254 (1993)
9. Kurose, T.: On the divergences of 1-conformally flat statistical manifolds. Tôhoku Math. J. 46,
427–433 (1994)
10. Ni, L.: Hermitian harmonic maps from complete Hermitian manifolds to complete Riemannian
manifolds. Math. Z. 232, 331–335 (1999)
11. Nomizu, K., Pinkall, U.: On the geometry of affine immersions. Math. Z. 195, 165–178 (1987)
12. Nomizu, K., Sasaki, T.: Affine Differential Geometry: Geometry of Affine Immersions. Cam-
bridge University Press, Cambridge (1994)
13. Shima, H.: Harmonicity of gradient mapping of level surfaces in a real affine space. Geom.
Dedicata 56, 177–184 (1995)
14. Shima, H.: The Geometry of Hessian Structures. World Scientific Publishing, Singapore (2007)
15. Uohashi, K.: On α-conformal equivalence of statistical submanifolds. J. Geom. 75, 179–184
(2002)
16. Uohashi, K.: Harmonic maps relative to α-connections on statistical manifolds. Appl. Sci. 14,
82–88 (2012)
17. Uohashi, K.: A Hessian domain constructed with a foliation by 1-conformally flat statistical
manifolds. Int. Math. Forum 7, 2363–2371 (2012)
18. Uohashi, K.: Harmonic maps relative to α-connections on Hessian domains. In: Nielsen, F.,
Barbaresco, F. (eds.) Geometric Science of Information, First International Conference, GSI
2013, Paris, France, 28–30 August 2013. Proceedings, LNCS, vol. 8085, pp. 745–750. Springer,
Heidelberg (2013)
19. Uohashi, K., Ohara, A., Fujii, T.: 1-Conformally flat statistical submanifolds. Osaka J. Math.
37, 501–507 (2000)

20. Uohashi, K., Ohara, A., Fujii, T.: Foliations and divergences of flat statistical manifolds.
Hiroshima Math. J. 30, 403–414 (2000)
21. Urakawa, H.: Calculus of Variations and Harmonic Maps. Shokabo, Tokyo (1990) (in Japanese)
Chapter 5
A Riemannian Geometry in the q-Exponential
Banach Manifold Induced by q-Divergences

Héctor R. Quiceno, Gabriel I. Loaiza and Juan C. Arango

Abstract In this chapter we consider a deformation of the nonparametric exponential
statistical models, using Tsallis' deformed exponentials, to construct a Banach
manifold modelled on spaces of essentially bounded random variables. As a result of
the construction, this manifold recovers the exponential manifold given by Pistone
and Sempi up to continuous embeddings on the modeling space. The q-divergence
functional plays two important roles on the manifold: on the one hand, the coordinate
mappings are given in terms of the q-divergence functional; on the other hand, this
functional induces a Riemannian geometry for which Amari's α-connections and the
Levi-Civita connection appear as special cases of the induced q-connections ∇^(q).
The main result is the flatness (zero curvature) of the manifold.

5.1 Introduction

The study of the differential geometric structure of statistical models, especially its
Riemannian structure, has been developed in two settings: the finite- and
infinite-dimensional cases. The former is due to Rao [20] and Jeffreys [11], where the Fisher
information is given as a metric for a parametric statistical model { p(x, θ), θ = (θ1 , . . . , θn )}
together with the non-flat Levi-Civita connection. Efron [7] introduced the concept
of statistical curvature and implicitly used a new connection, known as the
exponential connection, which was studied in depth by Dawid [6]. The study of the
finite-dimensional case culminated with Amari [1], who defined a one-parameter family of
α-connections that specializes to the exponential connection when α → 1, the
essential concept of duality, and the notions of statistical divergence, among others.
This geometry is characterized by the facts that the exponential connection for the
exponential family has vanishing components and zero curvature.

H. R. Quiceno (B) · G. I. Loaiza · J. C. Arango


Universidad Eafit, Carrera 49 N° 7 Sur-50, Medellin, Colombia, Suramérica
e-mail: [email protected]

F. Nielsen (ed.), Geometric Theory of Information, 97


Signals and Communication Technology, DOI: 10.1007/978-3-319-05317-2_5,
© Springer International Publishing Switzerland 2014

The infinite-dimensional case was initially developed by Pistone and Sempi [19],
who constructed a manifold for the exponential family using Orlicz spaces as
coordinate spaces. Gibilisco and Pistone [10] defined the exponential connection as
the natural connection induced by the use of Orlicz spaces and showed that the
exponential and mixture connections are in a duality relationship with the
α-connections, just as in the parametric case. In these structures, every point of
the manifold is a probability density on some measurable sample space, with
particular focus on the exponential family, since almost every model of
non-extensive physics belongs to this family. Nevertheless, some models (complex
models, see [4]) do not fit in these representations, so a new family of
distributions must be defined to contain those models. One of these families is the
q-exponential family, based on a deformation of the exponential function which has
been used in several applications using the Tsallis index q; for references see
[4, 21].
In order to give a geometric structure to these models, Amari and Ohara [3] studied
the geometry of the q-exponential family in the finite dimensional setting and found
that this family has a dually flat geometrical structure, derived from the Legendre
transformation and understood by means of conformal geometry. In 2013, Loaiza and
Quiceno [15] constructed for this family a non-parametric statistical manifold
modeled on essentially bounded function spaces, such that each q-exponential
parametric model is identified with the tangent space and the coordinate maps are
naturally defined in terms of relative entropies in the context of Tsallis; in [14],
the Riemannian structure is characterized.
The manifold constructed in [14, 15] is characterized by the fact that when q → 1
the non-parametric exponential models are obtained and the manifold constructed by
Pistone and Sempi is recovered, up to continuous embeddings on the modeling space.
This means that the manifolds are related by some map
$L^{\Phi}(p \cdot \mu) \leftrightarrow L^{\infty}(p \cdot \mu)$, which should be
investigated.

5.2 q-Exponential Functions and q-Exponential Families

As mentioned, some complex phenomena do not fit the Gibbs distribution but rather a
power law; the Tsallis q-entropy is an example capturing such systems. Based on the
Tsallis entropy index q, a generalized exponential family, named the q-exponential
family, has been constructed; it is presented here, see [4, 16, 21].
Given a real number q, we consider the q-deformed exponential and logarithmic
functions, which are respectively defined by

$$e_q^x = [1 + (1 - q)x]^{\frac{1}{1-q}} \quad \text{if } \frac{-1}{1-q} \le x \tag{5.1}$$

$$\ln_q(x) = \frac{x^{1-q} - 1}{1 - q} \quad \text{if } x > 0. \tag{5.2}$$

Fig. 5.1 expq (x) for q < 0

The above functions satisfy properties similar to those of the natural exponential
and logarithmic functions (Fig. 5.1).
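As a minimal sketch of definitions (5.1) and (5.2), the two deformed functions can be written in a few lines of Python. The function names `exp_q` and `ln_q`, and the choice to return 0 (for q < 1) or infinity (for q > 1) outside the domain of (5.1), are assumptions made for this illustration and are not taken from the chapter.

```python
import math

def exp_q(x, q):
    """q-deformed exponential (5.1); falls back to math.exp when q == 1."""
    if q == 1.0:
        return math.exp(x)
    base = 1.0 + (1.0 - q) * x
    if base <= 0.0:
        # Assumed convention outside the domain of (5.1):
        # cut off to 0 for q < 1, diverge for q > 1.
        return 0.0 if q < 1.0 else math.inf
    return base ** (1.0 / (1.0 - q))

def ln_q(x, q):
    """q-deformed logarithm (5.2), defined for x > 0."""
    if x <= 0.0:
        raise ValueError("ln_q requires x > 0")
    if q == 1.0:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

# ln_q undoes exp_q on the domain of (5.1)
roundtrip = ln_q(exp_q(0.7, 0.5), 0.5)
# for q close to 1 the deformed exponential approaches the natural one
near_natural = exp_q(0.7, 0.999999)
```

For q = 1/2 the deformed exponential is simply the quadratic (1 + x/2)^2, which makes the round trip easy to check by hand.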

5.2.1 q-Deformed Properties and Algebra

It is necessary to show the basic properties of the q-deformed exponential and loga-
rithm functions. Take the definitions given in (5.1) and (5.2) (Fig. 5.2).

Proposition 1
1. For q < 0, x ∈ [0, ∞), $\exp_q(x)$ is positive, continuous, increasing, concave
and such that:
$$\lim_{x \to \infty} \exp_q(x) = \infty.$$
2. For 0 < q < 1, x ∈ [0, ∞), $\exp_q(x)$ is positive, continuous, increasing, convex
and such that:
$$\lim_{x \to \infty} \exp_q(x) = \infty.$$
3. For 1 < q, $x \in \left[0, \frac{1}{q-1}\right)$, $\exp_q(x)$ is positive,
continuous, increasing, convex and such that:
$$\lim_{x \to \frac{1}{q-1}} \exp_q(x) = \infty.$$

Some graphics are shown for different values of the index q, to illustrate the
behavior of $\exp_q(x)$ (Fig. 5.3).

Fig. 5.2 expq (x) for 0 < q < 1

Fig. 5.3 expq (x) for q > 1



Fig. 5.4 lnq (x) for q < 0

Proposition 2 For the function $\ln_q(x)$:
1. For q < 0, x ∈ [0, ∞), $\ln_q(x)$ is continuous, increasing, convex and such that:
$$\lim_{x \to \infty} \ln_q(x) = \infty.$$
2. For 0 < q < 1, x ∈ [0, ∞), $\ln_q(x)$ is continuous, increasing, concave and such
that:
$$\lim_{x \to \infty} \ln_q(x) = \infty.$$
3. For 1 < q, x ∈ (0, ∞), $\ln_q(x)$ is increasing, continuous, concave and such that:
$$\lim_{x \to \infty} \ln_q(x) = \frac{1}{q-1}.$$

Some graphics are shown for different values of the index q, to illustrate the
behavior of $\ln_q(x)$ (Fig. 5.4).
The following proposition shows that the deformed functions share properties similar
to those of the natural ones (Fig. 5.5).

Fig. 5.5 $\ln_q(x)$ for 0 < q < 1

Fig. 5.6 $\ln_q(x)$ for q > 1

Proposition 3
1. Product
$$\exp_q(x)\exp_q(y) = \exp_q(x + y + (1 - q)xy). \tag{5.3}$$
2. Quotient
$$\frac{\exp_q(x)}{\exp_q(y)} = \exp_q\left(\frac{x - y}{1 + (1 - q)y}\right). \tag{5.4}$$
3. Power law
$$(\exp_q x)^n = \exp_{1 - \frac{1-q}{n}}(nx). \tag{5.5}$$
4. Inverse
$$\left(\exp_q(x)\right)^{-1} = \exp_q\left(\frac{-x}{1 + (1 - q)x}\right) = \exp_{2-q}(-x). \tag{5.6}$$
5. Derivative
$$\frac{d}{dx}\left[\exp_q(x)\right] = (\exp_q(x))^q = \exp_{2 - \frac{1}{q}}(qx). \tag{5.7}$$
6. Integral
$$\int \exp_q(nx)\,dx = \frac{1}{(2 - q)n}(\exp_q(nx))^{2-q}. \tag{5.8}$$
7. Product
$$\ln_q(xy) = \ln_q(x) + \ln_q(y) + (1 - q)\ln_q(x)\ln_q(y). \tag{5.9}$$
8. Quotient
$$\ln_q\left(\frac{x}{y}\right) = \frac{\ln_q(x) - \ln_q(y)}{1 + (1 - q)\ln_q(y)}. \tag{5.10}$$
9. Power law
$$\ln_q(x^n) = \frac{n}{1 - q}\ln_{1-n}(x^{1-q}). \tag{5.11}$$
10. Inverse
$$\ln_q\left(x^{-1}\right) = \frac{-\ln_q(x)}{1 + (1 - q)\ln_q(x)} = \frac{-1}{x^{1-q}}\ln_q(x). \tag{5.12}$$
11. Derivative
$$\frac{d}{dx}\left[\ln_q(x)\right] = \frac{1}{x^q}. \tag{5.13}$$
12. Integral
$$\int \ln_q(x)\,dx = \frac{x(\ln_q(x) - 1)}{2 - q}. \tag{5.14}$$
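Some of the identities of Proposition 3 are easy to spot-check numerically. The following sketch evaluates the product rule (5.3), the quotient rule (5.4), the inverse rule (5.6) and the derivative rule (5.7) at one sample point; the helper name `exp_q`, the sample values q = 0.5, x = 0.3, y = 0.2 and the finite-difference step are arbitrary choices for illustration, and a single point is of course not a proof.

```python
def exp_q(x, q):
    """q-deformed exponential (5.1), assuming q != 1 and 1 + (1 - q) x > 0."""
    return (1.0 + (1.0 - q) * x) ** (1.0 / (1.0 - q))

q, x, y = 0.5, 0.3, 0.2

# (5.3): product of q-exponentials
product_gap = exp_q(x, q) * exp_q(y, q) - exp_q(x + y + (1 - q) * x * y, q)

# (5.4): quotient of q-exponentials
quotient_gap = exp_q(x, q) / exp_q(y, q) - exp_q((x - y) / (1 + (1 - q) * y), q)

# (5.6): the reciprocal is a q-exponential with index 2 - q
inverse_gap = 1.0 / exp_q(x, q) - exp_q(-x, 2 - q)

# (5.7): derivative, approximated here by a central finite difference
h = 1e-6
derivative_gap = (exp_q(x + h, q) - exp_q(x - h, q)) / (2 * h) - exp_q(x, q) ** q
```

All four gaps should vanish up to floating-point and finite-difference error.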

5.2.2 q-Algebra

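For real numbers, there are two binary operations given in terms of the index q, as
follows (Fig. 5.6).
1. The q-sum $\oplus_q : \mathbb{R}^2 \to \mathbb{R}$ is given by:
$$x \oplus_q y = x + y + (1 - q)xy. \tag{5.15}$$
2. The q-product $\otimes_q : \mathbb{R}^+ \times \mathbb{R}^+ \to \mathbb{R}$ is given by:
$$x \otimes_q y = \left[x^{1-q} + y^{1-q} - 1\right]_+^{\frac{1}{1-q}}, \quad x > 0 \text{ and } y > 0. \tag{5.16}$$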

Proposition 4 The above operations are associative and commutative, and each has a
neutral element: 0 for the q-sum and 1 for the q-product.

Proposition 5 Let $x \ne \frac{1}{q-1}$. Then there exists a unique inverse b for
$\oplus_q$, denoted by $b = \ominus_q x$ and given by $b = -\frac{x}{1 + (1 - q)x}$.

Then
$$x \oplus_q (\ominus_q x) = 0.$$

With this definition, the q-difference can be defined as:

$$x \ominus_q y = x \oplus_q (\ominus_q y) = x - \frac{y}{1 + (1 - q)y} - \frac{(1 - q)xy}{1 + (1 - q)y} = \frac{x - y}{1 + (1 - q)y}. \tag{5.17}$$

And so the following properties hold:

$$x \ominus_q y = \ominus_q y \oplus_q x. \tag{5.18}$$

$$x \ominus_q (y \ominus_q z) = (x \ominus_q y) \oplus_q z. \tag{5.19}$$

Proposition 6 For x > 0 there exists a unique inverse for $\otimes_q$, denoted
$\oslash_q x$ and given by $b = (2 - x^{1-q})^{\frac{1}{1-q}}$.

With the previous proposition, a q-quotient is defined for positive reals x, y such
that $x^{1-q} \le y^{1-q} + 1$, given by:

$$x \oslash_q y = x \otimes_q (\oslash_q y) = \left(x^{1-q} + (2 - y^{1-q})_+ - 1\right)^{\frac{1}{1-q}} = \left(x^{1-q} - y^{1-q} + 1\right)_+^{\frac{1}{1-q}}. \tag{5.20}$$

Note that for 0 < x ≤ 2 it follows that $1 \oslash_q (1 \oslash_q x) = x$; and if
q < 1, the expression $1 \oslash_q 0$ is convergent.

$$x \oslash_q y = 1 \oslash_q (y \oslash_q x) \quad \text{if } x^{1-q} \le y^{1-q} + 1. \tag{5.21}$$

$$x \oslash_q (y \oslash_q z) = (x \oslash_q y) \otimes_q z = (x \otimes_q z) \oslash_q y \quad \text{if } z^{1-q} - 1 \le y^{1-q} \le x^{1-q} + 1. \tag{5.22}$$

These definitions, along with Proposition 3, allow one to prove the next proposition.

Proposition 7
1.
$$\exp_q(x)\exp_q(y) = \exp_q(x \oplus_q y). \tag{5.23}$$
$$\frac{\exp_q(x)}{\exp_q(y)} = \exp_q(x \ominus_q y). \tag{5.24}$$
$$\exp_q(x) \otimes_q \exp_q(y) = \exp_q(x + y). \tag{5.25}$$
$$\exp_q(x) \oslash_q \exp_q(y) = \exp_q(x - y). \tag{5.26}$$
2.
$$\ln_q(xy) = \ln_q(x) \oplus_q \ln_q(y). \tag{5.27}$$
$$\ln_q\left(\frac{x}{y}\right) = \ln_q(x) \ominus_q \ln_q(y). \tag{5.28}$$
$$\ln_q(x \otimes_q y) = \ln_q(x) + \ln_q(y). \tag{5.29}$$
$$\ln_q(x \oslash_q y) = \ln_q(x) - \ln_q(y). \tag{5.30}$$
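The q-algebra can be sketched directly from (5.15), (5.16), (5.17) and (5.20). The function names below are hypothetical, and the sketch assumes q < 1 so that the positive-part bracket can be raised to the power 1/(1 - q) without division by zero. It then evaluates the four exponential identities of Proposition 7 at one sample point.

```python
def exp_q(x, q):
    """q-deformed exponential (5.1), assuming 1 + (1 - q) x > 0."""
    return (1.0 + (1.0 - q) * x) ** (1.0 / (1.0 - q))

def q_sum(x, y, q):
    """q-sum (5.15)."""
    return x + y + (1.0 - q) * x * y

def q_diff(x, y, q):
    """q-difference (5.17)."""
    return (x - y) / (1.0 + (1.0 - q) * y)

def q_prod(x, y, q):
    """q-product (5.16) for x, y > 0; the bracket is cut off at zero."""
    base = x ** (1.0 - q) + y ** (1.0 - q) - 1.0
    return max(base, 0.0) ** (1.0 / (1.0 - q))

def q_quot(x, y, q):
    """q-quotient (5.20) for x, y > 0; the bracket is cut off at zero."""
    base = x ** (1.0 - q) - y ** (1.0 - q) + 1.0
    return max(base, 0.0) ** (1.0 / (1.0 - q))

q, x, y = 0.5, 0.3, 0.2
a, b = exp_q(x, q), exp_q(y, q)

gap_523 = a * b - exp_q(q_sum(x, y, q), q)         # (5.23)
gap_524 = a / b - exp_q(q_diff(x, y, q), q)        # (5.24)
gap_525 = q_prod(a, b, q) - exp_q(x + y, q)        # (5.25)
gap_526 = q_quot(a, b, q) - exp_q(x - y, q)        # (5.26)
zero = q_sum(x, q_diff(0.0, x, q), q)              # x (+)_q ((-)_q x) = 0
```

Here `q_diff(0.0, x, q)` plays the role of the q-inverse of x, since 0 is the neutral element of the q-sum.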

5.2.3 The q-Exponential Family

Some interesting models of statistical physics can be written in the following form,

$$f_\beta(x) = c(x)\exp_q(-\alpha(\beta) - \beta H(x)); \tag{5.31}$$

readers interested in these examples should see [16]. If the q-exponential on the
r.h.s. diverges then $f_\beta(x) = 0$ is assumed. The function H(x) is the
Hamiltonian. The parameter β is usually the inverse temperature. The normalization
α(β) is written inside the q-exponential. The function c(x) is the prior
distribution. It is a reference measure and must not depend on the parameter β.
If a model is of the above form then it is said to belong to the q-exponential
family. In the limit q → 1 these are the models of the standard exponential family.
In that case the expression (5.31) reduces to

$$f_\beta(x) = c(x)\exp(-\alpha(\beta) - \beta H(x)),$$

which is known as the Boltzmann-Gibbs distribution.


The convention that $f_\beta(x) = 0$ when the r.h.s. of (5.31) diverges may seem odd.
However, one can argue that this is the right thing to do, so that the integral of
the model is always finite. Also, the example of the harmonic oscillator, given
below, will clarify this point. A reformulation of (5.31) is therefore that either
$f_\beta(x) = 0$ or

$$\ln_q\left(\frac{f_\beta(x)}{c(x)}\right) = -\alpha(\beta) - \beta H(x).$$

Models belonging to such a family share a number of interesting properties. In
particular, they all fit into the thermodynamic formalism. As a consequence, the
probability density $f_\beta(x)$ may be considered to be the equilibrium distribution
of the model at the given value of the parameter β. Some basic examples of this
family are developed in [16], as follows.

5.2.3.1 The q-Gaussian Distribution (q < 3)

Many of the distributions encountered in the literature on non-extensive statistics
can be brought into the form (5.31). A prominent model encountered in this context
is the q-Gaussian distribution

$$f(x) = \frac{1}{c_q \sigma}\exp_q\left(-\frac{x^2}{\sigma^2}\right) \tag{5.32}$$

with

$$c_q = \int_{-\infty}^{\infty} \exp_q(-x^2)\,dx = \sqrt{\frac{\pi}{q-1}}\,\frac{\Gamma\left(-\frac{1}{2} + \frac{1}{q-1}\right)}{\Gamma\left(\frac{1}{q-1}\right)} \quad \text{if } 1 < q < 3,$$

$$c_q = \sqrt{\frac{\pi}{1-q}}\,\frac{\Gamma\left(1 + \frac{1}{1-q}\right)}{\Gamma\left(\frac{3}{2} + \frac{1}{1-q}\right)} \quad \text{if } q < 1.$$

Take for instance $q = \frac{1}{2}$. Then (5.32) becomes

$$f(x) = \frac{15\sqrt{2}}{32\sigma}\left[1 - \frac{x^2}{2\sigma^2}\right]_+^2.$$

Note that this distribution vanishes outside the interval
$[-\sqrt{2}\sigma, \sqrt{2}\sigma]$. The case q → 1 reproduces the conventional Gauss
distribution. For q = 2 one obtains

$$f(x) = \frac{1}{\pi}\frac{\sigma}{x^2 + \sigma^2}. \tag{5.33}$$

This is known as the Cauchy distribution. The function (5.33) is also called a
Lorentzian. In the range 1 ≤ q < 3 the q-Gaussian is strictly positive on the whole
line. For q < 1 it is zero outside an interval. For q ≥ 3 the distribution cannot be
normalized because

$$f(x) \sim \frac{1}{|x|^{\frac{2}{q-1}}} \quad \text{as } |x| \to \infty.$$
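The normalization of the q-Gaussian can be checked numerically. The sketch below codes the closed form of c_q with `math.gamma` and, for q = 1/2, integrates the density (5.32) over its compact support with a hand-written Simpson rule. The helper names, the value σ = 2 and the number of Simpson panels are assumptions made only for this illustration.

```python
import math

def exp_q(x, q):
    """q-deformed exponential with the cut-off convention (0 where the bracket is <= 0, q < 1)."""
    base = 1.0 + (1.0 - q) * x
    if base <= 0.0:
        return 0.0 if q < 1.0 else math.inf
    return base ** (1.0 / (1.0 - q))

def c_q(q):
    """Closed-form normalization constant of the q-Gaussian."""
    if 1.0 < q < 3.0:
        k = 1.0 / (q - 1.0)
        return math.sqrt(math.pi * k) * math.gamma(k - 0.5) / math.gamma(k)
    if q < 1.0:
        k = 1.0 / (1.0 - q)
        return math.sqrt(math.pi * k) * math.gamma(1.0 + k) / math.gamma(1.5 + k)
    raise ValueError("c_q is coded here only for q < 1 or 1 < q < 3")

def q_gaussian(x, q, sigma):
    """q-Gaussian density (5.32)."""
    return exp_q(-(x / sigma) ** 2, q) / (c_q(q) * sigma)

def simpson(f, a, b, n=2000):
    """Composite Simpson rule with an even number n of panels."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0

sigma = 2.0
edge = math.sqrt(2.0) * sigma          # support boundary for q = 1/2
mass = simpson(lambda x: q_gaussian(x, 0.5, sigma), -edge, edge)
```

For q = 2 the closed form gives c_q = π, in agreement with the Cauchy density (5.33).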

5.2.3.2 Kappa-Distributions (1 < q < 5/3)

The following distribution is known in plasma physics as the kappa-distribution

$$f(v) = \frac{1}{A(\kappa)v_0^3}\,\frac{v^2}{\left(1 + \frac{1}{\kappa - a}\frac{v^2}{v_0^2}\right)^{1+\kappa}}. \tag{5.34}$$

This distribution is a modification of the Maxwell distribution, exhibiting a power
law decay like $f(v) \simeq v^{-2\kappa}$ for large v.
Expression (5.34) can be written in the form of a q-exponential with
$q = 1 + \frac{1}{1+\kappa}$ and

$$f(v) = \frac{v^2}{A(\kappa)v_0^3}\exp_q\left(-\frac{1}{2 - q - (q-1)a}\frac{v^2}{v_0^2}\right). \tag{5.35}$$

However, in order to be of the form (5.31), the pre-factor of (5.35) should not
depend on the parameter $v_0$. Introduce an arbitrary constant c > 0 with the
dimensions of a velocity. Then one can write

$$f(v) = \frac{4\pi v^2}{c^3}\exp_q\left(-\ln_{2-q}\left(4\pi A(\kappa)\left(\frac{v_0}{c}\right)^3\right) - \left[4\pi A(\kappa)\left(\frac{v_0}{c}\right)^3\right]^{q-1} h(q, v)\right),$$

where $h(q, v) = \frac{1}{2 - q - (q-1)a}\frac{v^2}{v_0^2}$.
In the case q → 1, one obtains the Maxwell distribution.
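The three ways of writing the kappa-distribution can be compared numerically. In the sketch below the values of κ, a, v0, c and the normalization A(κ) are hypothetical numbers chosen only to exercise the algebra (A(κ) is simply set to 1 rather than computed), and the function names are not from the chapter.

```python
import math

def exp_q(x, q):
    """q-deformed exponential (5.1), assuming 1 + (1 - q) x > 0."""
    return (1.0 + (1.0 - q) * x) ** (1.0 / (1.0 - q))

def ln_q(x, q):
    """q-deformed logarithm (5.2) for x > 0."""
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

# Hypothetical parameter values, for illustration only.
kappa, a, v0, c, A = 3.0, 1.5, 2.0, 1.0, 1.0
q = 1.0 + 1.0 / (1.0 + kappa)

def h(v):
    return v ** 2 / ((2.0 - q - (q - 1.0) * a) * v0 ** 2)

def f_power_law(v):
    """Kappa-distribution as in (5.34)."""
    return v ** 2 / (A * v0 ** 3) / (1.0 + v ** 2 / ((kappa - a) * v0 ** 2)) ** (1.0 + kappa)

def f_q_exponential(v):
    """Same density written with a q-exponential, as in (5.35)."""
    return v ** 2 / (A * v0 ** 3) * exp_q(-h(v), q)

def f_family_form(v):
    """Same density with a pre-factor independent of v0, as required by (5.31)."""
    big_k = 4.0 * math.pi * A * (v0 / c) ** 3
    return 4.0 * math.pi * v ** 2 / c ** 3 * exp_q(
        -ln_q(big_k, 2.0 - q) - big_k ** (q - 1.0) * h(v), q)

speeds = [0.1, 0.5, 1.0, 2.0, 5.0]
gaps = [(abs(f_power_law(v) - f_q_exponential(v)),
         abs(f_power_law(v) - f_family_form(v))) for v in speeds]
```

The agreement relies on the identities (5.3) and (5.6): the constant pre-factor is absorbed into the q-exponential through a q-sum.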

5.2.3.3 Speed of Harmonic Oscillator (q = 3)

The distribution of velocities v of a classical harmonic oscillator is given by

$$f(v) = \frac{1}{\pi}\frac{1}{\sqrt{v_0^2 - v^2}}.$$

It diverges when |v| approaches its maximal value $v_0$ and vanishes for
$|v| > v_0$. This distribution can be written in the form (5.31) of a q-exponential
family with q = 3. To do so, let x = v and

$$c(x) = \frac{\sqrt{2}}{\pi|v|}, \qquad \beta = \frac{1}{2}mv_0^2, \qquad H(x) = \frac{1}{\frac{1}{2}mv^2} = \frac{1}{K}, \qquad \alpha(\beta) = -\frac{3}{2}.$$

5.3 Amari and Ohara’s q-Manifold

In 2011, Amari and Ohara [3] constructed a finite dimensional manifold for the
family of q-exponential distributions with the properties of a dually flat geometry
derived from Legendre transformations and such that the maximizer of the q-escort
distribution is a Bayesian estimator.
Let $S_n$ denote the family of discrete distributions over (n + 1) elements
$X = \{x_0, x_1, \dots, x_n\}$; put $p_i = \mathrm{Prob}\{x = x_i\}$ and denote the
probability distribution vector $p = (p_0, p_1, \dots, p_n)$.
The probability of x is written as

$$p(x) = \sum_{i=0}^{n} p_i \delta_i(x).$$

Proposition 8 The family $S_n$ has the structure of a q-exponential family for any q.
The proof is based on the fact that the q-exponential family is written as

$$p(x, \theta) = \exp_q\{\theta \cdot x - \psi_q(\theta)\}, \tag{5.36}$$

where $\psi_q(\theta)$ is called the q-potential, given by

$$\psi_q(\theta) = -\ln_q(p_0), \quad \text{and} \quad p_0 = 1 - \sum_{i=1}^{n} p_i.$$

It should be noted that the family in (5.36) is not in general of the form given in
[16], since it does not have the pre-factor (prior distribution) c(x) as in (5.31).
The geometry is induced by a q-divergence defined by the q-potential. Since
$\psi_q$ is a convex function, it defines a divergence of the Bregman type

$$D_q[p(x, \theta_1) : p(x, \theta_2)] = \psi_q(\theta_2) - \psi_q(\theta_1) - \nabla\psi_q(\theta_1) \cdot (\theta_2 - \theta_1),$$

which simplifies to

$$D_q[p : r] = \frac{1}{(1 - q)h_q(p)}\left(1 - \sum_{i=0}^{n} p_i^q r_i^{1-q}\right),$$

where p, r are two discrete distributions and $h_q(p) = \sum_{i=0}^{n} p_i^q$.
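For discrete distributions the simplified form of the q-divergence is a one-line computation. The sketch below implements it and compares it with the Kullback-Leibler divergence for q close to 1; the function names and the two sample distributions are assumptions made for this illustration.

```python
import math

def h_q(p, q):
    """h_q(p) = sum_i p_i^q."""
    return sum(pi ** q for pi in p)

def q_divergence(p, r, q):
    """Simplified q-divergence D_q[p : r] between two discrete distributions."""
    overlap = sum(pi ** q * ri ** (1.0 - q) for pi, ri in zip(p, r))
    return (1.0 - overlap) / ((1.0 - q) * h_q(p, q))

def kl_divergence(p, r):
    """Kullback-Leibler divergence, the expected q -> 1 limit."""
    return sum(pi * math.log(pi / ri) for pi, ri in zip(p, r))

p = [0.2, 0.3, 0.5]
r = [0.4, 0.4, 0.2]

d_half = q_divergence(p, r, 0.5)        # strictly positive since p != r
d_self = q_divergence(p, p, 0.5)        # vanishes when both arguments coincide
d_near_one = q_divergence(p, r, 0.999999)
```

The positivity for 0 < q < 1 follows from Hölder's inequality applied to the overlap term.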
For two infinitesimally close points on the manifold, the divergence is

$$D_q[p(x, \theta) : p(x, \theta + d\theta)] = \sum g^q_{ij}(\theta)\,d\theta^i d\theta^j,$$

where $g^q_{ij} = \partial_i \partial_j \psi(\theta)$ is the q-metric tensor.
Note that, when q = 1, this metric reduces to the usual Fisher metric.
Proposition 9 The q-metric is given by a conformal transformation of the Fisher
information metric $g^F_{ij}$, as

$$g^q_{ij} = \frac{q}{h_q(\theta)}\,g^F_{ij}.$$

With this Riemannian geometry, the geodesic curves of the manifold, given by

$$\ln_q[p(x, t)] = t\ln_q[p_1(x)] + (1 - t)\ln_q[p_2(x)] - c(t),$$

allow one to find the maximizer of the q-score function as a Bayesian estimator,
see [3].

5.4 q-Exponential Statistical Banach Manifold

Let $(\Omega, \Sigma, \mu)$ be a probability space and q a real number such that
0 < q < 1. Denote by $M_\mu$ the set of strictly positive probability densities
μ-a.e. For each $p \in M_\mu$ consider the probability space
$(\Omega, \Sigma, p \cdot \mu)$, where $p \cdot \mu$ is the probability measure
given by

$$(p \cdot \mu)(A) = \int_A p\,d\mu.$$

Thus, the space of essentially bounded functions $L^\infty(p \cdot \mu)$ is a Banach
space with respect to the essential supremum norm, denoted by
$\|\cdot\|_{p,\infty}$, and is included in the set of random variables
$u \in L^1(p \cdot \mu)$ such that the maps
$\hat{u}_p : \mathbb{R} \to [0, +\infty]$ given by

$$\hat{u}_p(t) := \int e^{(tu)} p\,d\mu = E_p[e^{(tu)}]$$

are finite in a neighborhood of the origin 0.
For each $p \in M_\mu$ consider

$$B_p := \{u \in L^\infty(p \cdot \mu) : E_p[u] = 0\},$$

which (with the essential supremum norm) is a closed normed subspace of the Banach
space $L^\infty(p \cdot \mu)$; thus $B_p$ is a Banach space.
Two probability densities $p, z \in M_\mu$ are connected by a one-dimensional
q-exponential model if there exist $r \in M_\mu$, $u \in L^\infty(r \cdot \mu)$, a
real function ψ of a real variable and δ > 0 such that, for all t ∈ (−δ, δ), the
function f defined by

$$f(t) = e_q^{tu \ominus_q \psi(t)}\,r$$

satisfies that there are $t_0, t_1 \in (-\delta, \delta)$ for which $p = f(t_0)$ and
$z = f(t_1)$. The function f is called a one-dimensional q-exponential model, since
it is a deformation of the model $f(t) = e^{tu - \psi(t)}r$, see [18].
Define the mapping $M_p$ by

$$M_p(u) = E_p[e_q^{(u)}],$$

denoting its domain by $D_{M_p} \subset L^\infty(p \cdot \mu)$. Also, define the
mapping $K_p : B_{p,\infty}(0, 1) \to [0, \infty]$, for each
$u \in B_{p,\infty}(0, 1)$, by

$$K_p(u) = \ln_q[M_p(u)].$$

Using results on convergence of series in Banach spaces, see [12], it can be proven
that the domain $D_{M_p}$ of $M_p$ contains the open unit ball
$B_{p,\infty}(0, 1) \subset L^\infty(p \cdot \mu)$. Moreover, when $M_p$ is
restricted to $B_{p,\infty}(0, 1)$, this function is analytic and infinitely Fréchet
differentiable.
Let $(\Omega, \Sigma, \mu)$ be a probability space and q a real number with
0 < q < 1. Let $V_p := \{u \in B_p : \|u\|_{p,\infty} < 1\}$ for each
$p \in M_\mu$. Define the maps

$$e_{q,p} : V_p \to M_\mu$$

by

$$e_{q,p}(u) := e_q^{(u \ominus_q K_p(u))}\,p,$$

which are injective and whose ranges are denoted by $U_p$. For each $p \in M_\mu$
the map

$$s_{q,p} : U_p \to V_p$$

given by

$$s_{q,p}(z) := \ln_q\left(\frac{z}{p}\right) \ominus_q E_p\left[\ln_q\left(\frac{z}{p}\right)\right]$$

is precisely the inverse map of $e_{q,p}$. The maps $s_{q,p}$ are the coordinate maps
for the manifold and the family of pairs $(U_p, s_{q,p})_{p \in M_\mu}$ defines an
atlas on $M_\mu$; the transition maps (changes of coordinates), for each

$$u \in s_{q,p_1}(U_{p_1} \cap U_{p_2}),$$

are given by

$$s_{p_2}(e_{p_1}(u)) = \frac{u \oplus_q \ln_q\left(\frac{p_1}{p_2}\right) - E_{p_2}\left[u \oplus_q \ln_q\left(\frac{p_1}{p_2}\right)\right]}{1 + (1 - q)E_{p_2}\left[u \oplus_q \ln_q\left(\frac{p_1}{p_2}\right)\right]}.$$

Given

$$u \in s_{q,p_1}(U_{q,p_1} \cap U_{q,p_2}),$$

the derivative of the map $s_{q,p_2} \circ s_{q,p_1}^{-1}$ evaluated at u in the
direction of $v \in L^\infty(p_1 \cdot \mu)$ is of the form

$$D(s_{q,p_2} \circ s_{q,p_1}^{-1})(u) \cdot v = A(u) - B(u)E_{p_2}[A(u)],$$

where A(u), B(u) are functions depending on u. This allows one to establish the main
result of [15] (Theorem 14), that is, the collection of pairs
$\{(U_p, s_{q,p})\}_{p \in M_\mu}$ is a $C^\infty$-atlas modeled on $B_p$, and the
corresponding manifold is called the q-exponential statistical Banach manifold.
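On a finite sample space the chart and its inverse reduce to elementary arithmetic, which gives a convenient sanity check of the construction. The sketch below uses a four-point space with uniform reference measure; the helper names, the sample density p and the direction u are hypothetical choices, and integrals are replaced by finite sums.

```python
def exp_q(x, q):
    """q-deformed exponential (5.1), assuming 1 + (1 - q) x > 0."""
    return (1.0 + (1.0 - q) * x) ** (1.0 / (1.0 - q))

def ln_q(x, q):
    """q-deformed logarithm (5.2) for x > 0."""
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def q_diff(x, y, q):
    """q-difference (5.17)."""
    return (x - y) / (1.0 + (1.0 - q) * y)

def expect(values, density, mu):
    """E_p[values] on a finite sample space with reference weights mu."""
    return sum(v * d * m for v, d, m in zip(values, density, mu))

def e_chart(u, p, mu, q):
    """e_{q,p}(u) = exp_q(u (-)_q K_p(u)) p, with K_p(u) = ln_q E_p[exp_q(u)]."""
    big_k = ln_q(expect([exp_q(ui, q) for ui in u], p, mu), q)
    return [exp_q(q_diff(ui, big_k, q), q) * pi for ui, pi in zip(u, p)]

def s_chart(z, p, mu, q):
    """s_{q,p}(z) = ln_q(z/p) (-)_q E_p[ln_q(z/p)]."""
    logs = [ln_q(zi / pi, q) for zi, pi in zip(z, p)]
    mean_log = expect(logs, p, mu)
    return [q_diff(li, mean_log, q) for li in logs]

q = 0.5
mu = [0.25, 0.25, 0.25, 0.25]           # uniform reference probability
p = [0.4, 0.8, 1.2, 1.6]                # density w.r.t. mu, integrates to 1
raw = [0.3, -0.2, 0.1, -0.4]
mean = expect(raw, p, mu)
u = [r - mean for r in raw]             # centred: E_p[u] = 0, sup-norm below 1

z = e_chart(u, p, mu, q)
u_back = s_chart(z, p, mu, q)
total_mass = expect([1.0] * 4, z, mu)
```

The image z is again a probability density, and applying the coordinate map recovers u up to rounding.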
Finally, the tangent bundle of the manifold is characterized (Proposition 15 in
[15]) by regular curves on the manifold, as follows.
Let g(t) be a regular curve for $M_\mu$ where $g(t_0) = p$, and let
$u(t) \in V_z$ be its coordinate representation over $s_{q,z}$.
Then

$$g(t) = e_q^{[u(t) \ominus_q K_z(u(t))]}\,z$$

and:
1. $\frac{d}{dt}\ln_q\left(\frac{g(t)}{p}\right)\Big|_{t=t_0} = T u'(t) - Q[M_p(u(t))]^{1-q}E_p[u'(t)]$
for some constants T and Q.
2. If z = p, i.e. the charts are centered at the same point, the tangent vectors are
identified with the q-score function in t given by
$\frac{d}{dt}\ln_q\left(\frac{g(t)}{p}\right)\Big|_{t=t_0} = T u'(t_0)$.
3. Consider a two dimensional q-exponential model

$$f(t, q) = e_q^{(tu \ominus_q K_p(tu))}\,p \tag{5.37}$$

where, if q → 1 in (5.37), one obtains the one dimensional exponential models
$e^{(tu - K_p(tu))}p$.
The q-score function

$$\frac{d}{dt}\ln_q\left(\frac{f(t)}{p}\right)\Big|_{t=t_0}$$

for a one dimensional q-exponential model (5.37) belongs to span[u] at $t = t_0$,
where $u \in V_p$.
So the tangent space $T_p(M_\mu)$ is identified with the collection of q-score
functions or the one dimensional q-exponential models (5.37), where the charts
(trivializing mappings) are given by

$$(g, u) \in T(U_p) \to (s_{q,p}(g), A(u) - B(u)E_p[A(u)]),$$

defined in the collection of open subsets $U_p \times V_p$ of
$M_\mu \times L^\infty(p \cdot \mu)$.

5.5 Induced Geometry

In this section, we find a metric and then the connections of the manifold, derived
from the q-divergence functional and characterizing the geometry of the
q-exponential manifold. For further details see [22].
The q-divergence functional is given as follows [15].
Let f be a function, defined for all t ≠ 0 and 0 < q < 1, by

$$f(t) = -t\ln_q\left(\frac{1}{t}\right)$$

and for $p, z \in M_\mu$, the q-divergence of z with respect to p is given by

$$I^{(q)}(z\|p) := \int_\Omega p\,f\left(\frac{z}{p}\right)d\mu = \frac{1}{1 - q}\left[1 - \int_\Omega z^q p^{1-q}\,d\mu\right], \tag{5.38}$$

which is the Tsallis divergence functional [9]. Some properties of this functional
are well known, for example that it is equal to the α-divergence functional up to
a constant factor where α = 1 − 2q, satisfying the invariance criterion. Moreover,
when q → 0 then $I^{(q)}(z\|p) = 0$ and if q → 1 then
$I^{(q)}(z\|p) = K(z\|p)$, which is the Kullback-Leibler divergence functional [13].
As a consequence of Proposition (17) in [15], the manifold is related to the
q-divergence functional by

$$s_{q,p}(z) = \frac{1}{1 + (q - 1)I^{(q)}(p\|z)}\left[\ln_q\left(\frac{z}{p}\right) + I^{(q)}(p\|z)\right].$$

The following result is necessary to guarantee the existence of the Riemannian
metric of the manifold.

Proposition 10 Let $p, z \in M_\mu$; then

$$(d_u)_z I^{(q)}(z\|p)|_{z=p} = (d_v)_p I^{(q)}(z\|p)|_{z=p} = 0,$$

where the subscripts z and p mean that the directional derivative is taken with
respect to the first and the second arguments of $I^{(q)}(z\|p)$, respectively,
along the direction $u \in T_z(M_\mu)$ or $v \in T_p(M_\mu)$.

Proof Writing (5.38) as

$$I^{(q)}(z\|p) = \int_\Omega \frac{(1 - q)p + qz - p^{(1-q)}z^{(q)}}{1 - q}\,d\mu,$$

it follows that

$$(d_u)_z I^{(q)}(z\|p) = \frac{1}{1 - q}\int_\Omega \left[q - q\,p^{(1-q)}z^{(q-1)}\right]u\,d\mu$$

and

$$(d_v)_p I^{(q)}(z\|p) = \frac{1}{1 - q}\int_\Omega \left[(1 - q) - (1 - q)p^{(-q)}z^{(q)}\right]v\,d\mu;$$

when z = p the desired result holds. □



According to Proposition (16) in [15], the functional $I^{(q)}(z\|p)$ is bounded,
since:

$$I^{(q)}(z\|p) \ge 0 \quad \text{and equality holds iff } p = z$$

and

$$I^{(q)}(z\|p) \le \int_\Omega (z - p)\,f'\left(\frac{z}{p}\right)d\mu.$$

Then, together with the previous proposition, the q-divergence functional induces a
Riemannian metric g and a pair of connections, see Eguchi [8], given by:

$$g(u, v) = -(d_u)_z (d_v)_p I^{(q)}(z\|p)|_{z=p} \tag{5.39}$$

$$\langle \nabla^{(q)}_w u, v \rangle = -(d_w)_z (d_u)_z (d_v)_p I^{(q)}(z\|p)|_{z=p}, \tag{5.40}$$

where $v \in T_p(M_\mu)$, $u \in T_p(M_\mu)$ and w is a vector field.


Denote by $\Sigma(M_\mu)$ the set of vector fields $u : U_p \to T_p(U_p)$, and by
$F(M_\mu)$ the set of $C^\infty$ functions $f : U_p \to \mathbb{R}$. The following
result establishes the metric.
Proposition 11 Let $p, z \in M_\mu$ and let v, u be vector fields. The metric tensor
(field) $g : \Sigma(M_\mu) \times \Sigma(M_\mu) \to F(M_\mu)$ is given by

$$g(u, v) = q\int_\Omega \frac{uv}{p}\,d\mu.$$

Proof By direct calculation over $I^{(q)}(z\|p)$, one gets

$$(d_v)_p I^{(q)}(z\|p) = \frac{1}{1 - q}\int_\Omega \left[(1 - q) - (1 - q)p^{(-q)}z^{(q)}\right]v\,d\mu$$

and

$$(d_u)_z (d_v)_p I^{(q)}(z\|p) = -q\int_\Omega p^{(-q)}z^{(q-1)}uv\,d\mu,$$

so by (5.39), it follows that

$$g(u, v) = q\int_\Omega \frac{uv}{p}\,d\mu. \qquad \square$$

Note that when q → 1, this metric reduces to the one induced by the (α, β)-
divergence functional which induces the Fisher metric on parametric models.
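The formula for the metric can be compared with definition (5.39) on a finite sample space, where the mixed second derivative of the q-divergence is approximated by central finite differences. In this sketch the sample space, the density p, the directions u and v, and the step size are arbitrary illustrative choices, and integrals are finite sums against a uniform reference measure.

```python
def tsallis_divergence(z, p, mu, q):
    """I^(q)(z || p) as in (5.38), with the integral replaced by a finite sum."""
    overlap = sum(zi ** q * pi ** (1.0 - q) * mi for zi, pi, mi in zip(z, p, mu))
    return (1.0 - overlap) / (1.0 - q)

def shifted(base, direction, step):
    return [b + step * d for b, d in zip(base, direction)]

def metric_closed_form(u, v, p, mu, q):
    """g(u, v) = q * integral of u v / p, as in Proposition 11."""
    return q * sum(ui * vi / pi * mi for ui, vi, pi, mi in zip(u, v, p, mu))

def metric_finite_difference(u, v, p, mu, q, step=1e-3):
    """-(d_u)_z (d_v)_p I^(q)(z || p) at z = p, by a central mixed difference."""
    def value(s, t):
        return tsallis_divergence(shifted(p, u, s), shifted(p, v, t), mu, q)
    mixed = (value(step, step) - value(step, -step)
             - value(-step, step) + value(-step, -step)) / (4.0 * step ** 2)
    return -mixed

q = 0.5
mu = [0.25, 0.25, 0.25, 0.25]
p = [0.4, 0.8, 1.2, 1.6]
u = [0.1, -0.1, 0.2, -0.2]     # tangent directions: they integrate to zero
v = [-0.3, 0.1, 0.1, 0.1]

g_exact = metric_closed_form(u, v, p, mu, q)
g_numeric = metric_finite_difference(u, v, p, mu, q)
```

The two values should agree up to the truncation error of the difference quotient, which is of second order in the step size.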
The connections are characterized as follows.
Proposition 12 The family of covariant derivatives (connections)

$$\nabla^{(q)}_w u : \Sigma(M_\mu) \times \Sigma(M_\mu) \to \Sigma(M_\mu)$$

is given by

$$\nabla^{(q)}_w u = d_w u - \left(\frac{1 - q}{p}\right)uw.$$

Proof Considering $(d_u)_z (d_v)_p I^{(q)}(z\|p)$ as in the proof of Proposition 11,
we get

$$-(d_w)_z (d_u)_z (d_v)_p I^{(q)}(z\|p) = q\int_\Omega p^{(-q)}\left[(q - 1)z^{(q-2)}uw + z^{(q-1)}d_w u\right]v\,d\mu.$$

By the previous proposition it follows that

$$g\left(\nabla^{(q)}_w u, v\right) = q\int_\Omega \frac{\nabla^{(q)}_w u}{p}\,v\,d\mu$$

and then

$$g\left(\nabla^{(q)}_w u, v\right) = \langle \nabla^{(q)}_w u, v \rangle.$$

Then,

$$q\,p^{-1}\left[(q - 1)p^{-1}uw + d_w u\right] = q\,\frac{\nabla^{(q)}_w u}{p},$$

so

$$\nabla^{(q)}_w u = d_w u - \left(\frac{1 - q}{p}\right)uw. \qquad \square$$

It is easy to prove that the associated conjugate connection is given by
$\nabla^{*(q)}_w u = d_w u - \frac{q}{p}uw$. Notice that taking
$q = \frac{1 - \alpha}{2}$ yields Amari's one-parameter family of α-connections in
the form

$$\nabla^{(\alpha)}_w u = d_w u - \left(\frac{1 + \alpha}{2p}\right)uw;$$

and taking $q = \frac{1}{2}$ the Levi-Civita connection results.
Finally, the geometry of the manifold is characterized by calculating the curvature
and torsion tensors, which will be proved to be zero.
Proposition 13 For the q-exponential manifold and the connection given in the
previous proposition, the curvature tensor and the torsion tensor satisfy
R(u, v, w) = 0 and T(u, v) = 0.
Proof Remember that

$$R(u, v, w) = \nabla_u \nabla_v w - \nabla_v \nabla_u w - \nabla_{[u,v]} w \tag{5.41}$$

$$T(u, v) = \nabla_u v - \nabla_v u - [u, v]. \tag{5.42}$$

Using the general form

$$\nabla_v w = d_v w + \Gamma(v, w),$$

where

$$\Gamma : \Sigma(M_\mu) \times \Sigma(M_\mu) \to \Sigma(M_\mu)$$

is a bilinear form (the counterpart of the Christoffel symbol); and since

$$d_u(\nabla_v w) = d_u(d_v w) + \Gamma(d_u v, w) + \Gamma(v, d_u w) + d_u\Gamma,$$

where $d_u\Gamma = \Gamma' u$ is the derivative of the bilinear form Γ, it follows that

$$\nabla_u \nabla_v w = d_u(d_v w) + \Gamma(d_u v, w) + \Gamma(v, d_u w) + d_u\Gamma + \Gamma(u, d_v w) + \Gamma(u, \Gamma(v, w)),$$

and

$$\nabla_v \nabla_u w = d_v(d_u w) + \Gamma(d_v u, w) + \Gamma(u, d_v w) + d_v\Gamma + \Gamma(v, d_u w) + \Gamma(v, \Gamma(u, w));$$

moreover,

$$\nabla_{[u,v]} w = d_{[u,v]} w + \Gamma(d_u v - d_v u, w) = d_{[u,v]} w + \Gamma(d_u v, w) - \Gamma(d_v u, w),$$

where $[u, v] = d_u v - d_v u$; and then

$$\nabla_{[u,v]} w = d_u(d_v w) - d_v(d_u w) + \Gamma(d_u v, w) - \Gamma(d_v u, w).$$

Substituting in (5.41) and (5.42) it follows that

$$R(u, v, w) = \Gamma(u, \Gamma(v, w)) - \Gamma(v, \Gamma(u, w)) + d_u\Gamma(v, w) - d_v\Gamma(u, w),$$

and

$$T(u, v) = \Gamma(u, v) - \Gamma(v, u).$$

Since $\Gamma(u, v) = -\frac{1 - q}{p}uv$ and $d_u\Gamma(v, w) = \frac{1 - q}{p^2}uvw$,
one gets:

$$R(u, v, w) = \Gamma\left(u, -\frac{1 - q}{p}vw\right) - \Gamma\left(v, -\frac{1 - q}{p}uw\right) + \frac{1 - q}{p^2}uvw - \frac{1 - q}{p^2}vuw$$
$$= -\frac{1 - q}{p}u\left[-\frac{1 - q}{p}vw\right] + \frac{1 - q}{p}v\left[-\frac{1 - q}{p}uw\right] = 0$$

and

$$T(u, v) = -\frac{1 - q}{p}uv + \frac{1 - q}{p}vu = 0. \qquad \square$$


Since the mapping $\nabla^{(q)} \leftrightarrow \nabla^{(\alpha)}$ is smooth, it is
expected that the geodesic curves and parallel transports obtained from the
q-connections preserve a smooth isomorphism with the curves given by the
α-connections. Also, it remains to be investigated whether the metric tensor field
in Proposition 11 is given by a conformal transformation of the Fisher information
metric.

References

1. Amari, S.: Differential-Geometrical Methods in Statistics. Springer, New York (1985)
2. Amari, S., Nagaoka, H.: Methods of Information Geometry. American Mathematical Society,
Providence, RI (2000) (Translated from the 1993 Japanese original by Daishi Harada)
3. Amari, S., Ohara, A.: Geometry of q-exponential family of probability distributions. Entropy
13, 1170–85 (2011)
4. Borges, E.P.: Manifestações dinâmicas e termodinâmicas de sistemas não-extensivos. Tese de
Doutorado, Centro Brasileiro de Pesquisas Físicas, Rio de Janeiro (2004)
5. Cena, A., Pistone, G.: Exponential statistical manifold. Ann. Inst. Stat. Math. 59, 27–56 (2006)
6. Dawid, A.P.: On the concepts of sufficiency and ancillarity in the presence of nuisance parame-
ters. J. Roy. Stat. Soc. B 37, 248–258 (1975)
7. Efron, B.: Defining the curvature of a statistical problem (with applications to second order
efficiency). Ann. Stat. 3, 1189–242 (1975)
8. Eguchi, S.: Second order efficiency of minimum contrast estimator in a curved exponential
family. Ann. Stat. 11, 793–803 (1983)
9. Furuichi, S.: Fundamental properties of Tsallis relative entropy. J. Math. Phys. 45, 4868–77
(2004)
10. Gibilisco, P., Pistone, G.: Connections on non-parametric statistical manifolds by Orlicz space
geometry. Inf. Dim. Anal. Quantum Probab. Relat. Top. 1, 325–47 (1998)
11. Jeffreys, H.: An invariant form for the prior probability in estimation problems. Proc. Roy. Soc.
A 186, 453–61 (1946)
12. Kadets, M.I., Kadets, V.M.: Series in Banach Spaces: Conditional and Unconditional
Convergence. Birkhäuser Verlag, Basel (1997) (Translated from the Russian by Andrei Iacob)
13. Kullback, S., Leibler, R.A.: On information and sufficiency. Ann. Math. Stat. 22, 79–86 (1951)
14. Loaiza, G., Quiceno, H.: A Riemannian geometry in the q-exponential Banach manifold
induced by q-divergences. In: Geometric Science of Information, Proceedings of First Interna-
tional Conference on GSI 2013, pp. 737–742. Springer, Paris (2013)
15. Loaiza, G., Quiceno, H.R.: A q-exponential statistical Banach manifold. J. Math. Anal. Appl.
398, 446–6 (2013)
16. Naudts, J.: The q-exponential family in statistical physics. J. Phys. Conf. Ser. 201, 012003
(2010)
17. Pistone, G.: k-exponential models from the geometrical viewpoint. Eur. Phys. J. B 70, 29–37
(2009)
18. Pistone, G., Rogantin, M.-P.: The exponential statistical manifold: Parameters, orthogonality
and space transformations. Bernoulli 4, 721–760 (1999)
19. Pistone, G., Sempi, C.: An infinite-dimensional geometric structure on the space of all the
probability measures equivalent to a given one. Ann. Stat. 23(5), 1543–1561 (1995)
20. Rao, C.R.: Information and accuracy attainable in estimation of statistical parameters. Bull.
Calcutta Math. Soc. 37, 81–91 (1945)
21. Tsallis, C.: Possible generalization of Boltzmann-Gibbs statistics. J. Stat. Phys. 52, 479–487
(1988)
22. Zhang, J.: Referential duality and representational duality on statistical manifolds. In: Proceed-
ings of the 2nd International Symposium on Information Geometry and its Applications, pp.
58–67. Tokyo (2005)
Chapter 6
Computational Algebraic Methods
in Efficient Estimation

Kei Kobayashi and Henry P. Wynn

Abstract A strong link between information geometry and algebraic statistics is


made by investigating statistical manifolds which are algebraic varieties. In particular
it is shown how first and second order efficient estimators can be constructed, such as
bias corrected Maximum Likelihood and more general estimators, and for which the
estimating equations are purely algebraic. In addition it is shown how Gröbner basis
technology, which is at the heart of algebraic statistics, can be used to reduce the
degrees of the terms in the estimating equations. This points the way to the feasible
use, to find the estimators, of special methods for solving polynomial equations,
such as homotopy continuation methods. Simple examples are given showing both
equations and computations.

6.1 Introduction

Information geometry gives geometric insights and methods for studying the statis-
tical efficiency of estimators, testing, prediction and model selection. The field of
algebraic statistics has proceeded somewhat separately but recently a positive effort
is being made to bring the two subjects together, notably [15]. This paper should be
seen as part of this effort.
A straightforward way of linking the two areas is to ask how far algebraic methods
can be used when the statistical manifolds of information geometry are algebraic,
that is algebraic varieties or derived forms, such as rational quotients. We call such
models “algebraic statistical models” and will give formal definitions.

K. Kobayashi (B)
The Institute of Statistical Mathematics, 10-3, Midori-cho,
Tachikawa, Tokyo, Japan
e-mail: [email protected]
H. P. Wynn
London School of Economics, London, UK
e-mail: [email protected]

F. Nielsen (ed.), Geometric Theory of Information, 119


Signals and Communication Technology, DOI: 10.1007/978-3-319-05317-2_6,
© Springer International Publishing Switzerland 2014

In the standard theory for non-singular statistical models, maximum likelihood
estimators (MLEs) have first-order asymptotic efficiency and bias-corrected
MLEs have second-order asymptotic efficiency. A short section covers briefly the
basic theory of asymptotic efficiency using differential geometry, necessary for our
development.
We shall show that for some important algebraic models, the estimating equations
of MLE type become polynomial and the degrees usually become very high if the
model has a high-dimensional parameter space. In this paper, asymptotically efficient
algebraic estimators, a generalization of bias corrected MLE, are studied. By alge-
braic estimators we mean estimators which are the solution of algebraic equations.
A main result is that for an (algebraic) curved exponential family, there are
second-order efficient estimators whose polynomial degree is at most two. These are
computed by decreasing the degree of the estimating equations using Gröbner basis
methods, the main tool of algebraic statistics. We supply some of the basic Gröbner
theory in Appendix A. See [22].
The reduction of the degree saves computational costs dramatically when we use
computational methods for solving the algebraic estimating equations. Here we use
homotopy continuation methods of [19, 24] to demonstrate this effect for a few
simple examples, for which we are able to carry out the Gröbner basis reduction.
Appendix B discusses homotopy continuation methods.
Although, as mentioned, the links between computational algebraic methods and
the theory of efficient estimators based on differential geometry are recent, two
other areas of statistics, not covered here, exploit differential geometry methods.
The first is tube theory. The seminal paper by Weyl [26] has been used to give exact
confidence level values (size of tests), and bounds, for certain Gaussian simultaneous
inference problems: [17, 20]. This is very much related to the theory of up-crossings
of Gaussian processes using expected Euler characteristic methods, see [1] and earlier
papers. The second area is the use of the resolution of singularities (incidentally
related to the tube theory) in which confidence levels are related to the dimension
and the solid angle tangent of cones with apex at a singularity in parameter space
[12, 25]. Moreover, the degree of estimating equations for MLE has been studied
for some specific algebraic models, which are not necessarily singular [11]. In this
paper we cover the non-singular case, for rather more general estimators than MLE,
and show that algebraic methods have a part to play.
Most of the theories in the paper can be applied to a wider class of Multivariate
Gaussian models with some restrictions on their covariance matrices, for example
models studied in [6, 14]. Although the second-order efficient estimators proposed in
the paper could in principle be applied to them, the cost of computing Gröbner bases
prevents their direct application. Further innovation in algebraic computation is
required for real applications, which is a feature of several other areas of algebraic
statistics.
Section 6.2 gives some basic background in estimation and differential geome-
try for it. Sections 6.3 and 6.4, which are the heart of the paper, give the algebraic
developments and Sect. 6.5 gives some examples. Section 6.6 carries out some com-
putation using homotopy continuation.
6 Computational Algebraic Methods in Efficient Estimation 121

6.2 Statistical Manifolds and Efficiency of Estimators

In this section, we introduce the standard setting of statistical estimation theory, via
information geometry. See [2, 4] for details. It is recognized that the ideas go back to
at least the work of Rao [23], Efron [13] and Dawid [10]. The subject of information
geometry was initiated by Amari and his collaborators [3, 5].
Central to this family of ideas is that the rates of convergence of statistical esti-
mators and other test statistics depend on the metric and curvature of the parametric
manifolds in a neighborhood of the MLE or the null hypothesis. In addition Amari
realized the importance of two special models, the affine exponential model and the
affine mixture model, e and m frame respectively. In this paper we concentrate on
the exponential family model but also look at curved subfamilies. By extending the
dimension of the parameter space of the exponential family, we are able to cover
some classes of mixture models. The extension of the exponential model to infinite
dimensions is covered by [21].

6.2.1 Exponential Family and Estimators

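A full exponential family is a set of probability distributions $\{dP(x \mid \alpha) \mid \alpha \in \Theta\}$ with a parameter space $\Theta \subset \mathbb{R}^d$ such that

$$dP(x \mid \alpha) = \exp\bigl(x_i \alpha^i - \psi(\alpha)\bigr)\, d\tau,$$

where $x \in \mathbb{R}^d$ is a variable representing a sufficient statistic, $\psi$ is the potential function and $\tau$ is a carrier measure on $\mathbb{R}^d$. Here $x_i \alpha^i$ means $\sum_i x_i \alpha^i$ (Einstein summation notation).
We call $\alpha$ a natural parameter and $\delta = \delta(\alpha) := E[x \mid \alpha]$ an expectation parameter.
Denote $E = E(\Theta) := \{\delta(\alpha) \mid \alpha \in \Theta\} \subset \mathbb{R}^d$ as the corresponding expectation
parameter space. Note that the relation $\delta(\alpha) = \nabla_\alpha \psi(\alpha)$ holds. If the parameter space
is restricted to a subset $V_\Theta \subset \Theta$, we obtain a curved exponential family

$$\{dP(x \mid \alpha) \mid \alpha \in V_\Theta\}.$$

The corresponding space of the expectation parameter is denoted by $V_E := \{\delta(\alpha) \mid \alpha \in V_\Theta\} \subset E$.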
Figure 6.1 explains how to define an estimator by a local coordinate. Let $(u, v) \in \mathbb{R}^p \times \mathbb{R}^{d-p}$, where $p$ is the dimension of $V_\Theta$, be a local coordinate system around the true
parameter $\alpha^*$ and define $U \subset \mathbb{R}^p$ such that $\{\alpha(u, 0) \mid u \in U\} = V_\Theta$. For a full exponential
model with $N$ samples, by composing the map $(X^{(1)}, \ldots, X^{(N)}) \mapsto \alpha(\delta)|_{\delta = \bar{X}}$
with the coordinate projection map $\alpha(u, v) \mapsto u$, we can define a (local)
estimator $(X^{(1)}, \ldots, X^{(N)}) \mapsto u$. We define an estimator by $\delta(u, v)$ similarly. Since
$\bar{X}$ is a sufficient statistic of $\alpha$ (and $\delta$) in the full exponential family, every estimator
can be computed from $\bar{X}$ rather than the original data $\{X^{(i)}\}$. Therefore in the rest of the
paper, we write $X$ as shorthand for $\bar{X}$.
122 K. Kobayashi and H. P. Wynn

Fig. 6.1 A projection to the model manifold according to a local coordinate defines an estimator

6.2.2 Differential Geometrical Entities

Let w := (u, v) and use indexes {i, j, . . .} for α and δ, {a, b, . . .} for u, {ψ, φ, . . .}
for v and {β, γ, . . .} for w. The following are used for expressing conditions for
asymptotic efficiency of estimators, where Einstein notation is used.

Differential geometrical entities


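– $\delta_i(\alpha) = \dfrac{\partial}{\partial \alpha^i} \psi(\alpha)$,
– Fisher metric $G = (g_{ij})$ w.r.t. $\alpha$: $g_{ij}(\alpha) = \dfrac{\partial^2 \psi(\alpha)}{\partial \alpha^i \partial \alpha^j}$,
– Fisher metric $\bar{G} = (g^{ij})$ w.r.t. $\delta$: $\bar{G} = G^{-1}$,
– Jacobian: $B_{i\beta}(\alpha) := \dfrac{\partial \delta_i(w)}{\partial w^\beta}$,
– e-connection: $\Gamma^{(e)}_{\beta\gamma,\xi} = \left(\dfrac{\partial^2}{\partial w^\beta \partial w^\gamma} \alpha^i(w)\right) \left(\dfrac{\partial}{\partial w^\xi} \delta_i(w)\right)$,
– m-connection: $\Gamma^{(m)}_{\beta\gamma,\xi} = \left(\dfrac{\partial^2}{\partial w^\beta \partial w^\gamma} \delta_i(w)\right) \left(\dfrac{\partial}{\partial w^\xi} \alpha^i(w)\right)$.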
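
As a small numerical illustration of the first three entities, the following sketch differentiates a potential by finite differences. It assumes the potential of independent Poisson variables, $\psi(\alpha) = \sum_i \exp(\alpha^i)$, which is the family used in Sect. 6.5.2; the helper names `gradient` and `hessian` and the step sizes are choices made here for illustration, not part of the paper.

```python
import math

def psi(alpha):
    """Potential of independent Poisson variables: psi(alpha) = sum_i exp(alpha_i)."""
    return sum(math.exp(a) for a in alpha)

def gradient(f, x, h=1e-5):
    """Central-difference gradient of f at x."""
    grad = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        grad.append((f(xp) - f(xm)) / (2 * h))
    return grad

def hessian(f, x, h=1e-4):
    """Central-difference Hessian of f at x."""
    n = len(x)
    hess = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            def shifted(si, sj):
                y = list(x)
                y[i] += si * h
                y[j] += sj * h
                return f(y)
            hess[i][j] = (shifted(1, 1) - shifted(1, -1)
                          - shifted(-1, 1) + shifted(-1, -1)) / (4 * h * h)
    return hess

alpha = [math.log(2.0), math.log(0.5), math.log(3.0)]
delta = gradient(psi, alpha)   # expectation parameter, close to [2.0, 0.5, 3.0]
G = hessian(psi, alpha)        # Fisher metric w.r.t. alpha, close to diag(delta)
# G is diagonal for this family, so its inverse (the metric w.r.t. delta) is diag(1/delta).
G_bar = [[(1.0 / G[i][i] if i == j else 0.0) for j in range(3)] for i in range(3)]
```

For this family the gradient reproduces $\delta_i = \exp(\alpha^i)$ and the Hessian is diagonal with entries $\delta_i$, in line with the Fisher metric quoted for the log marginal model.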

6.2.3 Asymptotic Statistical Inference Theory

Under some regularity conditions on the carrier measure $\tau$, the potential function $\psi$ and
the manifolds $V_\Theta$ or $V_E$, the asymptotic theory below is available. These conditions
guarantee the finiteness of the moments and the commuting of the
expectation and the partial derivative, $\frac{\partial}{\partial \alpha} E_\alpha[f] = E_\alpha\left[\frac{\partial f}{\partial \alpha}\right]$. For more details of the
required regularity conditions, see Sect. 2.1 of [4].
1. If $\hat{u}$ is a consistent estimator (i.e. $P(\|\hat{u} - u\| > \eta) \to 0$ as $N \to \infty$ for any
$\eta > 0$), the squared error matrix of $\hat{u}$ is

$$E_u[(\hat{u} - u)(\hat{u} - u)^\top] = E_u[(\hat{u}^a - u^a)(\hat{u}^b - u^b)] = N^{-1} [g_{ab} - g_{a\psi}\, g^{\psi\phi}\, g_{b\phi}]^{-1} + O(N^{-2}).$$

Here $[\cdot]^{-1}$ means the matrix inverse. Thus, if $g_{a\psi} = 0$ for all $a$ and $\psi$, the leading
term on the r.h.s. is minimized. We call such an estimator a first-order
efficient estimator.
2. The bias term becomes

$$E_u[\hat{u}^a - u^a] = (2N)^{-1} b^a(u) + O(N^{-2})$$

for each $a$, where $b^a(u) := \Gamma^{(m)a}{}_{cd}(u)\, g^{cd}(u)$. Then, the bias corrected estimator
$\check{u}^a := \hat{u}^a - b^a(\hat{u})$ satisfies $E_u[\check{u}^a - u^a] = O(N^{-2})$.
3. Assume $g_{a\psi} = 0$ for all $a$ and $\psi$; then the squared error matrix is represented by

$$E_u[(\check{u}^a - u^a)(\check{u}^b - u^b)] = \frac{1}{N} g^{ab} + \frac{1}{2N^2} \left( (\Gamma^{(m)}_M)^{2ab} + 2 (H^{(e)}_M)^{2ab} + (H^{(m)}_A)^{2ab} \right) + o(N^{-2}).$$

See Theorem 5.3 of [4] and Theorem 4.4 of [2] for the definition of the terms in
the r.h.s. Of the four dominating terms in the r.h.s., only

$$(H^{(m)}_A)^{2ab} := g^{\psi\mu} g^{\phi\tau} H^{(m)a}{}_{\psi\phi} H^{(m)b}{}_{\mu\tau}$$

depends on the selection of the estimator.

Here $H^{(m)a}{}_{\psi\phi}$ is an embedding curvature and equal to $\Gamma^{(m)a}{}_{\psi\phi}$ when $g_{a\psi} = 0$ for
every $a$ and $\psi$. Since $(H^{(m)}_A)^{2ab}$ is the square of $\Gamma^{(m)a}{}_{\psi\phi}$, the squared error matrix
attains the minimum in the sense of positive definiteness if and only if

$$\Gamma^{(m)}_{\psi\phi,a}(w)\Big|_{v=0} = \left(\frac{\partial^2}{\partial v^\psi \partial v^\phi} \delta_i(w)\right) \left(\frac{\partial}{\partial u^a} \alpha^i(w)\right)\bigg|_{v=0} = 0. \qquad (6.1)$$

Therefore the elimination of the m-connection (6.1) implies second-order
efficiency of the estimator after a bias correction, i.e. it becomes optimal among
the bias-corrected first-order efficient estimators up to $O(N^{-2})$.
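
The role of the condition $g_{a\psi} = 0$ in item 1 can be seen in the smallest case $p = 1$, $d - p = 1$, where the bracket is the scalar Schur complement $g_{aa} - g_{a\psi}^2 / g_{\psi\psi}$. The sketch below uses made-up metric entries chosen only for illustration; the function name `leading_variance` is hypothetical.

```python
def leading_variance(g_aa, g_av, g_vv):
    """N times the leading term of the squared error for p = 1, d - p = 1:
    the inverse of the Schur complement g_aa - g_av * g_vv^{-1} * g_av."""
    schur = g_aa - g_av * g_av / g_vv
    return 1.0 / schur

# Illustrative metric entries (assumed values, not taken from the paper).
orthogonal = leading_variance(2.0, 0.0, 1.0)  # g_av = 0: first-order efficient case
oblique = leading_variance(2.0, 0.8, 1.0)     # g_av != 0: larger leading term
```

With these values the orthogonal case gives $1/2$ while the oblique case gives $1/1.36$, so any non-zero $g_{a\psi}$ inflates the leading term.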

6.3 Algebraic Models and Efficiency of Algebraic Estimators

This section studies asymptotic efficiency for statistical models and estimators which
are defined algebraically. Many models in statistics are defined algebraically. Perhaps
the best known are polynomial regression models and algebraic conditions on
probability models such as independence and conditional independence. Recently
there has been considerable interest in marginal models [7], which are typically linear
restrictions on raw probabilities. In time series, autoregressive models expressed by
linear transfer functions induce algebraic restrictions on covariance matrices. Our
desire is to have a definition of algebraic statistical model which can be expressed
from within the curved exponential family framework but is sufficiently broad to
cover cases such as those just mentioned. Our solution is to allow algebraic conditions
in the natural parameter α, mean parameter δ or both. The second way in which
algebra enters is in the form of the estimator.

6.3.1 Algebraic Curved Exponential Family

We say a curved exponential family is algebraic if the following two conditions are
satisfied.
(C1) $V_\Theta$ or $V_E$ is represented by a real algebraic variety, i.e. $V_\Theta := V(f_1, \ldots, f_k)
= \{\alpha \in \mathbb{R}^d \mid f_1(\alpha) = \cdots = f_k(\alpha) = 0\}$ or similarly $V_E := V(g_1, \ldots, g_k)$ for
$f_i \in \mathbb{R}[\alpha^1, \ldots, \alpha^d]$ and $g_i \in \mathbb{R}[\delta_1, \ldots, \delta_d]$.
(C2) $\alpha \mapsto \delta(\alpha)$ or $\delta \mapsto \alpha(\delta)$ is represented by some algebraic equations, i.e. there are
$h_1, \ldots, h_k \in \mathbb{R}[\alpha, \delta]$ such that locally in $V_\Theta \times V_E$, $h_i(\alpha, \delta) = 0$ iff $\delta(\alpha) = \delta$
or $\alpha(\delta) = \alpha$.
Here $\mathbb{R}[\alpha^1, \ldots, \alpha^d]$ denotes the ring of polynomials in $\alpha^1, \ldots, \alpha^d$ over the real number field $\mathbb{R}$
and $\mathbb{R}[\alpha, \delta]$ means $\mathbb{R}[\alpha^1, \ldots, \alpha^d, \delta_1, \ldots, \delta_d]$. The integer $k$, the number of generators,
is not necessarily equal to $d - p$, but we assume $V_\Theta$ (or $V_E$) has dimension $p$
around the true parameter. Note that if $\psi(\alpha)$ is a rational form or the logarithm of a
rational form, (C2) is satisfied.

6.3.2 Algebraic Estimators

The parameter set $V_\Theta$ (or $V_E$) is sometimes singular for algebraic models. But
throughout the following analysis, we assume non-singularity around the true parameter
$\alpha^* \in V_\Theta$ (or $\delta^* \in V_E$ respectively).
Following the discussion at the end of Sect. 6.2.1, we call $\alpha(u, v)$ or $\delta(u, v)$ an
algebraic estimator if
(C3) $w \mapsto \delta(w)$ or $w \mapsto \alpha(w)$ is represented algebraically.

We remark that the MLE for an algebraic curved exponential family is an algebraic
estimator.
If conditions (C1), (C2) and (C3) hold, then all of the geometrical entities in
Sect. 6.2.2 are characterized by special polynomial equations. Furthermore, if $\psi(\alpha) \in
\mathbb{R}(\alpha) \cup \log \mathbb{R}(\alpha)$ and $\alpha(w) \in \mathbb{R}(w) \cup \log \mathbb{R}(w)$, then the geometrical objects have
the additional property of being rational.

6.3.3 Second-Order Efficient Algebraic Estimators, Vector Version

Consider an algebraic estimator $\delta(u, v) \in \mathbb{R}[u, v]^d$ satisfying the following vector
equation:

$$X = \delta(u, 0) + \sum_{i=p+1}^{d} v_{i-p}\, e_i(u) + c \cdot \sum_{j=1}^{p} f_j(u, v)\, e_j(u) \qquad (6.2)$$

where, for each $u$, $\{e_j(u);\ j = 1, \ldots, p\} \cup \{e_i(u);\ i = p + 1, \ldots, d\}$ is a complete
basis of $\mathbb{R}^d$ such that $\langle e_i(u), \nabla_u \delta \rangle_{\bar{G}} = 0$ for $i = p + 1, \ldots, d$ and $f_j(u, v) \in \mathbb{R}[u][v]_{\ge 3}$, namely a
polynomial whose degree in $v$ is at least 3 with coefficients polynomial in $u$, for
$j = 1, \ldots, p$. Recall that we use the notation $X = \bar{X} = \frac{1}{N} \sum_i X^{(i)}$. The constant $c$
controls the perturbation (see below).
A straightforward computation of the m-connection in (6.1) at $v = 0$ for

$$\delta(w) = \delta(u, 0) + \sum_{i=p+1}^{d} v_{i-p}\, e_i(u) + c \cdot \sum_{j=1}^{p} f_j(u, v)\, e_j(u)$$

shows it to be zero. This gives


Theorem 1 Vector equation (6.2) satisfies the second-order efficiency (6.1).
We call (6.2) a vector version of a second-order efficient estimator. Note that if
c = 0, (6.2) gives an estimating equation for the MLE. Thus the last term in (6.2)
can be recognized as a perturbation from the MLE.
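
The computation behind Theorem 1 can be written out in a few lines. The following is a sketch using only the definitions above; the intermediate step (that the second derivatives of the perturbation vanish at $v = 0$) is filled in here and is not spelled out in the text.

```latex
% Sketch: why the vector version (6.2) satisfies condition (6.1).
\begin{align*}
\delta(w) &= \delta(u,0) + \sum_{i=p+1}^{d} v_{i-p}\, e_i(u)
            + c \sum_{j=1}^{p} f_j(u,v)\, e_j(u), \\
\frac{\partial^2 \delta(w)}{\partial v^{\psi}\,\partial v^{\phi}}\bigg|_{v=0}
  &= c \sum_{j=1}^{p}
     \frac{\partial^2 f_j(u,v)}{\partial v^{\psi}\,\partial v^{\phi}}\bigg|_{v=0} e_j(u)
   = 0
   \quad\text{(every monomial of $f_j$ has degree $\ge 3$ in $v$)}, \\
\Gamma^{(m)}_{\psi\phi,a}(w)\Big|_{v=0}
  &= \left(\frac{\partial^2 \delta_i(w)}{\partial v^{\psi}\,\partial v^{\phi}}\right)
     \left(\frac{\partial \alpha^i(w)}{\partial u^a}\right)\bigg|_{v=0} = 0 .
\end{align*}
```

The term linear in $v$ drops out of the second derivative, and a perturbation of degree at least 3 in $v$ contributes nothing at $v = 0$, which is why the degree condition on $f_j$ appears in the definition.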
Figure 6.2 is a rough sketch of the second-order efficient estimators. Here the
model is embedded in an m-affine space. Given a sample (red point), the MLE
is an orthogonal projection (yellow point) to the model with respect to the Fisher
metric. But a second-order efficient estimator maps the sample to the model along a
“cubically” curved manifold (red curve).

Fig. 6.2 Image of the vector version of the second-order efficient estimators

6.3.4 Second-Order Efficient Algebraic Estimators, Algebraic Version

Another class of second-order efficient algebraic estimators, which we call the algebraic
version, is defined by the following simultaneous polynomial equations with
$\delta_u = \delta(u, 0)$:

$$(X - \delta_u)^\top \tilde{e}_j(u, \delta_u) + c \cdot h_j(X, u, \delta_u, X - \delta_u) = 0 \quad \text{for } j = 1, \ldots, p \qquad (6.3)$$

where $\{\tilde{e}_j(u, \delta_u) \in \mathbb{R}[u, \delta_u]^d;\ j = 1, \ldots, p\}$ span $((\nabla_u \delta(u, 0))^{\perp_{\bar{G}}})^{\perp_E}$ for every
$u$ and $h_j(X, u, \delta_u, t) \in \mathbb{R}[X, u, \delta_u][t]_3$ (degree = 3 in $t$) for $j = 1, \ldots, p$. The
constant $c$ controls the perturbation. The notation $\bar{G}$ represents the Fisher metric
on the full exponential family with respect to $\delta$. The notation $(\nabla_u \delta(u, 0))^{\perp_{\bar{G}}}$ means
the subspace orthogonal to $\mathrm{span}(\partial_a \delta(u, 0))_{a=1}^{p}$ with respect to $\bar{G}$ and $(\cdot)^{\perp_E}$ means
the orthogonal complement in the sense of Euclidean vector space. Here, the term
“degree” of a polynomial means the maximum degree of its terms. Note that the
case $(X - \delta_u)^\top \tilde{e}_j(u, \delta_u) = 0$ for $j = 1, \ldots, p$ gives a special set of the estimating
equations of the MLE.

Theorem 2 An estimator defined by a vector version (6.2) of the second-order efficient
estimators is also represented by an algebraic version (6.3) where $h_j(X, u,
\delta_u, t) = \tilde{f}_j(u, (\tilde{e}_i^\top t)_{i=1}^{p}, (\tilde{e}_i^\top (X - \delta_u))_{i=1}^{p})$ with a function $\tilde{f}_j(u, v, \tilde{v}) \in \mathbb{R}[u, \tilde{v}][v]_3$
such that $\tilde{f}(u, v, v) = f(u, v)$.

Proof Take the Euclidean inner product of both sides of (6.2) with each $\tilde{e}_j$, which is
a vector Euclidean-orthogonal to the subspace $\mathrm{span}(\{e_i \mid i \ne j\})$, and obtain a system
of polynomial equations. By eliminating the variables $v$ from the polynomial equations,
an algebraic version is obtained. □

Theorem 3 Every algebraic equation (6.3) gives a second-order efficient estimator (6.1).

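Proof Writing $X = \delta(u, v)$ in (6.3), we obtain

$$(\delta(u, v) - \delta(u, 0))^\top \tilde{e}_j(u) + c \cdot h_j(\delta(u, v), u, \delta(u, 0), \delta(u, v) - \delta(u, 0)) = 0.$$

Partially differentiating this with respect to $v$ twice, we obtain

$$\left(\frac{\partial^2 \delta(u, v)}{\partial v^\phi \partial v^\psi}\right)^{\top} \tilde{e}_j(u)\bigg|_{v=0} = 0,$$

since each term of $h_j(\delta(u, v), u, \delta(u, 0), \delta(u, v) - \delta(u, 0))$ has degree at least 3
in its last argument $(\delta_i(u, v) - \delta_i(u, 0))_{i=1}^{d}$ and $\delta(u, v) - \delta(u, 0)|_{v=0} = 0$. Since

$$\mathrm{span}\{\tilde{e}_j(u);\ j = 1, \ldots, p\} = ((\nabla_u \delta(u, 0))^{\perp_{\bar{G}}})^{\perp_E} = \mathrm{span}\{\bar{G}\, \partial_{u^a} \delta;\ a = 1, \ldots, p\},$$

we obtain

$$\Gamma^{(m)}_{\psi\phi a}\Big|_{v=0} = \frac{\partial^2 \delta_i}{\partial v^\phi \partial v^\psi}\, g^{ij}\, \frac{\partial \delta_j}{\partial u^a}\bigg|_{v=0} = 0.$$

This implies the estimator is second-order efficient. □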

By Theorems 1, 2 and 3, the relationship between the three forms of the second-order
efficient algebraic estimators is summarized as

$$(6.1) \Leftarrow (6.2) \Rightarrow (6.3) \Rightarrow (6.1).$$

Furthermore, if we assume the estimator has a form $\delta \in \mathbb{R}(u)[v]$, that is, a polynomial
in $v$ with coefficients rational in $u$, every first-order efficient estimator satisfying (6.1)
can be written in the form (6.2) after resetting coordinates $v$ for the estimating manifold.
In this sense, we can say $(6.1) \Rightarrow (6.2)$ and the following corollary holds.
Corollary 1 If $\delta \in \mathbb{R}(u)[v]$, the forms (6.1), (6.2) and (6.3) are equivalent.

6.3.5 Properties of the Estimators

The following proposition is a straightforward extension of the local existence of the MLE,
that is to say, existence for sufficiently large sample size. The regularity
conditions are essentially the same as for the MLE but with an additional condition
referring to the control constant $c$.
Proposition 1 (Existence and uniqueness of the estimate) Assume that the Fisher
matrix is non-degenerate around $\delta(u^*) \in V_E$. Then the estimate given by (6.2) or
(6.3) locally uniquely exists for small $c$, i.e. there is a neighborhood $G(u^*) \subset \mathbb{R}^d$ of
$\delta(u^*)$ and $\kappa > 0$ such that for every fixed $X \in G(u^*)$ and $-\kappa < c < \kappa$, a unique
estimate exists.
Proof Under the condition of the proposition, the MLE always exists locally. Furthermore,
because of the nonsingular Fisher matrix, the MLE is locally bijective (by the
implicit function theorem). Thus $(u_1, \ldots, u_p) \mapsto (g_1(X - \delta_u), \ldots, g_p(X - \delta_u))$
for $g_j(X - \delta_u) := (X - \delta_u)^\top \tilde{e}_j(u, \delta_u)$ in (6.3) is locally bijective. Since $\{g_i\}$ and
$\{h_i\}$ are continuous, we can select $\kappa > 0$ for (6.3) to be locally bijective for every
$-\kappa < c < \kappa$. □

6.3.6 Summary of Estimator Construction

We summarize how to define a second-order efficient algebraic estimator (vector
version) and how to compute an algebraic version from it.

Input:
• a potential function $\psi$ satisfying (C2),
• polynomial equations of $\delta$, $u$ and $v$ satisfying (C3),
• $m_1, \ldots, m_{d-p} \in \mathbb{R}[\delta]$ such that $V_E = V(m_1, \ldots, m_{d-p})$ gives the model,
• $f_j \in \mathbb{R}[u][v]_{\ge 3}$ and $c \in \mathbb{R}$ for a vector version.

Step 1 Compute $\psi$ and $\alpha(\delta)$, $G(\delta)$ (and $\Gamma^{(m)}(\delta)$ for bias correction).
Step 2 Compute $f_{aj} \in \mathbb{R}[\delta][\nu_{11}, \ldots, \nu_{pd}]_1$ s.t. $f_{aj}(\nu_{11}, \ldots, \nu_{pd}) := \partial_{u^a} m_j$ for $\nu_{bi} := \partial_{u^b} \delta_i$.
Step 3 Find $e_{p+1}, \ldots, e_d \in (\nabla_u \delta)^{\perp_{\bar{G}}}$ by eliminating $\{\nu_{aj}\}$ from
$\langle e_i, \partial_{u^a} \delta \rangle_{\bar{G}} = e_{ik}(\delta)\, g^{kj}(\delta)\, \nu_{aj} = 0$ and $f_{aj}(\nu_{11}, \ldots, \nu_{pd}) = 0$.
Step 4 Select $e_1, \ldots, e_p \in \mathbb{R}[\delta]$ s.t. $e_1(\delta), \ldots, e_d(\delta)$ are linearly
independent.
Step 5 Eliminate $v$ from

$$X = \delta(u, 0) + \sum_{i=p+1}^{d} v_{i-p}\, e_i + c \cdot \sum_{j=1}^{p} f_j(u, v)\, e_j$$

and compute $(X - \delta)^\top \tilde{e}_j$ and $h \in (\mathbb{R}[\delta][X - \delta]_3)^p$, given by Theorem 2.

Output (Vector version):

$$X = \delta(u, 0) + \sum_{i=p+1}^{d} v_{i-p}\, e_i(\delta) + c \cdot \sum_{j=1}^{p} f_j(u, v)\, e_j(\delta).$$

Output (Algebraic version):

$$(X - \delta)^\top \tilde{e} + c \cdot h(X - \delta) = 0.$$

6.3.7 Reduction of the Degree of the Estimating Equations

As we noted in Sect. 6.3.4, if we set $h_j = 0$ for all $j$, the estimator becomes the MLE.
In this sense, $c\, h_j$ can be recognized as a perturbation from the likelihood equations.
If we select each $h_j(X, u, \delta_u, t) \in \mathbb{R}[X, u, \delta_u][t]_3$ carefully, we can reduce the
degree of the polynomial estimating equation. For algebraic background, the reader
is referred to Appendix A.
Here, we assume $u \in \mathbb{R}[\delta_u]$. For example, we can set $u_i = \delta_i$. Then $\tilde{e}_j(u, \delta_u)$ is
a function of $\delta_u$, so we write it as $\tilde{e}_j(\delta)$. Define an ideal $I_3$ of $\mathbb{R}[X, \delta]$ as

$$I_3 := \langle \{(X_i - \delta_i)(X_j - \delta_j)(X_k - \delta_k) \mid 1 \le i, j, k \le d\} \rangle.$$

Select a monomial order $\prec$ and set $\delta_1 \succ \cdots \succ \delta_d \succ X_1 \succ \cdots \succ X_d$. Let
$G_\prec = \{g_1, \ldots, g_m\}$ be a Gröbner basis of $I_3$ with respect to $\prec$. Then the remainder
(normal form) $r_j$ of $(X - \delta)^\top \tilde{e}_j(\delta)$, the first term of the l.h.s. of (6.3), with respect
to $G_\prec$, is uniquely determined for each $j$.

Theorem 4 If the monomial order $\prec$ is the pure lexicographic,
1. $r_j$ for $j = 1, \ldots, p$ has degree at most 2 with respect to $\delta$, and
2. $r_j = 0$ for $j = 1, \ldots, p$ are the estimating equations for a second-order efficient
estimator.

Proof Assume $r_j$ has a monomial term whose degree is more than 2 with respect
to $\delta$ and represent the term as $\delta_a \delta_b \delta_c\, q(\delta, X)$ with a polynomial $q \in \mathbb{R}[\delta, X]$ and a
combination of indices $a, b, c$. Then $\{\delta_a \delta_b \delta_c + (X_a - \delta_a)(X_b - \delta_b)(X_c - \delta_c)\}\, q(\delta, X)$
has a smaller monomial order than $\delta_a \delta_b \delta_c\, q(\delta, X)$ since $\prec$ is pure lexicographic
satisfying $\delta_1 \succ \cdots \succ \delta_d \succ X_1 \succ \cdots \succ X_d$. Therefore by adding
$(X_a - \delta_a)(X_b - \delta_b)(X_c - \delta_c)\, q(\delta, X) \in I_3$ to $r_j$, the polynomial degree decreases.
This contradicts the fact that $r_j$ is the normal form, so each $r_j$ has degree at most 2.
Furthermore each polynomial in $I_3$ is in $\mathbb{R}[X, u, \delta_u][X - \delta]_3$ and therefore by
taking the normal form, the condition for the algebraic version (6.3) of second-order
efficiency still holds. □

The reduction of the degree is important when we use algebraic algorithms such
as homotopy continuation methods [18] to solve simultaneous polynomial equations
since computational cost depends highly on the degree of the polynomials.
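
To make the reduction step concrete, here is a minimal sketch for the one-dimensional case $d = 1$, where $I_3$ is generated by $(X - \delta)^3$ alone and the normal form can be computed by repeated rewriting rather than by a general Gröbner basis routine. The dictionary representation and the function name `reduce_mod_cube` are choices made here for illustration; the paper itself relies on a computer algebra system.

```python
from collections import defaultdict

def reduce_mod_cube(poly):
    """Normal form of a polynomial in (delta, X) modulo (X - delta)^3, for d = 1.

    poly maps (i, j) -> coefficient of delta^i * X^j. With delta ordered above X,
    the leading term of (X - delta)^3 is -delta^3, so delta^3 is rewritten as
    3*X*delta^2 - 3*X^2*delta + X^3 until no power of delta above 2 remains.
    """
    result = defaultdict(int)
    work = dict(poly)
    while work:
        (i, j), c = work.popitem()
        if c == 0:
            continue
        if i < 3:
            result[(i, j)] += c
            continue
        # delta^i X^j = delta^(i-3) X^j * delta^3, then substitute for delta^3.
        for di, dj, k in ((2, 1, 3), (1, 2, -3), (0, 3, 1)):
            key = (i - 3 + di, j + dj)
            work[key] = work.get(key, 0) + c * k
    return {key: value for key, value in result.items() if value != 0}

cubic = reduce_mod_cube({(3, 0): 1})    # delta^3
quartic = reduce_mod_cube({(4, 0): 1})  # delta^4
```

Both results have degree at most 2 in $\delta$, as Theorem 4 states; for instance $\delta^4$ reduces to $6X^2\delta^2 - 8X^3\delta + 3X^4$, which agrees with $\delta^4$ up to second order at $\delta = X$.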

6.4 First-Order Efficiency

It is not surprising that, for first-order efficiency, almost the same arguments hold as
for second-order efficiency.
By Theorem 5.2 of [4], a consistent estimator is first-order efficient if and only if

$$g_{\psi a} = 0. \qquad (6.4)$$

Consider an algebraic estimator $\delta(u, v) \in \mathbb{R}[u, v]^d$ satisfying the following vector
equation:

$$X = \delta(u, 0) + \sum_{i=p+1}^{d} v_{i-p}\, e_i(u) + c \cdot \sum_{j=1}^{p} f_j(u, v)\, e_j(u) \qquad (6.5)$$

where, for each $u$, $\{e_j(u);\ j = 1, \ldots, p\} \cup \{e_i(u);\ i = p + 1, \ldots, d\}$ is a complete
basis of $\mathbb{R}^d$ s.t. $\langle e_i(u), \nabla_u \delta \rangle_{\bar{G}} = 0$ for $i = p + 1, \ldots, d$ and $f_j(u, v) \in \mathbb{R}[u][v]_{\ge 2}$, a polynomial whose
degree in $v$ is at least 2, for $j = 1, \ldots, p$. Similarly, $c \in \mathbb{R}$ is a constant controlling the
perturbation. Here, the only difference between (6.2) for the second-order efficiency
and (6.5) for the first-order efficiency is the degree of the $f_j(u, v)$ with respect to $v$.
The algebraic version of the first-order efficient algebraic estimator is defined by
the following simultaneous polynomial equalities with $\delta_u = \delta(u, 0)$:

$$(X - \delta_u)^\top \tilde{e}_j(u, \delta_u) + c \cdot h_j(X, u, \delta_u, X - \delta_u) = 0 \quad \text{for } j = 1, \ldots, p \qquad (6.6)$$

where $\{\tilde{e}_j(u, \delta_u) \in \mathbb{R}[u, \delta_u]^d;\ j = 1, \ldots, p\}$ span $((\nabla_u \delta(u, 0))^{\perp_{\bar{G}}})^{\perp_E}$ for every $u$
and $h_j(X, u, \delta_u, t) \in \mathbb{R}[X, u, \delta_u][t]_2$ (degree = 2 w.r.t. $t$) for $j = 1, \ldots, p$. Here,
the only difference between (6.3) for the second-order efficiency and (6.6) for the
first-order efficiency is the degree of the $h_j(X, u, \delta_u, t)$ with respect to $t$.
Then the relation between the three different forms of first-order efficiency can
be proved in the same manner as for Theorems 1, 2 and 3.
Theorem 5 (i) Vector version (6.5) satisfies the first-order efficiency.
(ii) An estimator defined by a vector version (6.5) of the first-order efficient estimators
is also represented by an algebraic version (6.6).
(iii) Every algebraic version (6.6) gives a first-order efficient estimator.
The relationship between the three forms of the first-order efficient algebraic estimators
is summarized as $(6.4) \Leftarrow (6.5) \Rightarrow (6.6) \Rightarrow (6.4)$. Furthermore, if we assume the
estimator has a form $\delta \in \mathbb{R}(u)[v]$, the forms (6.4), (6.5) and (6.6) are equivalent.
Let $R := \mathbb{Z}[X, \delta]$ and define

$$I_2 := \langle \{(X_i - \delta_i)(X_j - \delta_j) \mid 1 \le i, j \le d\} \rangle$$

as an ideal of $R$. In a similar manner, let $\prec$ be a monomial order such that $\delta_1 \succ \cdots \succ
\delta_d \succ X_1 \succ \cdots \succ X_d$. Let $G_\prec = \{g_1, \ldots, g_m\}$ be a Gröbner basis of $I_2$ with respect
to $\prec$. The properties of the normal form $r_i$ of $(X - \delta(u, 0))^\top \tilde{e}_i(u)$ with respect to
$G_\prec$ are then covered by the following:
Theorem 6 If the monomial order $\prec$ is the pure lexicographic,
(i) $r_i$ for $i = 1, \ldots, d$ has degree at most 1 with respect to $\delta$, and
(ii) $r_i = 0$ for $i = 1, \ldots, d$ are the estimating equations for a first-order efficient
estimator.

6.5 Examples

In this section, we show how to use the algebraic computation to design asymptoti-
cally efficient estimators for two simple examples. The examples satisfy the algebraic
conditions (C1), (C2) and (C3) so it is verified that necessary geometric entities have
an algebraic form as mentioned in Sect. 6.3.2.

6.5.1 Example: Periodic Gaussian Model

The following periodic Gaussian model shows how to compute second-order efficient
estimators and their biases.
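• Statistical Model:

$$X \sim N(\mu, \Sigma(a)) \quad \text{with} \quad \mu = \begin{pmatrix} 0 \\ 0 \\ 0 \\ 0 \end{pmatrix} \quad \text{and} \quad \Sigma(a) = \begin{pmatrix} 1 & a & a^2 & a \\ a & 1 & a & a^2 \\ a^2 & a & 1 & a \\ a & a^2 & a & 1 \end{pmatrix} \quad \text{for } 0 \le a < 1.$$

Here, the dimension of the full exponential family and the curved exponential
family are $d = 3$ and $p = 1$, respectively.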
• Curved exponential family:

$$\log f(x \mid \alpha) = 2(x_1 x_2 + x_2 x_3 + x_3 x_4 + x_4 x_1)\, \alpha_2 + 2(x_3 x_1 + x_4 x_2)\, \alpha_3 - \psi(\alpha),$$

• Potential function:

$$\psi(\alpha) = -\tfrac{1}{2} \log(\alpha_1^4 - 4\alpha_1^2 \alpha_2^2 + 8\alpha_1 \alpha_2^2 \alpha_3 - 2\alpha_1^2 \alpha_3^2 - 4\alpha_2^2 \alpha_3^2 + \alpha_3^4) + 2 \log(2\pi),$$

• Natural parameter:

$$\alpha(a) = \left[ \frac{1}{1 - 2a^2 + a^4},\ -\frac{a}{1 - 2a^2 + a^4},\ \frac{a^2}{1 - 2a^2 + a^4} \right]^\top,$$

• Expectation parameter: $\delta(a) = [-2, -4a, -2a^2]^\top$,


• Fisher metric with respect to $\delta$:

$$(g^{ij}) = \begin{pmatrix} 2a^4 + 4a^2 + 2 & 8a(1 + a^2) & 8a^2 \\ 8a(1 + a^2) & 4a^4 + 24a^2 + 4 & 8a(1 + a^2) \\ 8a^2 & 8a(1 + a^2) & 2a^4 + 4a^2 + 2 \end{pmatrix},$$

• A set of vectors $e_i \in \mathbb{R}^3$:

$$e_0(a) := [0, -1, a]^\top \in \partial_a \delta(a),$$

$$e_1(a) := [3a^2 + 1, 4a, 0]^\top, \quad e_2(a) := [-a^2 - 1, 0, 2]^\top \in (\partial_a \delta(a))^{\perp_{\bar{G}}}.$$

• A vector version of the second-order efficient estimator is, for example,

$$x - \delta + v_1 \cdot e_1 + v_2 \cdot e_2 + c \cdot v_1^3 \cdot e_0 = 0.$$

• A corresponding algebraic version of the second-order efficient estimator: by eliminating
$v_1$ and $v_2$, we get $g(a) + c \cdot h(a) = 0$ where

$$g(a) := 8(a - 1)^2 (a + 1)^2 (1 + 2a^2)^2 (4a^5 - 8a^3 + 2a^3 x_3 - 3x_2 a^2 + 4a + 4a x_1 + 2a x_3 - x_2)$$

and

$$h(a) := (2a^4 + a^3 x_2 - a^2 x_3 + 2a^2 + a x_2 - 2x_1 - x_3 - 4)^3.$$

• An estimating equation for MLE:

$$4a^5 - 8a^3 + 2a^3 x_3 - 3x_2 a^2 + 4a + 4a x_1 + 2a x_3 - x_2 = 0.$$

• Bias correction term for an estimator $\hat{a}$: $\hat{a}(\hat{a}^8 - 4\hat{a}^6 + 6\hat{a}^4 - 4\hat{a}^2 + 1)/(1 + 2\hat{a}^2)^2$.
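
Two of the statements in this example can be checked with exact rational arithmetic. The sketch below assumes the natural parameter $\alpha(a) = [1, -a, a^2]^\top / (1 - a^2)^2$ (the denominator $1 - 2a^2 + a^4$ written in factored form) and uses the fact that $\bar{G}\, \partial_a \delta = \partial_a \alpha$, so that orthogonality to $\partial_a \delta$ with respect to $\bar{G}$ is Euclidean orthogonality to $\partial_a \alpha$. The function names and the test point are hypothetical choices made here.

```python
from fractions import Fraction

def natural_parameter_derivative(a):
    """d alpha / d a for alpha(a) = [1, -a, a^2] / (1 - a^2)^2, by the quotient rule."""
    D = (1 - a * a) ** 2
    dD = -4 * a * (1 - a * a)
    numer = [Fraction(1), -a, a * a]
    dnumer = [Fraction(0), Fraction(-1), 2 * a]
    return [(dn * D - n * dD) / (D * D) for n, dn in zip(numer, dnumer)]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def check(a, x):
    """Return (e1 . dalpha, e2 . dalpha, score - stated MLE polynomial); all should be 0."""
    delta = [Fraction(-2), -4 * a, -2 * a * a]
    e1 = [3 * a * a + 1, 4 * a, Fraction(0)]
    e2 = [-a * a - 1, Fraction(0), Fraction(2)]
    dalpha = natural_parameter_derivative(a)
    # Likelihood equation (x - delta)^T d alpha/da = 0, cleared of its denominator.
    residual = [xi - di for xi, di in zip(x, delta)]
    score = dot(residual, dalpha) * (1 - a * a) ** 3
    x1, x2, x3 = x
    stated = (4 * a**5 - 8 * a**3 + 2 * a**3 * x3 - 3 * x2 * a**2
              + 4 * a + 4 * a * x1 + 2 * a * x3 - x2)
    return dot(e1, dalpha), dot(e2, dalpha), score - stated

result = check(Fraction(1, 3), [Fraction(-3, 2), Fraction(1, 5), Fraction(-7, 4)])
```

All three returned quantities vanish identically in $a$ and $x$, so any rational test point with $a^2 \ne 1$ can be used.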

6.5.2 Example: Log Marginal Model

Here, we consider a log marginal model. See [7] for more on marginal models.
• Statistical model (Poisson regression):
$X_{ij} \overset{\text{i.i.d.}}{\sim} \mathrm{Po}(N p_{ij})$ s.t. $p_{ij} \in (0, 1)$ for $i = 1, 2$ and $j = 1, 2, 3$ with model constraints:

$$p_{11} + p_{12} + p_{13} + p_{21} + p_{22} + p_{23} = 1,$$

$$p_{11} + p_{12} + p_{13} = p_{21} + p_{22} + p_{23},$$

$$\frac{p_{11} / p_{21}}{p_{12} / p_{22}} = \frac{p_{12} / p_{22}}{p_{13} / p_{23}}. \qquad (6.7)$$

Condition (6.7) can appear in a statistical test of whether acceleration of the ratio
$p_{1j} / p_{2j}$ is constant.

In this case, $d = 6$ and $p = 3$.


• Log density w.r.t. the point mass measure on $\mathbb{Z}^6_{\ge 0}$:

$$\log f(x \mid p) = \log \Bigl\{ \prod_{ij} e^{-N p_{ij}} (N p_{ij})^{X_{ij}} \Bigr\} = -N + \sum_{ij} X_{ij} \log(N p_{ij}).$$

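• The full exponential family is given by

$$\begin{pmatrix} X_1 & X_2 & X_3 \\ X_4 & X_5 & X_6 \end{pmatrix} := \begin{pmatrix} X_{11} & X_{12} & X_{13} \\ X_{21} & X_{22} & X_{23} \end{pmatrix}, \quad \begin{pmatrix} \delta_1 & \delta_2 & \delta_3 \\ \delta_4 & \delta_5 & \delta_6 \end{pmatrix} = N \begin{pmatrix} p_{11} & p_{12} & p_{13} \\ p_{21} & p_{22} & p_{23} \end{pmatrix},$$

$\alpha^i = \log(\delta_i)$ and $\psi(\alpha) = N$.

• The Fisher metric w.r.t. $\alpha$: $g_{ij} = \dfrac{\partial^2 \psi}{\partial \alpha^i \partial \alpha^j} = \kappa_{ij} \delta_i$, where $\kappa_{ij}$ is the Kronecker delta.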
• Selection of the model parameters:

[u 1 , u 2 , u 3 ] := [δ1 , δ3 , δ5 ] and [v1 , v2 , v3 ] := [δ2 , δ4 , δ6 ].

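• A set of vectors $e_i \in \mathbb{R}^6$:

$$e_0 := \begin{pmatrix} \delta_2^2 (\delta_4 - \delta_6) \\ -\delta_2^2 (\delta_4 - \delta_6) \\ 0 \\ -\delta_3 \delta_5^2 - 2 \delta_2 \delta_4 \delta_6 \\ 0 \\ \delta_3 \delta_5^2 + 2 \delta_2 \delta_4 \delta_6 \end{pmatrix} \in (\nabla_u \delta),$$

$$[e_1, e_2, e_3] := \left[ \begin{pmatrix} \delta_1 \\ \delta_2 \\ \delta_3 \\ 0 \\ 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \delta_1 (-\delta_1 \delta_5^2 + \delta_3 \delta_5^2) \\ \delta_2 (-\delta_1 \delta_5^2 - 2 \delta_2 \delta_4 \delta_6) \\ 0 \\ \delta_4 (\delta_2^2 \delta_4 - \delta_2^2 \delta_6) \\ \delta_5 (\delta_2^2 \delta_4 + 2 \delta_1 \delta_3 \delta_5) \\ 0 \end{pmatrix}, \begin{pmatrix} \delta_1 (\delta_1 \delta_5^2 - \delta_3 \delta_5^2) \\ \delta_2 (\delta_1 \delta_5^2 + 2 \delta_2 \delta_4 \delta_6) \\ 0 \\ \delta_4 (2 \delta_1 \delta_3 \delta_5 + \delta_2^2 \delta_6) \\ 0 \\ \delta_6 (\delta_2^2 \delta_4 + 2 \delta_1 \delta_3 \delta_5) \end{pmatrix} \right] \in ((\nabla_u \delta)^{\perp_{\bar{G}}})^3$$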

• A vector version of the second-order efficient estimator is, for example,

X − δ + v1 · e1 + v2 · e2 + v3 · e3 + c · v13 · e0 = 0.

• The bias correction term of the estimator = 0.


• A set of estimating equations for MLE:

{x1 δ2 2 δ4 2 δ6 − x1 δ2 2 δ4 δ6 2 − x2 δ1 δ2 δ4 2 δ6 + x2 δ1 δ2 δ4 δ6 2 − 2 x4 δ1 δ2 δ4 δ6 2 −
x4 δ1 δ3 δ5 2 δ6 + 2 x6 δ1 δ2 δ4 2 δ6 + x6 δ1 δ3 δ4 δ5 2 ,
−x2 δ2 δ3 δ4 2 δ6 + x2 δ2 δ3 δ4 δ6 2 + x3 δ2 2 δ4 2 δ6 − x3 δ2 2 δ4 δ6 2 − x4 δ1 δ3 δ5 2 δ6 −
2 x4 δ2 δ3 δ4 δ6 2 + x6 δ1 δ3 δ4 δ5 2 + 2 x6 δ2 δ3 δ4 2 δ6 ,
−2 x4 δ1 δ3 δ5 2 δ6 − x4 δ2 2 δ4 δ5 δ6 + x5 δ2 2 δ4 2 δ6 − x5 δ2 2 δ4 δ6 2 + 2 x6 δ1 δ3 δ4 δ5 2 +
x6 δ2 2 δ4 δ5 δ6 ,
δ1 δ3 δ5 2 − δ2 2 δ4 δ6 , δ1 + δ2 + δ3 − δ4 − δ5 − δ6 , −δ1 − δ2 − δ3 − δ4 − δ5 − δ6 + 1}

The total degree of the equations is 5 × 5 × 5 × 4 × 1 × 1 = 500.


• A set of estimating equations for a 2nd-order efficient estimator with degree at
most 2:

{−3 x1 x2 x4 2 x6 δ2 +6 x1 x2 x4 2 x6 δ6 + x1 x2 x4 2 δ2 δ6 −2 x1 x2 x4 2 δ6 2 +3 x1 x2 x4 x6 2 δ2 −
6 x1 x2 x4 x6 2 δ4 + 2 x1 x2 x4 x6 δ2 δ4 − 2 x1 x2 x4 x6 δ2 δ6 − x1 x2 x6 2 δ2 δ4 + 2 x1 x2 x6 2 δ4 2 +
3 x1 x3 x4 x5 2 δ6 − 2 x1 x3 x4 x5 δ5 δ6 − 3 x1 x3 x5 2 x6 δ4 + 2 x1 x3 x5 x6 δ4 δ5 + x1 x4 2 x6 δ2 2 −
2 x1 x4 2 x6 δ2 δ6 − x1 x4 x5 2 δ3 δ6 − x1 x4 x6 2 δ2 2 + 2 x1 x4 x6 2 δ2 δ4 + x1 x5 2 x6 δ3 δ4 +
3 x2 2 x4 2 x6 δ1 − x2 2 x4 2 δ1 δ6 − 3 x2 2 x4 x6 2 δ1 − 2 x2 2 x4 x6 δ1 δ4 + 2 x2 2 x4 x6 δ1 δ6 +
x2 2 x6 2 δ1 δ4 − x2 x4 2 x6 δ1 δ2 − 2 x2 x4 2 x6 δ1 δ6 + x2 x4 x6 2 δ1 δ2 + 2 x2 x4 x6 2 δ1 δ4 −
x3 x4 x5 2 δ1 δ6 + x3 x5 2 x6 δ1 δ4 ,
3 x1 x3 x4 x5 2 δ6 −2 x1 x3 x4 x5 δ5 δ6 −3 x1 x3 x5 2 x6 δ4 +2 x1 x3 x5 x6 δ4 δ5 − x1 x4 x5 2 δ3 δ6 +
x1 x5 2 x6 δ3 δ4 + 3 x2 2 x4 2 x6 δ3 − x2 2 x4 2 δ3 δ6 − 3 x2 2 x4 x6 2 δ3 − 2 x2 2 x4 x6 δ3 δ4 +
2 x2 2 x4 x6 δ3 δ6 + x2 2 x6 2 δ3 δ4 − 3 x2 x3 x4 2 x6 δ2 + 6 x2 x3 x4 2 x6 δ6 + x2 x3 x4 2 δ2 δ6 −
2 x2 x3 x4 2 δ6 2 +3 x2 x3 x4 x6 2 δ2 −6 x2 x3 x4 x6 2 δ4 +2 x2 x3 x4 x6 δ2 δ4 −2 x2 x3 x4 x6 δ2 δ6 −
x2 x3 x6 2 δ2 δ4 + 2 x2 x3 x6 2 δ4 2 − x2 x4 2 x6 δ2 δ3 − 2 x2 x4 2 x6 δ3 δ6 + x2 x4 x6 2 δ2 δ3 +
2 x2 x4 x6 2 δ3 δ4 +x3 x4 2 x6 δ2 2 −2 x3 x4 2 x6 δ2 δ6 −x3 x4 x5 2 δ1 δ6 −x3 x4 x6 2 δ2 2 +2 x3 x4 x6 2
δ2 δ4 + x3 x5 2 x6 δ1 δ4 ,
6 x1 x3 x4 x5 2 δ6 −4 x1 x3 x4 x5 δ5 δ6 −6 x1 x3 x5 2 x6 δ4 +4 x1 x3 x5 x6 δ4 δ5 −2 x1 x4 x5 2 δ3 δ6 +
2 x1 x5 2 x6 δ3 δ4 + 3 x2 2 x4 2 x6 δ5 − x2 2 x4 2 δ5 δ6 − 3 x2 2 x4 x5 x6 δ4 + 3 x2 2 x4 x5 x6 δ6 +
x2 2 x4 x5 δ4 δ6 − x2 2 x4 x5 δ6 2 − 3 x2 2 x4 x6 2 δ5 − x2 2 x4 x6 δ4 δ5 + x2 2 x4 x6 δ5 δ6 + x2 2 x5
x6 δ4 2 − x2 2 x5 x6 δ4 δ6 + x2 2 x6 2 δ4 δ5 − 2 x2 x4 2 x6 δ2 δ5 + 2 x2 x4 x5 x6 δ2 δ4 − 2 x2 x4 x5 x6
δ2 δ6 + 2 x2 x4 x6 2 δ2 δ5 − 2 x3 x4 x5 2 δ1 δ6 + 2 x3 x5 2 x6 δ1 δ4 ,
δ1 δ3 δ5 2 − δ2 2 δ4 δ6 , δ1 + δ2 + δ3 − δ4 − δ5 − δ6 , −δ1 − δ2 − δ3 − δ4 − δ5 − δ6 + 1}.

The total degree of the polynomial equations is 32.


• A set of estimating equations for a first-order-efficient estimator with degree at
most 1:
{−x5 2 x4 δ6 x1 x3 +x5 2 x6 δ4 x1 x3 +2 x6 2 δ4 x1 x2 x4 −2 x4 2 δ6 x1 x2 x6 −x6 2 x1 x2 δ2 x4 +
x4 2 x1 x2 δ2 x6 + x2 2 x6 2 δ1 x4 − x4 2 x2 2 δ1 x6 ,
−x5 2 x4 δ6 x1 x3 + x5 2 x6 δ4 x1 x3 +2 x6 2 δ4 x2 x3 x4 −2 x4 2 δ6 x2 x3 x6 − x6 2 x2 x3 δ2 x4 +
x4 2 x2 x3 δ2 x6 + x2 2 x6 2 δ3 x4 − x4 2 x2 2 δ3 x6 ,
−2 x5 2 x4 δ6 x1 x3 + 2 x5 2 x6 δ4 x1 x3 − x4 x6 x5 x2 2 δ6 + x4 x5 x2 2 δ4 x6 − x4 2 x2 2 δ5 x6 +
x4 x6 2 x2 2 δ5 ,
δ1 δ3 δ5 2 −δ2 2 δ4 δ6 , δ1 +δ2 +δ3 −δ4 −δ5 −δ6 , −δ1 −δ2 −δ3 −δ4 −δ5 −δ6 +6}.

Table 6.1 Computational time for each estimate by the homotopy continuation methods

Algorithm            Estimator  #Paths  Running time (s) (avg. ± std.)
Linear homotopy      MLE        500     1.137 ± 0.073
Linear homotopy      2nd eff.   32      0.150 ± 0.047
Polyhedral homotopy  MLE        64      0.267 ± 0.035
Polyhedral homotopy  2nd eff.   24      0.119 ± 0.027

The estimating equations for a second-order-efficient estimator above look much
more complicated than the estimating equations for the MLE, but each term of the
first three polynomials has degree at most 2 in $\delta$. Thanks to this degree reduction, the
computational costs for the estimates become much smaller, as we will see in Sect. 6.6.
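
The total degrees quoted for these systems are products of the degrees in $\delta$ of the individual equations. A minimal sketch of that arithmetic follows; the degree lists for the MLE and second-order systems are read off from the text, while the list for the first-order system (degree at most 1 in the first three equations) and the resulting value are worked out here and are not stated in the paper.

```python
from math import prod

def bezout_bound(degrees):
    """Product of the equation degrees: an upper bound on the number of isolated roots."""
    return prod(degrees)

mle = bezout_bound([5, 5, 5, 4, 1, 1])           # three likelihood equations + constraints
second_order = bezout_bound([2, 2, 2, 4, 1, 1])  # degree-reduced second-order system
first_order = bezout_bound([1, 1, 1, 4, 1, 1])   # assumed degrees for the first-order system
```

The first two values reproduce the total degrees 500 and 32 given above, and the same numbers appear as path counts for the linear homotopy in Table 6.1.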

6.6 Computation

To obtain estimates based on the method of this paper, we need fast algorithms to
find the solutions of polynomial equations. The authors have carried out computations
using the homotopy continuation method (matlab program HOM4PS2 by Lee, Li and
Tsai [18]) for the log marginal model in Sect. 6.5.2 and data $\bar{X} = (1, 1, 1, 1, 1, 1)$.
The run time to compute each estimate on a standard laptop (Intel(R) Core (TM)
i7-2670QM CPU, 2.20 GHz, 4.00 GB memory) is given in Table 6.1. The computation
is repeated 10 times and the averages and the standard deviations are displayed.
Note that the speed-up for the second-order efficient estimators is due to
the degree reduction technique. The term “path” in the table heading refers to a primitive
iteration step within the homotopy method. In the faster polyhedral version,
the solution region is subdivided into polyhedral domains.
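
For readers unfamiliar with the technique, the following toy sketch shows the idea of a linear homotopy for a single polynomial in one variable. It is not the HOM4PS2 algorithm: it deforms the start system $x^n - 1 = 0$ into the target $f(x) = 0$ along $H(x, t) = (1 - t)\gamma(x^n - 1) + t f(x)$ and follows each start root with Newton corrections only. The function name, the fixed complex constant `gamma`, and the step counts are assumptions made for this illustration.

```python
import cmath

def track_roots(f, df, degree, gamma=0.6 + 0.8j, steps=200, newton_iters=5):
    """Follow the roots of x^degree - 1 to roots of f along
    H(x, t) = (1 - t) * gamma * (x^degree - 1) + t * f(x), for t from 0 to 1."""
    starts = [cmath.exp(2j * cmath.pi * k / degree) for k in range(degree)]
    roots = []
    for x in starts:
        for s in range(1, steps + 1):
            t = s / steps
            # Newton corrector on H(., t), started from the root at the previous t.
            for _ in range(newton_iters):
                h = (1 - t) * gamma * (x**degree - 1) + t * f(x)
                dh = (1 - t) * gamma * degree * x**(degree - 1) + t * df(x)
                x = x - h / dh
        roots.append(x)
    return roots

# Target: x^2 - 3x + 2 = (x - 1)(x - 2); one path per start root.
roots = track_roots(lambda x: x * x - 3 * x + 2, lambda x: 2 * x - 3, degree=2)
```

Each start root gives one path to follow, so the amount of work grows with the number of start roots, which for a linear homotopy of this kind is the product of the degrees. This is the sense in which reducing the degree of the estimating equations reduces the cost.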
Figure 6.3 shows the mean squared error and the computational time of the MLE,
the first-order efficient estimator and the second-order efficient estimator of Sect. 6.5.2. The
true parameter is set to $\delta^* = (1/6, 1/4, 1/12, 1/12, 1/4, 1/6)$, a point in the model
manifold, and $N$ random samples are generated i.i.d. from the distribution with this
parameter. The computation is repeated for exponentially increasing sample sizes
$N = 1, \ldots, 10^5$. In general, there are multiple roots for polynomial equations
and here we selected the root closest to the sample mean in the Euclidean norm.
Figure 6.3(a) also shows that the mean squared error is approximately the same for
the three estimators, but (b) shows that the computational time is much longer for the
MLE.
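
Two details of this experiment are easy to reproduce exactly: that the chosen $\delta^*$ lies on the model manifold, and the rule for picking one root among several. The sketch below takes the three model constraints from the last line of the estimating equations in Sect. 6.5.2, with the normalization constant 1; the helper names `model_constraints` and `closest_root` are hypothetical.

```python
from fractions import Fraction as F

delta_star = [F(1, 6), F(1, 4), F(1, 12), F(1, 12), F(1, 4), F(1, 6)]

def model_constraints(d):
    """The three polynomials that vanish on the log marginal model."""
    d1, d2, d3, d4, d5, d6 = d
    return [
        d1 * d3 * d5**2 - d2**2 * d4 * d6,   # condition (6.7) cleared of denominators
        d1 + d2 + d3 - d4 - d5 - d6,         # equal row sums
        -d1 - d2 - d3 - d4 - d5 - d6 + 1,    # total mass one
    ]

def closest_root(roots, target):
    """Pick the root with the smallest Euclidean distance to the target point."""
    return min(roots, key=lambda r: sum((ri - ti) ** 2 for ri, ti in zip(r, target)))

residuals = model_constraints(delta_star)
picked = closest_root([[0.0, 0.0], [1.0, 1.0]], [0.9, 0.8])
```

All three residuals are exactly zero for the stated $\delta^*$, and the selection rule is a plain minimum over squared distances.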

Fig. 6.3 The mean squared error (a) and computation time (b) for each estimate by the homotopy continuation method

6.7 Discussion

In this paper we have concentrated on reduction of the polynomial degree of the
estimating equations and shown the benefits in computation of the solutions. We do
not expect the estimators to be closed form, such as a rational polynomial form in
the data. The most we can expect is that they are algebraic, that is, they are the solutions
of algebraic equations. They lie on a zero-dimensional algebraic variety. It is clear
that there is no escape from using mixed symbolic-numerical methods. In algebraic
statistics the number of solutions of the ML equations is called the ML degree. Given
that we have more general estimating equations than pure ML equations, this points
to an extended theory of “quasi” ML degree, or efficient estimator degree. The issue
of exactly which solution to use as our estimator persists. In the paper we suggest
taking the solution closest to the sufficient statistic in the Euclidean metric. We could
use other metrics and more theory is needed.
Here we have put forward estimating equations with reduced degree and shown the
benefits in terms of computation. But we could have used other criteria for choosing
the equations, while remaining in the efficient class. We might prefer to choose an
equation which reduces the bias further via decreasing the next order term. There
may thus be some trade off between degree and bias.
Beyond the limited ambitions of this paper to look at second-order efficiency lie
several other areas, notably hypothesis testing and model selection. But the question
is the same: to what extent can we bring the algebraic methods to bear, for example by
expressing additional differential forms and curvatures in algebraic terms? Although
estimation typically requires a mixture of symbolic and numeric methods, in some
cases only the computation of the efficient estimate requires numeric procedures and
the other computations can be carried out symbolically.

Acknowledgments This paper has benefited from conversations with and advice from a number of
colleagues. We should thank Satoshi Kuriki, Tomonari Sei, Wicher Bergsma and Wilfred Kendall.
The first author acknowledges support by JSPS KAKENHI Grant 20700258, 24700288 and the
second author acknowledges support from the Institute of Statistical Mathematics for two visits
in 2012 and 2013 and from UK EPSRC Grant EP/H007377/1. A first version of this paper was
delivered at the WOGAS3 meeting at the University of Warwick in 2011. We thank the sponsors.
The authors also thank the referees of the short version in GSI2013 and the referees of the first long
version of the paper for insightful suggestions.

A Normal Forms

A basic text for the material in this section is [9]. The rapid growth of modern
computational algebra can be credited to the celebrated Buchberger's algorithm [8].
A monomial ideal $I$ in a polynomial ring $K[x_1, \ldots, x_n]$ over a field $K$ is an ideal
for which there is a collection of monomials $f_1, \ldots, f_m$ such that any $g \in I$ can be
expressed as a sum

$$g = \sum_{i=1}^{m} g_i(x) f_i(x)$$

with some polynomials $g_i \in K[x_1, \ldots, x_n]$. We can appeal to the representation of
a monomial $x^\beta = x_1^{\beta_1} \cdots x_n^{\beta_n}$ by its exponent $\beta = (\beta_1, \ldots, \beta_n)$. If $\gamma \geq 0$ is another
exponent then

$$x^\beta x^\gamma = x^{\beta+\gamma},$$

and $\beta + \gamma$ is in the positive (shorthand for non-negative) “orthant” with corner at
$\beta$. The set of all monomials in a monomial ideal is the union of all positive orthants
whose “corners” are given by the exponent vectors of the generating monomials
$f_1, \ldots, f_m$. A monomial ordering, written $x^\beta \prec x^\gamma$, is a total (linear) ordering on
monomials such that for $\xi \geq 0$, $x^\beta \prec x^\gamma \Rightarrow x^{\beta+\xi} \prec x^{\gamma+\xi}$. Any polynomial $f(x)$
has a leading term with respect to $\prec$, written $LT(f)$.
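The orthant description of a monomial ideal translates directly into a membership test on exponent vectors. The following is a minimal sketch, not taken from the paper: the function name `in_monomial_ideal` and the example generators $x^2 y$ and $y^3$ are illustrative assumptions.

```python
def in_monomial_ideal(beta, generators):
    """Return True if the monomial x^beta lies in the monomial ideal.

    `beta` is an exponent vector and `generators` is a list of exponent
    vectors of the generating monomials (names chosen for illustration).
    x^beta is in the ideal iff beta lies in the non-negative orthant
    whose corner is one of the generator exponents.
    """
    return any(
        all(b >= g for b, g in zip(beta, corner))
        for corner in generators
    )

# Hypothetical example: the ideal generated by x^2*y and y^3 in K[x, y].
generators = [(2, 1), (0, 3)]
inside = in_monomial_ideal((3, 1), generators)    # x^3*y = x * (x^2*y)
outside = in_monomial_ideal((1, 2), generators)   # x*y^2 is in no orthant
```

Here `inside` is true because $(3, 1) \geq (2, 1)$ componentwise, while `outside` is false because $(1, 2)$ dominates neither corner.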
There are, in general, many ways to express a given ideal $I$ as being generated
from a basis, $I = \langle f_1, \ldots, f_m \rangle$. That is to say, there are many choices of basis. Given
an ideal $I$, a set $\{g_1, \ldots, g_m\} \subset I$ is called a Gröbner basis (G-basis) if:

$$\langle LT(g_1), \ldots, LT(g_m) \rangle = \langle LT(I) \rangle,$$

where $\langle LT(I) \rangle$ is the ideal generated by the leading terms of all the polynomials in $I$.
We sometimes refer to $\langle LT(I) \rangle$ as the leading term ideal. Any ideal $I$ has a Gröbner basis and any
Gröbner basis in the ideal is a basis of the ideal.
Given a monomial ordering and an ideal expressed in terms of the G-basis, $I =
\langle g_1, \ldots, g_m \rangle$, any polynomial $f$ has a unique remainder with respect to the quotient
operation $K[x_1, \ldots, x_n]/I$. That is,

$$f = \sum_{i=1}^{m} s_i(x) g_i(x) + r(x).$$

We call the remainder $r(x)$ the normal form of $f$ with respect to $I$ and write $NF(f)$.
Or, to stress the fact that it may depend on $\prec$, we write $NF(f, \prec)$. Given an ideal $I$ and a monomial
ordering $\prec$, a polynomial $f = \sum_{\beta \in L} \alpha_\beta x^\beta$ for some $L$ is a normal form with respect
to $\prec$ if $x^\beta \notin \langle LT(I) \rangle$ for all $\beta \in L$. An equivalent way of saying this is: given an
ideal $I$ and a monomial ordering $\prec$, for every $f \in K[x_1, \ldots, x_n]$ there is a unique
normal form $NF(f)$ such that $f - NF(f) \in I$.
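The normal form can be computed by repeated division by the elements of a G-basis. The sketch below is an illustration under stated assumptions rather than the authors' implementation: polynomials are stored as dictionaries from exponent tuples to coefficients, the ordering is lexicographic with $x \succ y$ (which coincides with Python's tuple comparison), and the example basis $\{x - y^2,\ y^3 - 1\}$ is a hypothetical choice whose leading monomials $x$ and $y^3$ are coprime, so that it is a Gröbner basis.

```python
from fractions import Fraction

def leading(p):
    """Return (exponent, coefficient) of the leading term under lex order."""
    e = max(p)            # tuple comparison is the lex order with x > y
    return e, p[e]

def divides(a, b):
    """True if the monomial x^a divides the monomial x^b."""
    return all(i <= j for i, j in zip(a, b))

def add_term(p, e, c):
    """Add c * x^e to the polynomial p in place, dropping zero coefficients."""
    c = p.get(e, 0) + c
    if c == 0:
        p.pop(e, None)
    else:
        p[e] = c

def normal_form(f, basis):
    """Remainder of f on division by `basis`, assumed to be a Groebner basis."""
    p = {e: Fraction(c) for e, c in f.items()}
    r = {}
    while p:
        e, c = leading(p)
        for g in basis:
            ge, gc = leading(g)
            if divides(ge, e):
                # Cancel the leading term of p with a multiple of g.
                shift = tuple(i - j for i, j in zip(e, ge))
                factor = c / gc
                for he, hc in g.items():
                    shifted = tuple(i + j for i, j in zip(he, shift))
                    add_term(p, shifted, -factor * hc)
                break
        else:
            # No leading term of the basis divides: move the term to r.
            add_term(r, e, c)
            del p[e]
    return r

# Hypothetical example in K[x, y]: exponents are (power of x, power of y).
basis = [{(1, 0): 1, (0, 2): -1},   # x - y^2
         {(0, 3): 1, (0, 0): -1}]   # y^3 - 1
f = {(2, 0): 1, (1, 1): 1}          # x^2 + x*y
nf = normal_form(f, basis)          # {(0, 1): 1, (0, 0): 1}, i.e. y + 1
```

Substituting $x = y^2$ and $y^3 = 1$ by hand gives $y^4 + y^3 = y + 1$, in agreement with `nf`. No monomial of the result is divisible by $x$ or $y^3$, which is the defining property of a normal form.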

B Homotopy Continuation Method

The homotopy continuation method is an algorithm for finding the solutions of simultaneous
polynomial equations numerically. See, for example, [19, 24] for more details of the
algorithm and theory.
We will explain the method briefly through a simple example of 2 equations in 2
unknowns.

Input: $f, g \in \mathbb{R}[x, y]$

Output: The solutions of $f(x, y) = g(x, y) = 0$.

Step 1 Select arbitrary polynomials of the form:

$$f_0(x, y) := f_0(x) := a_1 x^{d_1} - b_1 = 0, \qquad g_0(x, y) := g_0(y) := a_2 y^{d_2} - b_2 = 0 \qquad (6.8)$$

where $d_1 = \deg(f)$ and $d_2 = \deg(g)$. Polynomial equations in this form are
easy to solve.
Step 2 Take the convex combinations:

$$f_t(x, y) := t f(x, y) + (1 - t) f_0(x, y), \qquad g_t(x, y) := t g(x, y) + (1 - t) g_0(x, y);$$

then our target becomes the solution for $t = 1$.
Step 3 Compute the solution for $t = \kappa$, for small $\kappa$, numerically from the solution for $t = 0$.
Step 4 Repeat this until we obtain the solution for $t = 1$.

Figure 6.4 shows a sketch of the algorithm. This algorithm is called the (linear)
homotopy continuation method and is justified if each path connects t = 0 and t = 1
continuously without an intersection. That can be proved for almost all a and b.
See [19].

Fig. 6.4 Paths for the homotopy continuation method
For each computation by the homotopy continuation method, the number of
paths is the number of solutions of (6.8). In this case, the number of paths is $d_1 d_2$.
In the general case with $m$ unknowns, it becomes $\prod_{i=1}^{m} d_i$, and this causes a serious
problem for the computational cost. Therefore decreasing the degree of second-order
efficient estimators plays an important role for the homotopy continuation method.
Note that in order to solve this computational problem, the authors of [16] proposed
the nonlinear homotopy continuation methods (or the polyhedral continuation
methods). But as we can see in Sect. 6.5.2, the degree of the polynomials still affects
the computational costs.
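Steps 1 to 4 of the linear homotopy can be sketched in a few lines of plain Python. This is only an illustration under assumptions that are not in the paper: the target system (a circle and a hyperbola, both of degree 2), the complex constants of the start system, the fixed step count and the use of a Newton corrector at each value of t are all hypothetical choices. Production solvers such as those described in [19, 24] use adaptive step sizes and predictors.

```python
import cmath

# Illustrative target system: f = x^2 + y^2 - 4, g = x*y - 1.
def f(x, y):
    return x * x + y * y - 4

def g(x, y):
    return x * y - 1

D1, D2 = 2, 2                       # d1 = deg(f), d2 = deg(g)
# Arbitrary constants of the start system (6.8); hypothetical complex values.
A1, B1 = 0.6 + 0.8j, 0.5 - 1.1j
A2, B2 = -0.7 + 0.7j, 1.2 + 0.4j

def homotopy(x, y, t):
    """The pair (f_t, g_t) of Step 2."""
    return (t * f(x, y) + (1 - t) * (A1 * x**D1 - B1),
            t * g(x, y) + (1 - t) * (A2 * y**D2 - B2))

def jacobian(x, y, t):
    """Partial derivatives of (f_t, g_t) with respect to (x, y)."""
    return ((2 * t * x + (1 - t) * A1 * D1 * x**(D1 - 1), 2 * t * y),
            (t * y, t * x + (1 - t) * A2 * D2 * y**(D2 - 1)))

def newton(x, y, t, iterations=5):
    """Correct (x, y) towards a zero of (f_t, g_t) by Newton's method."""
    for _ in range(iterations):
        h1, h2 = homotopy(x, y, t)
        (a, b), (c, d) = jacobian(x, y, t)
        det = a * d - b * c
        x, y = x - (d * h1 - b * h2) / det, y - (a * h2 - c * h1) / det
    return x, y

def start_roots(ratio, degree):
    """All solutions z of z^degree = ratio (Step 1)."""
    base = ratio ** (1.0 / degree)
    return [base * cmath.exp(2j * cmath.pi * k / degree) for k in range(degree)]

def solve(steps=1000):
    solutions = []
    for x0 in start_roots(B1 / A1, D1):          # d1 * d2 paths in total
        for y0 in start_roots(B2 / A2, D2):
            x, y = x0, y0
            for k in range(1, steps + 1):        # Steps 3 and 4
                x, y = newton(x, y, k / steps)
            solutions.append((x, y))
    return solutions

solutions = solve()
residuals = [max(abs(f(x, y)), abs(g(x, y))) for x, y in solutions]
```

The loop tracks $d_1 d_2 = 4$ paths, which is why the product of the degrees drives the cost. The end points should be checked through `residuals`, since a fixed step size offers no guarantee against path jumping.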

References

1. Adler, R.J., Taylor, J.E.: Random Fields and Geometry. Springer Monographs in Mathematics.
Springer, New York (2007)
2. Amari, S., Nagaoka, H.: Methods of Information Geometry, vol. 191. American Mathematical
Society, Providence (2007)
3. Amari, S.: Differential geometry of curved exponential families-curvatures and information
loss. Ann. Stat. 10, 357–385 (1982)
4. Amari, S.: Differential-Geometrical Methods in Statistics. Springer, New York (1985)
5. Amari, S., Kumon, M.: Differential geometry of Edgeworth expansions in curved exponential
family. Ann. Inst. Stat. Math. 35(1), 1–24 (1983)
6. Andersson, S., Madsen, J.: Symmetry and lattice conditional independence in a multivariate
normal distribution. Ann. Stat. 26(2), 525–572 (1998)
7. Bergsma, W.P., Croon, M., Hagenaars, J.A.: Marginal Models for Dependent, Clustered, and
Longitudinal Categorical Data. Springer, New York (2009)
8. Buchberger, B.: Bruno Buchberger’s PhD thesis 1965: an algorithm for finding the basis ele-
ments of the residue class ring of a zero dimensional polynomial ideal. J. Symbol. Comput.
41(3), 475–511 (2006)
9. Cox, D.A., Little, J., O’Shea, D.: Ideals, Varieties, and Algorithms: An Introduction to Com-
putational Algebraic Geometry and Commutative Algebra, 3/e (Undergraduate Texts in Math-
ematics). Springer, New York (2007)

10. Dawid, A.P.: Further comments on some comments on a paper by Bradley Efron. Ann. Stat.
5(6), 1249 (1977)
11. Drton, M., Sturmfels, B., Sullivant, S.: Lectures on Algebraic Statistics. Springer, New York
(2009)
12. Drton, M.: Likelihood ratio tests and singularities. Ann. Stat. 37, 979–1012 (2009)
13. Efron, B.: Defining the curvature of a statistical problem (with applications to second order
efficiency). Ann. Stat. 3, 1189–1242 (1975)
14. Gehrmann, H., Lauritzen, S.L.: Estimation of means in graphical Gaussian models with sym-
metries. Ann. Stat. 40(2), 1061–1073 (2012)
15. Gibilisco, P., Riccomagno, E., Rogantin, M.P., Wynn, H.P.: Algebraic and Geometric Methods
in Statistics. Cambridge University Press, Cambridge (2009)
16. Huber, B., Sturmfels, B.: A polyhedral method for solving sparse polynomial systems. Math.
Comput. 64, 1541–1555 (1995)
17. Kuriki, S., Takemura, A.: On the equivalence of the tube and Euler characteristic methods for
the distribution of the maximum of Gaussian fields over piecewise smooth domains. Ann. Appl.
Probab. 12(2), 768–796 (2002)
18. Lee, T.L., Li, T.Y., Tsai, C.H.: HOM4PS2.0: a software package for solving polynomial systems
by the polyhedral homotopy continuation method. Computing 83(2–3), 109–133 (2008)
19. Li, T.Y.: Numerical solution of multivariate polynomial systems by homotopy continuation
methods. Acta Numer. 6(1), 399–436 (1997)
20. Naiman, D.Q.: Conservative confidence bands in curvilinear regression. Ann. Stat. 14(3), 896–
906 (1986)
21. Pistone, G., Sempi, C.: An infinite-dimensional geometric structure on the space of all the
probability measures equivalent to a given one. Ann. Stat. 23, 1543–1561 (1995)
22. Pistone, G., Wynn, H.P.: Generalised confounding with Gröbner bases. Biometrika 83, 653–666
(1996)
23. Rao, C.R.: Information and accuracy attainable in the estimation of statistical parameters. Bull.
Calcutta Math. Soc. 37(3), 81–91 (1945)
24. Verschelde, J.: Algorithm 795: PHCpack: a general-purpose solver for polynomial systems by
homotopy continuation. ACM Trans. Math. Softw. (TOMS) 25(2), 251–276 (1999)
25. Watanabe, S.: Algebraic analysis for singular statistical estimation. In: Watanabe, O., Yokomori,
T. (eds.) Algorithmic Learning Theory. Springer, Berlin (1999)
26. Weyl, H.: On the volume of tubes. Am. J. Math. 61, 461–472 (1939)
Chapter 7
Eidetic Reduction of Information Geometry
Through Legendre Duality of Koszul
Characteristic Function and Entropy: From
Massieu–Duhem Potentials to Geometric
Souriau Temperature and Balian Quantum
Fisher Metric

Frédéric Barbaresco

Abstract Based on the Koszul theory of sharp convex cones and its hessian geometry,
the Information Geometry metric is introduced through the Koszul form as the hessian of the
Koszul–Vinberg Characteristic Function Logarithm (KVCFL). The front of the Legendre
mapping of this KVCFL is the graph of a convex function, the Legendre transform of this
KVCFL. By analogy with the dual Massieu–Duhem potentials of thermodynamics (Free
Energy and Entropy), the Legendre transform of the KVCFL is interpreted as a “Koszul
Entropy”. This Legendre duality is considered in the more general framework of Contact
Geometry, the odd-dimensional twin of symplectic geometry, with Legendre fibration
and mapping. Other analogies are introduced with large deviation theory,
with Cumulant Generating and Rate functions (Legendre duality by the Laplace Principle),
and with Legendre duality in Mechanics between Hamiltonian and Lagrangian.
In all these domains, we observe that the “Characteristic function” and its derivatives
capture all the information of the random variable, system or physical model. We present
two extensions of this theory, with Souriau's Geometric Temperature deduced from a
covariant definition of thermodynamic equilibria, and with the Balian quantum Fisher
metric defined and built as the hessian of von Neumann entropy. Finally, we apply Koszul
geometry to cones of Symmetric/Hermitian Positive Definite Matrices, and more
particularly to covariance matrices of stationary signals, which are characterized by
specific matrix structures: Toeplitz Hermitian Positive Definite Matrix structure
(covariance matrix of a stationary time series) or Toeplitz-Block-Toeplitz Hermitian
Positive Definite Matrix structure (covariance matrix of a stationary space–time series).
By extension, we introduce a new geometry for non-stationary signals through the Fréchet
metric space of geodesic paths on structured matrix manifolds. We conclude with
extensions towards two general concepts of “Generating Inner Product” and
“Generating Function”.

F. Barbaresco
Thales Air Systems, Voie Pierre-Gilles de Gennes, F91470 Limours, France

F. Nielsen (ed.), Geometric Theory of Information,
Signals and Communication Technology, DOI: 10.1007/978-3-319-05317-2_7,
© Springer International Publishing Switzerland 2014

Keywords Koszul characteristic function · Koszul entropy · Koszul forms · Laplace
principle · Massieu–Duhem potential · Projective Legendre duality · Contact
geometry · Information geometry · Souriau geometric temperature · Balian quantum
information metric · Cartan–Siegel homogeneous bounded domains · Generating
function

7.1 Preamble

The Koszul–Vinberg Characteristic Function (KVCF) is a dense knot in important
mathematical fields such as Hessian Geometry, Kählerian Geometry, Affine Differential
Geometry and Information Geometry. As the essence of Information Geometry,
this paper develops the KVCF as a transverse concept in Thermodynamics, Probability,
Large Deviation Theory, Statistical Physics and Quantum Physics. From the
general KVCF definition, the paper defines the Koszul Entropy, which coincides with the
Legendre transform of minus the logarithm of the KVCF, and interprets both by analogy
with the dual Massieu–Duhem potentials of thermodynamics.
In a second step, the paper proposes to synthesize and unify the three following
models: the Koszul model (with Koszul entropy) in Hessian Geometry, the Souriau model
in statistical physics (with a new notion of geometric temperature) and the Balian model
in quantum physics (with the Fisher metric as hessian of von Neumann entropy). These
three models are formally similar in the characteristic function, the entropy, the Legendre
transform, the dual coordinate system, the Hessian metric and the Fisher metric. These facts
suggest that there exist a close inter-relation and a beautiful inner principle across
these fields of research.
Finally, the paper studies the space of symmetric/Hermitian positive definite matrices
from the viewpoint of the Koszul theory through the KVCF tool, and extends the theory
to covariance matrices of stationary and non-stationary signals characterized by
Toeplitz matrix structure.
The paper concludes with extensions towards two general concepts of “Generating
Inner Product” and “Generating Function”. This tutorial paper will help readers to
have a synthetic view of different domains of science where the Hessian metric manifold
is the essence, emerging from Information Geometry Theory.
This paper is not only a survey of different fields of science unified by Koszul
theory, but some new results are introduced:

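• Computation of Koszul Entropy from the general definition of the Koszul Characteristic
function $\Delta_\Phi(x) = \int_{\Phi^*} e^{-\langle \Gamma, x \rangle}\, d\Gamma$, $\forall x \in \Phi$, and the Legendre transform:

$$\Omega^*(x^*) = \langle x, x^* \rangle - \Omega(x) \quad \text{with } x^* = D_x \Omega \text{ and } x = D_{x^*} \Omega^*, \text{ where } \Omega(x) = -\log \Delta_\Phi(x)$$

$$\Omega^*(x^*) = \Omega^*\!\left( \int_{\Phi^*} \Gamma \cdot p_x(\Gamma)\, d\Gamma \right) = -\int_{\Phi^*} p_x(\Gamma) \log p_x(\Gamma)\, d\Gamma \quad \text{with } x^* = \int_{\Phi^*} \Gamma \cdot p_x(\Gamma)\, d\Gamma$$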

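• Definition of Koszul density by analogy of Koszul Entropy with Shannon Entropy

$$p_{\bar{\Gamma}}(\Gamma) = \frac{e^{-\langle \Gamma, \Psi^{-1}(\bar{\Gamma}) \rangle}}{\int_{\Phi^*} e^{-\langle \Gamma, \Psi^{-1}(\bar{\Gamma}) \rangle}\, d\Gamma} \quad \text{with } x = \Psi^{-1}(\bar{\Gamma}) \text{ and } \bar{\Gamma} = \Psi(x) = \frac{d\Omega(x)}{dx}$$

where $\bar{\Gamma} = \int_{\Phi^*} \Gamma \cdot p_{\bar{\Gamma}}(\Gamma)\, d\Gamma$ and $\Omega(x) = -\log \int_{\Phi^*} e^{-\langle x, \Gamma \rangle}\, d\Gamma$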
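• Property that the Barycenter of Koszul Entropy is equal to the Entropy of the Barycenter

$$E\left[ \Omega^*(\Gamma) \right] = \Omega^*\left( E[\Gamma] \right), \quad \forall \Gamma \in \Phi^*$$

$$\int_{\Phi^*} \Omega^*(\Gamma)\, p_x(\Gamma)\, d\Gamma = \Omega^*\!\left( \int_{\Phi^*} \Gamma \cdot p_x(\Gamma)\, d\Gamma \right), \quad \forall \Gamma \in \Phi^*$$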

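• Property that the Koszul density is the classical Maximum Entropy solution

$$\max_{p_x(\cdot)} \left[ -\int_{\Phi^*} p_x(\Gamma) \log p_x(\Gamma)\, d\Gamma \right] \quad \text{such that} \quad \int_{\Phi^*} p_x(\Gamma)\, d\Gamma = 1 \ \text{ and } \ \int_{\Phi^*} \Gamma \cdot p_x(\Gamma)\, d\Gamma = x^*$$

$$\Rightarrow \quad p_x(\Gamma) = \frac{e^{-\langle x, \Gamma \rangle}}{\int_{\Phi^*} e^{-\langle x, \Gamma \rangle}\, d\Gamma}, \qquad x = \Psi^{-1}(x^*), \quad \Psi(x) = \frac{d\Omega(x)}{dx}$$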

• Strict equivalence between the Hessian Koszul metric and the Fisher metric from
Information Geometry

$$I(x) = -E_\Gamma\!\left[ \frac{\partial^2 \log p_x(\Gamma)}{\partial x^2} \right] = \frac{\partial^2 \log \Delta_\Phi(x)}{\partial x^2}$$

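• Interpretation of Information Geometry Dual Potentials in the framework of Contact
Geometry through the concept of Legendre fibration and Legendre mapping.
• Computation of Koszul Density for Symmetric/Hermitian Positive Definite
Matrices

$$p_x(\Gamma) = c_{\bar{\Gamma}} \cdot \beta_K \det\!\left( \bar{\Gamma}^{-1} \right)^{\frac{n+1}{2}} e^{-\frac{n+1}{2} \mathrm{Tr}\left( \bar{\Gamma}^{-1} \Gamma \right)} \quad \text{with } \beta_K = \left( \frac{n+1}{2} \right)^{\frac{n(n+1)}{2}} \text{ and } \bar{\Gamma} = \int_{\Phi^*} \Gamma \cdot p_x(\Gamma) \cdot d\Gamma$$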

• Computation of the Koszul Metric for (Toeplitz-Block) Toeplitz Hermitian Positive
Definite covariance matrices of stationary signals by means of the Trench/Verblunsky
Theorem and the Complex Autoregressive model; this proves that a stationary space–time
state of an Electromagnetic Field is coded by one point in the product space
$\mathbb{R} \times D^{m-1} \times SD^{n-1}$, where $D$ is the Poincaré unit disk and $SD$ is the Siegel unit disk.
The metric is given by:

$$ds^2 = n \cdot m \cdot \left( d \log(P_0) \right)^2 + n \sum_{i=1}^{m-1} (m - i) \frac{|d\mu_i|^2}{\left( 1 - |\mu_i|^2 \right)^2} + \sum_{k=1}^{n-1} (n - k)\, \mathrm{Tr}\!\left[ \left( I - A_k^k A_k^{k+} \right)^{-1} dA_k^k \left( I - A_k^{k+} A_k^k \right)^{-1} dA_k^{k+} \right]$$

with $\left( R_0, A_1^1, \ldots, A_{n-1}^{n-1} \right) \in THPD_m \times SD^{n-1}$, where $SD = \{ Z \,/\, ZZ^+ < I_m \}$,
and $R_0 \to \left( \log(P_0), \mu_1, \ldots, \mu_{m-1} \right) \in \mathbb{R} \times D^{m-1}$, where $D = \{ z \,/\, zz^* < 1 \}$
• New model of non-stationary signal where one time series can be split into sev-
eral stationary signals on a shorter time scale, represented by time sequence of
stationary covariance matrices or a geodesic polygon/path on covariance matrix
manifold.
• Definition of the distance between non-stationary signals as the Fréchet distance between
two geodesic paths in abstract metric spaces of the covariance matrix manifold.

$$d_{\text{Fréchet}}(R_1, R_2) = \inf_{\alpha, \beta} \max_{t \in [0,1]} d_{geo}\left( R_1(\alpha(t)), R_2(\beta(t)) \right)$$

$$\text{with } d_{geo}^2\left( R_1(\alpha(t)), R_2(\beta(t)) \right) = \left\| \log\!\left( R_1^{-1/2}(\alpha(t))\, R_2(\beta(t))\, R_1^{-1/2}(\alpha(t)) \right) \right\|^2$$

• Introduction of a modified Fréchet distance to take into account the close dependence
of elements between points of time series paths

$$d_{\text{geo-path}}(R_1, R_2) = \inf_{\alpha, \beta} \int_0^1 d_{geo}\left( R_1(\alpha(t)), R_2(\beta(t)) \right) dt$$
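As a one-dimensional illustration of the first two items of this list, the following sketch evaluates the characteristic function, the density, its barycenter and its entropy numerically and compares the entropy with the Legendre transform $\langle x, x^* \rangle - \Omega(x)$. The setting is an assumption made for illustration only and is not taken from the chapter: the cone is $(0, +\infty)$ with itself as dual cone, the integrals are truncated and computed by a trapezoidal rule, and all names are hypothetical. In this case $\Delta_\Phi(x) = 1/x$ and the density is the exponential law $x e^{-x\Gamma}$.

```python
import math

def trapezoid(fn, a, b, n=100000):
    """Composite trapezoidal rule for the integral of fn over [a, b]."""
    h = (b - a) / n
    total = 0.5 * (fn(a) + fn(b))
    for i in range(1, n):
        total += fn(a + i * h)
    return total * h

x = 2.0                # a point of the cone (0, +inf)
upper = 40.0 / x       # truncation of the dual cone; the neglected tail is ~ e^{-40}

# Characteristic function and its potential Omega(x) = -log Delta(x).
delta = trapezoid(lambda g: math.exp(-x * g), 0.0, upper)
omega = -math.log(delta)

def density(g):
    """Density p_x on the dual cone, normalised by the characteristic function."""
    return math.exp(-x * g) / delta

# Barycenter x* of the density and its Shannon entropy.
x_star = trapezoid(lambda g: g * density(g), 0.0, upper)
entropy = -trapezoid(lambda g: density(g) * math.log(density(g)), 0.0, upper)

# Legendre transform of Omega evaluated at x*.
legendre = x * x_star - omega
```

For $x = 2$ the closed forms are $\Delta_\Phi(x) = 1/2$, $x^* = 1/x = 1/2$ and entropy $1 - \log 2$, and `entropy` and `legendre` agree up to the quadrature error.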

We will begin this exposition by a global view of all geometries that are interrelated
through the cornerstone concept of Koszul–Vinberg characteristic function and
metric.

7.2 Characteristic Function in Geometry/Statistic/Thermodynamic

In the title of the paper, “Eidetic Reduction” is a reference to Husserl's phenomenology
[149], but it should be understood in the context of Bergson's definition
of “Eidos” (synonymous with “Idea” or “morphe” in Greek philosophy), emerging
from his philosophy during the period called “Moment 1900” by Frédéric Worms
[152–154], which is still decisive a hundred years later. In 1907, Henri Bergson
(philosopher, winner of the 1927 Nobel Prize in Literature, Professor at the Collège de
France, member of the Académie Française, awarded in 1930 the highest French honour,
the Grand-Croix de la Légion d'honneur) gave his definition of the Greek concept of “Eidos”
in the book “Creative Evolution” [150, 151]:
The word, eidos, which we translate here by “Idea”, has, in fact, this threefold meaning.
It denotes (1) the quality, (2) the form or essence, (3) the end or design (in the sense of
intention) of the act being performed, that is to say, at bottom, the design (in the sense of
drawing) of the act supposed accomplished. These three aspects are those of the adjective,
substantive and verb, and correspond to the three essential categories of language. After the
explanations we have given above, we might, and perhaps we ought to, translated eidos by
“view” or rather by “moment.” For eidos is the stable view taken of the instability of things:
the quality, which is a moment of becoming; the form which is a moment of evolution; the
essence, which is the mean form above and below which the other forms are arranged as
alterations of the mean; finally, the intention or mental design which presides over the action
being accomplished, and which is nothing else, we said, than the material design, traced
out and contemplated beforehand, of the action accomplished. To reduce things to Ideas is
therefore to resolve becoming into its principal moments, each of these being, moreover,
by the hypothesis, screened from the laws of time and, as it were, plucked out of eternity.
That is to say that we end in the philosophy of Ideas when we apply the cinematographical
mechanism of the intellect to the analysis of the real.

The paper title should then be considered as an echo of this citation of Bergson on
“Eidos”, with a twofold meaning in our development: “Eidos” considered with the meaning
of “equilibrium state”, as defined by Bergson (“Eidos is the stable view taken of
the instability of things”), but also “Eidos” thought of as “Eidetic Reduction” to the
“Essence” of Information Geometry, from which we will try to favor the emergence of the
“Koszul characteristic function” concept. Eidetic reduction is considered as the study of
“Information Geometry” reduced to a necessary essence, in which we identify
its basic and invariable components, which are then declined in different domains of
Science (mathematics, physics, information theory, …).
As the essence of Information Geometry, we will develop the “Koszul Characteristic
Function”, a transverse concept in Thermodynamics (linked with Massieu–Duhem
Potentials), in Probability (linked with the Poincaré moment generating function), in
Large Deviations Theory (linked with the Laplace Principle), in Mechanics (linked with
Contact Geometry), in Geometric Physics (linked with the Souriau Geometric temperature
in Mechanical Symplectic geometry) and in Quantum Physics (linked with the
Balian Quantum hessian metric). We will try to explore the close inter-relations between
these domains through geometric tools developed by Jean-Louis Koszul (Koszul
forms, …).

Fig. 7.1 Text of Poincaré lecture on thermodynamic with development of the concept of “Massieu
characteristic function”

First, we will observe that derivatives of the Koszul–Vinberg Characteristic
Function Logarithm (KVCFL) $\log \Delta_\Phi(x) = \log \int_{\Phi^*} e^{-\langle \Gamma, x \rangle}\, d\Gamma$ are invariant under the
automorphisms of the convex cone $\Phi$, and so the KVCFL Hessian naturally defines a
Riemannian metric, which is the basic root of development in Information Geometry.
In thermodynamics, François Massieu was the first to introduce the concept of
characteristic function $\phi$ (or Massieu–Duhem potential). This characteristic function or
thermodynamic potential is able to provide all body properties from its derivatives.
Among all thermodynamic Massieu–Duhem potentials, Entropy $S$ is derived from the
Legendre–Moreau transform of the characteristic function logarithm $\phi$: $S = \phi - \beta \cdot \frac{\partial \phi}{\partial \beta}$
with $\beta = \frac{1}{kT}$ the thermodynamic temperature.
The most popular notion of “characteristic function” was first introduced by Henri
Poincaré in his lecture on probability [40], using the property that all moments of
statistical laws can be deduced from its derivatives. Paul Lévy made systematic
use of this concept. We assume that Poincaré was influenced by his school fellow
at Ecole des Mines de Paris, François Massieu, and his work on thermodynamic
potentials (generalized by Pierre Duhem in an Energetic Theory). This assertion is
corroborated by the observation that Poincaré added, in the second edition of his lecture
on thermodynamics [39], one chapter on the “Massieu characteristic function”
with many developments and applications (Fig. 7.1).
In Large Deviation Theory, the Laplace Principle allows one to introduce the (scaled
cumulant) Generating Function $\phi(x) = \lim_{n \to \infty} \frac{1}{n} \log \Delta_n(x)$ with
$\Delta_n(x) = \int e^{n x \varsigma}\, p(\Sigma_n = \varsigma)\, d\varsigma$ as the most natural element to characterize the behavior of a
statistical variable for large deviations. By use of the Laplace principle and the fact that
the asymptotic behavior of an integral is determined by the largest value of the
integrand, it can be proved that the (scaled cumulant) Generating Function $\phi(x)$ is the
Legendre transform of the rate Function $I(\varsigma) = \lim_{n \to \infty} -\frac{1}{n} \log P(\Sigma_n \in d\varsigma)$.
For all these cases, we can observe that the “Characteristic function” and its
derivatives capture all information of random variable, system or physical model.
Furthermore, the general notion of Entropy could be naturally defined by the Legen-
dre Transform of the Koszul characteristic function logarithm. In the general case,
Legendre transform of KVCFL will be designated as “Koszul Entropy”.
This general notion of “characteristic function” has been generalized in two dif-
ferent directions by physicists Jean-Marie Souriau and Roger Balian.
In 1970, Jean-Marie Souriau, who had been a student of Élie Cartan, introduced the
concept of coadjoint action of a group on its momentum space (or “moment map”),
based on the orbit method, which allows physical observables like
energy and momentum to be defined as pure geometrical objects. For Jean-Marie Souriau,
equilibrium states are indexed by a geometric parameter β with values in the Lie algebra
of the Lorentz–Poincaré group. The Souriau approach generalizes the Gibbs equilibrium
states, β playing the role of temperature. The invariance with respect to the group, and
the fact that the entropy S is a convex function of β, impose very strict conditions
that allow Souriau to interpret β as a space–time vector (the temperature vector
of Planck), giving the metric tensor g a null Lie derivative. In our development,
before exposing the Souriau theory, we will introduce the Legendre Duality between the
variational Euler–Lagrange and the symplectic Hamilton–Jacobi formulations of the
equations of motion and will introduce the Cartan–Poincaré invariant.
In 1986, Roger Balian introduced a natural metric structure for the space of
states $\hat{D}$ in quantum mechanics, from which we can deduce the distance between a
state and one of its approximations. Based on quantum information theory, Roger
Balian built this metric on physical grounds as the hessian metric $ds^2 = -d^2 S(\hat{D})$ of
von Neumann's entropy $S(\hat{D}) = -\mathrm{Tr}\left[ \hat{D} \ln(\hat{D}) \right]$. Balian then recovered
the same relations as in classical statistical physics, with a “quantum characteristic
function” logarithm $F(\hat{X}) = \ln \mathrm{Tr} \exp \hat{X}$, the Legendre transform of the von Neumann
Entropy $S(\hat{D}) = F(\hat{X}) - \langle \hat{D}, \hat{X} \rangle$. In this framework, Balian introduced the notion
of “Relevant Entropy”.
We will synthesize all these analogies in a table for the three models of Koszul,
Souriau and Balian (the Information Geometry case being a particular case of Koszul
geometry).
In the last sections, we will apply the Koszul theory to define the geometry of
Symmetric/Hermitian Positive Definite Matrices, and more particularly of covariance
matrices of stationary signals, which are characterized by specific matrix structures:
Toeplitz Hermitian Positive Definite Matrix structure (covariance matrix of a
stationary time series) or Toeplitz-Block-Toeplitz Hermitian Positive Definite Matrix
structure (covariance matrix of a stationary space–time series). We will see that the
“Toeplitz” matrix structure can be captured by a complex autoregressive model
parameterization. This parameterization can be introduced naturally, without
arbitrariness, through Trench's theorem (or equivalently Verblunsky's theorem) or through
the Partial Iwasawa decomposition. By extension, we introduce a new geometry for
non-stationary signals through the Fréchet metric space of geodesic paths on structured matrix
manifolds.
We conclude with two general concepts of “Generating Inner Product” and
“Generating Function” that extend previous developments.
The Koszul–Vinberg characteristic function is a dense knot in mathematics and
can be introduced in the framework of different geometries: Hessian Geometry
(Jean-Louis Koszul work), Homogeneous convex cones geometry (Ernest Vinberg
work), Homogeneous Symmetric Bounded Domains Geometry (Élie Cartan and Carl
Ludwig Siegel works), Symplectic Geometry (Thomas von Friedrich and Jean-Marie
Souriau work), Affine Geometry (Takeshi Sasaki and Eugenio Calabi works) and
Information Geometry (Calyampudi Rao [193] and Nikolai Chentsov [194] works).

Fig. 7.2 Landscape of geometric science of information and key cornerstone position
of “Koszul–Vinberg characteristic function”

Through Legendre duality, Contact Geometry (Vladimir Arnold work) is considered
as the odd-dimensional twin of symplectic geometry and can be used to understand the
Legendre mapping in Information Geometry. The Fisher metric of Information Geometry
can be introduced as a hessian metric from the Koszul–Vinberg characteristic function
logarithm or from the Koszul Entropy (Legendre transform of the Koszul–Vinberg
characteristic function logarithm).
In a more general context, we can consider Information Geometry in the framework
of “Geometric Science of Information” (see the first SEE/SMF GSI'13 conference
on this topic: http://www.gsi2013.org, organized at Ecole des Mines de Paris in
August 2013). Geometric Science of Information also includes Probability in Metric
Space (Maurice Fréchet work), Probability/Geometry on structures (Yann Ollivier
and Misha Gromov works) and Probability on Riemannian Manifold (Michel Emery
and Marc Arnaudon works).
In Fig. 7.2, we give the general landscape of “Geometric Science of Informa-
tion” where “Koszul–Vinberg characteristic function” appears as a key cornerstone
between different geometries.

7.3 Legendre Duality and Projective Duality

In the following sections, we will see that the logarithm of the Characteristic function and the
Entropy are related by the Legendre transform, which we can consider in the context of
projective duality. Duality is an old and very fruitful Idea (“Eidos”) in Mathematics
that has been constantly generalized [59–65]. A duality translates concepts, theorems
or mathematical structures into other concepts, theorems or structures, in a one-to-one
fashion, often by means of an involution operation and sometimes with fixed
points.
The simplest duality is linear duality in the plane between points and lines (two
different points can be joined by a unique line; two different lines meet in one point
unless they are parallel). By adding some points at infinity (to avoid the particular case
of parallel lines) we obtain the projective plane, in which the duality is given by a
symmetrical relationship between points and lines. This leads to the classical principle
of projective duality, where the dual of a theorem is also a theorem.

Fig. 7.3 (On the left) Pascal's theorem, (on the right) Brianchon's theorem
The most famous example is given by Pascal's theorem (the Hexagrammum Mysticum
Theorem), stating that:
• If the vertices of a simple hexagon are points of a point conic, then its diagonal
points are collinear: If an arbitrary six points are chosen on a conic (i.e., ellipse,
parabola or hyperbola) and joined by line segments in any order to form a hexagon,
then the three pairs of opposite sides of the hexagon (extended if necessary) meet
in three points which lie on a straight line, called the Pascal line of the hexagon
The dual of Pascal’s Theorem is known as Brianchon’s Theorem:

• If the sides of a simple hexagon are lines of a line conic, then the diagonal lines
are concurrent (Fig. 7.3).

The Legendre(–Moreau) transform [158, 165] is an operation from convex functions
on a vector space to functions on the dual space. The Legendre transform is
related to projective duality and tangential coordinates in algebraic geometry, and
to the construction of dual Banach spaces in analysis. The classical Legendre transform
in Euclidean space is given by fixing a scalar product $\langle \cdot, \cdot \rangle$ on $\mathbb{R}^n$. For a function
$F: \mathbb{R}^n \to \mathbb{R} \cup \{\pm\infty\}$ let:

$$G(y) = LF(y) = \sup_x \left\{ \langle y, x \rangle - F(x) \right\} \qquad (7.1)$$

This is an involution on the class of convex lower semi-continuous functions
on $\mathbb{R}^n$.
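Definition (7.1) and its involution property can be checked numerically in one dimension. The sketch below is an illustration under assumptions that are not in the text: the supremum is replaced by a maximum over a finite grid on $[-5, 5]$, and the two test functions $F(x) = x^2/2$ and $F(x) = e^x$ are hypothetical choices with known transforms $y^2/2$ and $y \log y - y$.

```python
import math

def legendre(F, grid):
    """Return G(y) = max over the grid of (y*x - F(x)), a discrete version of (7.1)."""
    def G(y):
        return max(y * x - F(x) for x in grid)
    return G

# Grid approximation of the real line; step 0.01 on [-5, 5] (an assumption).
grid = [i / 100 for i in range(-500, 501)]

def quadratic(x):
    return 0.5 * x * x

G = legendre(quadratic, grid)        # expected: G(y) = y^2 / 2
GG = legendre(G, grid)               # transform applied twice

g_value = G(1.5)                     # close to 1.125
back_value = GG(0.7)                 # close to quadratic(0.7) = 0.245

# Second example: F(x) = exp(x) has transform y*log(y) - y for y > 0.
H = legendre(math.exp, grid)
h_value = H(2.0)                     # close to 2*log(2) - 2
```

Applying the transform twice returns the original convex function on the grid, which is the involution stated above. The second example produces the $y \log y - y$ form, which is the kind of term that appears in entropy expressions.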
There are two dual possibilities to describe a function. We can either describe its graph
as a set of points, or we may regard the graph as the envelope of its tangent planes (Fig. 7.4).
In the framework of Information geometry, the Bregman divergence appears essentially
as the canonical divergence on a dually flat manifold equipped with a pair of
biorthogonal coordinates induced from a pair of “potential functions” under the
Legendre transform [159].
To illustrate the role of the Legendre transform in Information Geometry, we provide
a canonical example, with the relations for the Multivariate Normal Gaussian Law
N (m, R):

Fig. 7.4 Legendre transform G(y) of F(x)

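• Dual coordinate systems:

$$\tilde{\Psi} = (\theta, \Psi) = \left( R^{-1} m, (2R)^{-1} \right), \qquad \tilde{H} = (\eta, H) = \left( m, -\left( R + m m^T \right) \right) \qquad (7.2)$$

• Dual potential functions:

$$\tilde{\Phi}(\tilde{\Psi}) = 2^{-2}\, \mathrm{Tr}\left( \Psi^{-1} \theta \theta^T \right) - 2^{-1} \log(\det \Psi) + 2^{-1} n \log(2 \pi e)$$

$$\tilde{\Omega}(\tilde{H}) = -2^{-1} \log\left( 1 + \eta^T H^{-1} \eta \right) - 2^{-1} \log(\det(-H)) - 2^{-1} n \log(2 \pi e) \qquad (7.3)$$

related by the Legendre transform:

$$\tilde{\Omega}(\tilde{H}) = \langle \tilde{\Psi}, \tilde{H} \rangle - \tilde{\Phi}(\tilde{\Psi}) \quad \text{with } \langle \tilde{\Psi}, \tilde{H} \rangle = \mathrm{Tr}\left( \theta \eta^T + \Psi H^T \right) \qquad (7.4)$$

$$\frac{\partial \tilde{\Phi}}{\partial \theta} = \eta, \quad \frac{\partial \tilde{\Phi}}{\partial \Psi} = H \quad \text{and} \quad \frac{\partial \tilde{\Omega}}{\partial \eta} = \theta, \quad \frac{\partial \tilde{\Omega}}{\partial H} = \Psi \qquad (7.5)$$

with $\tilde{\Omega}(\tilde{H}) = E[\log p]$ the Entropy.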
In the theory of Information Geometry introduced by Rao [193] and Chentsov
[194], a Riemannian manifold is then defined by a metric tensor given by hessian of
these dual potential functions:

˜
∂ 2 ∂ 2 Ω̃
gij = and gij⊂ = (7.6)
∂ Ψ̃i ∂ Ψ̃j ∂ H̃i ∂ H̃j
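The statement that the second potential of (7.3), written in the coordinates (η, H) of (7.2), equals E[log p] can be checked numerically. The sketch below is only an illustration under assumed values: the covariance R, the mean m and the helper name `omega_tilde` are choices made for the example, and the comparison uses the standard closed form E[log p] = −½ log det R − (n/2) log(2πe) for N(m, R).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
A = rng.normal(size=(n, n))
R = A @ A.T + n * np.eye(n)      # arbitrary SPD covariance (illustrative)
m = rng.normal(size=n)           # arbitrary mean (illustrative)

# Expectation coordinates of (7.2)
eta = m
H = -(R + np.outer(m, m))

def omega_tilde(eta, H):
    """Second potential of (7.3), evaluated from (eta, H)."""
    dim = eta.size
    quad = eta @ np.linalg.solve(H, eta)          # eta^T H^{-1} eta
    logdet_minus_H = np.linalg.slogdet(-H)[1]     # log det(-H)
    return (-0.5 * np.log(1.0 + quad)
            - 0.5 * logdet_minus_H
            - 0.5 * dim * np.log(2.0 * np.pi * np.e))

# Closed form of E[log p] for the Gaussian law N(m, R)
expected_log_p = (-0.5 * np.linalg.slogdet(R)[1]
                  - 0.5 * n * np.log(2.0 * np.pi * np.e))
gap = abs(omega_tilde(eta, H) - expected_log_p)
print(gap)
```

The two terms involving η cancel because det(−H) = det(R)(1 + mᵀR⁻¹m) and 1 + ηᵀH⁻¹η = 1/(1 + mᵀR⁻¹m).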
7 Eidetic Reduction of Information Geometry 151

One of the important concepts in information geometry is that of mutually dual (conjugate) affine connections, which can be studied in the framework of Hessian or affine differential geometry. For more details and other developments on “dual affine connections” and “alpha-connections”, we invite you to read the two books by Shun-ichi Amari [191, 192].
In this paper, we will not develop the concepts of dual affine connections, but the theory of “Hessian manifolds” that was initially studied by Jean-Louis Koszul in a more general framework. In Sect. 7.4, we present the theory of the Koszul–Vinberg characteristic function on sharp convex cones, which will be presented as a general framework for Information Geometry.

7.4 Koszul Characteristic Function/Entropy by Legendre Duality

We define the Koszul–Vinberg Hessian metric on a sharp convex cone, and observe that the Fisher information metric of Information Geometry coincides with the canonical Koszul Hessian metric (Koszul form) [1, 27–32]. We also observe that Legendre duality (the Legendre transform of the logarithm of the Koszul characteristic function) allows us to introduce a Koszul Entropy, which plays the role of a general definition of Entropy.

7.4.1 Koszul–Vinberg Characteristic Function and Metric of Convex Sharp Cone

Koszul [1, 27, 32] and Vinberg [33, 146] have introduced an affinely invariant Hessian metric on a sharp convex cone Φ through its characteristic function Δ.
In the following, Φ is a sharp open convex cone in a vector space E of finite dimension over R (a convex cone is sharp if it does not contain any full straight line). In the dual space $E^*$ of E, $\Phi^*$ is the set of linear forms that are strictly positive on $\overline{\Phi} - \{0\}$; $\Phi^*$ is the dual cone of Φ and is a sharp open convex cone. If $\Gamma \in \Phi^*$, then the intersection $\Phi \cap \{x \in E \,/\, \langle x, \Gamma\rangle = 1\}$ is bounded. $G = \mathrm{Aut}(\Phi)$ is the group of linear transformations of E that preserve Φ. $G = \mathrm{Aut}(\Phi)$ operates on $\Phi^*$ by: $\forall g \in G = \mathrm{Aut}(\Phi)$, $\forall \Gamma \in E^*$, $\tilde{g}\cdot\Gamma = \Gamma\circ g^{-1}$.
Koszul–Vinberg characteristic function definition: let $d\Gamma$ be the Lebesgue measure on $E^*$. The following integral:

$$\Delta_\Phi(x) = \int_{\Phi^*} e^{-\langle\Gamma, x\rangle}\,d\Gamma \quad \forall x \in \Phi \tag{7.7}$$
with $\Phi^*$ the dual cone, is an analytic function on Φ, with $\Delta_\Phi(x) \in\ ]0, +\infty[$, called the Koszul–Vinberg characteristic function of the cone Φ, with the properties:
• The Bergman kernel of $\Phi + iR^{n+1}$ is written as $K_\Phi(\mathrm{Re}(z))$ up to a constant, where $K_\Phi$ is defined by the integral:

$$K_\Phi(x) = \int_{\Phi^*} e^{-\langle\Gamma, x\rangle}\,\Delta_{\Phi^*}(\Gamma)^{-1}\,d\Gamma \tag{7.8}$$

• $\Delta_\Phi$ is an analytic function defined on the interior of Φ and $\Delta_\Phi(x) \to +\infty$ as $x \to \partial\Phi$

If $g \in \mathrm{Aut}(\Phi)$ then $\Delta_\Phi(gx) = |\det g|^{-1}\Delta_\Phi(x)$, and since $tI \in G = \mathrm{Aut}(\Phi)$ for any $t > 0$, we have

$$\Delta_\Phi(tx) = \Delta_\Phi(x)/t^{n} \tag{7.9}$$

• $\Delta_\Phi$ is logarithmically strictly convex, and $\varphi_\Phi(x) = \log(\Delta_\Phi(x))$ is strictly convex
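A simple case where (7.7) and (7.9) can be checked by hand is the positive orthant of Rⁿ, which is self-dual and for which the integral factorizes into Δ(x) = 1/(x₁···xₙ). The sketch below is an illustration under that assumption; the choice of cone, the midpoint quadrature, the truncation of the integral at 80 and the function name are all choices made for the example.

```python
import numpy as np

def char_function_orthant(x, upper=80.0, n_points=80_000):
    """Midpoint-rule value of (7.7) for the positive orthant (self-dual cone).

    The integrand factorizes, so the n-dimensional integral is a product
    of one-dimensional integrals of exp(-x_i * g) over g > 0, truncated
    at `upper` (the neglected tail is negligible for the x used below).
    """
    step = upper / n_points
    g = (np.arange(n_points) + 0.5) * step
    one_dim = [step * np.sum(np.exp(-xi * g)) for xi in x]
    return float(np.prod(one_dim))

x = np.array([0.7, 1.5, 2.0])     # a point of the cone (illustrative)
n = x.size
t = 3.0

delta_x = char_function_orthant(x)
delta_tx = char_function_orthant(t * x)
closed_form = 1.0 / np.prod(x)    # exact value for the orthant

print(delta_x, closed_form, delta_tx * t**n)
```

The last printed value illustrates the homogeneity (7.9): multiplying Δ(tx) by tⁿ recovers Δ(x) up to quadrature error.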
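Koszul 1-form α: The differential 1-form

$$\alpha = d\varphi_\Phi = d\log\Delta_\Phi = d\Delta_\Phi/\Delta_\Phi \tag{7.10}$$

is invariant under all automorphisms $G = \mathrm{Aut}(\Phi)$ of Φ. If $x \in \Phi$ and $u \in E$ then

$$\langle\alpha_x, u\rangle = -\frac{1}{\Delta_\Phi(x)}\int_{\Phi^*}\langle\Gamma, u\rangle\, e^{-\langle\Gamma, x\rangle}\,d\Gamma \quad\text{and}\quad \alpha_x \in -\Phi^* \tag{7.11}$$

Koszul 2-form β: The symmetric differential 2-form

$$\beta = D\alpha = d^2\log\Delta_\Phi \tag{7.12}$$

is a positive definite symmetric bilinear form on E invariant under $G = \mathrm{Aut}(\Phi)$.

$$D\alpha > 0 \quad\left(\text{Schwarz inequality and } d^2\Delta_\Phi(u, v) = \int_{\Phi^*}\langle\Gamma, u\rangle\langle\Gamma, v\rangle\, e^{-\langle\Gamma, x\rangle}\,d\Gamma\right) \tag{7.13}$$

Koszul–Vinberg Metric: Dα defines a Riemannian structure invariant under Aut(Φ), and the Riemannian metric is then given by

$$g = d^2\log\Delta_\Phi \tag{7.14}$$

$$d^2\log\Delta(x) = d^2\log\int\Delta_u\,du = \frac{\int\Delta_u\, d^2\log\Delta_u\,du}{\int\Delta_u\,du} + \frac{1}{2}\,\frac{\iint\Delta_u\Delta_v\left(d\log\Delta_u - d\log\Delta_v\right)^2 du\,dv}{\iint\Delta_u\Delta_v\,du\,dv}$$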

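A diffeomorphism is used to define the dual coordinate:

$$x^* = -\alpha_x = -d\log\Delta_\Phi(x) \tag{7.15}$$

with $\langle df(x), u\rangle = D_u f(x) = \left.\frac{d}{dt}\right|_{t=0} f(x + tu)$. When the cone Φ is symmetric, the map $x \mapsto x^* = -\alpha_x$ is a bijection and an isometry with a unique fixed point (the manifold is a Riemannian Symmetric Space given by this isometry): $(x^*)^* = x$, $\langle x, x^*\rangle = n$ and $\Delta_\Phi(x)\Delta_{\Phi^*}(x^*) = \text{const}$. $x^*$ is characterized by $x^* = \arg\min\{\Delta(y) \,/\, y \in \Phi^*, \langle x, y\rangle = n\}$ and $x^*$ is the center of gravity of the cross section $\{y \in \Phi^*, \langle x, y\rangle = n\}$ of $\Phi^*$:

$$x^* = \int_{\Phi^*}\Gamma\cdot e^{-\langle\Gamma, x\rangle}\,d\Gamma \Big/ \int_{\Phi^*} e^{-\langle\Gamma, x\rangle}\,d\Gamma,$$

$$\langle -x^*, h\rangle = d_h\log\Delta_\Phi(x) = -\int_{\Phi^*}\langle\Gamma, h\rangle\, e^{-\langle\Gamma, x\rangle}\,d\Gamma \Big/ \int_{\Phi^*} e^{-\langle\Gamma, x\rangle}\,d\Gamma \tag{7.16}$$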

7.4.2 Koszul Entropy and Its Barycenter

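From this last equation, we can deduce the “Koszul Entropy”, defined as the Legendre transform of Ω(x), minus the logarithm of the Koszul–Vinberg characteristic function:

$$\Omega^*(x^*) = \langle x, x^*\rangle - \Omega(x) \ \text{ with } x^* = D_x\Omega \text{ and } x = D_{x^*}\Omega^*, \text{ where } \Omega(x) = -\log\Delta_\Phi(x) \tag{7.17}$$

$$\Omega^*(x^*) = \left\langle (D_x\Omega)^{-1}(x^*), x^*\right\rangle - \Omega\left[(D_x\Omega)^{-1}(x^*)\right] \quad \forall x^* \in \{D_x\Omega(x) \,/\, x \in \Phi\} \tag{7.18}$$

By (7.11), and using that $-\langle\Gamma, x\rangle = \log e^{-\langle\Gamma, x\rangle}$, we can write:

$$-\langle x^*, x\rangle = \int_{\Phi^*}\log e^{-\langle\Gamma, x\rangle}\cdot e^{-\langle\Gamma, x\rangle}\,d\Gamma \Big/ \int_{\Phi^*} e^{-\langle\Gamma, x\rangle}\,d\Gamma \tag{7.19}$$

and

$$\Omega^*(x^*) = \langle x, x^*\rangle - \Omega(x) = -\int_{\Phi^*}\log e^{-\langle\Gamma, x\rangle}\cdot e^{-\langle\Gamma, x\rangle}\,d\Gamma \Big/ \int_{\Phi^*} e^{-\langle\Gamma, x\rangle}\,d\Gamma + \log\int_{\Phi^*} e^{-\langle\Gamma, x\rangle}\,d\Gamma$$

$$\Omega^*(x^*) = \left[\left(\int_{\Phi^*} e^{-\langle\Gamma, x\rangle}\,d\Gamma\right)\cdot\log\int_{\Phi^*} e^{-\langle\Gamma, x\rangle}\,d\Gamma - \int_{\Phi^*}\log e^{-\langle\Gamma, x\rangle}\cdot e^{-\langle\Gamma, x\rangle}\,d\Gamma\right] \Big/ \int_{\Phi^*} e^{-\langle\Gamma, x\rangle}\,d\Gamma \tag{7.20}$$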

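We can then consider this Legendre transform as an Entropy, named Koszul Entropy, if we rewrite equation (7.20) in the new form

$$\Omega^* = -\int_{\Phi^*} p_x(\Gamma)\log p_x(\Gamma)\,d\Gamma \tag{7.21}$$

with

$$p_x(\Gamma) = e^{-\langle\Gamma, x\rangle} \Big/ \int_{\Phi^*} e^{-\langle\Gamma, x\rangle}\,d\Gamma = e^{-\langle x, \Gamma\rangle - \log\int_{\Phi^*} e^{-\langle\Gamma, x\rangle}d\Gamma} = e^{-\langle x, \Gamma\rangle + \Omega(x)} \quad\text{and}\quad x^* = \int_{\Phi^*}\Gamma\cdot p_x(\Gamma)\,d\Gamma \tag{7.22}$$

We will call $p_x(\Gamma) = e^{-\langle\Gamma, x\rangle} \big/ \int_{\Phi^*} e^{-\langle\Gamma, x\rangle}\,d\Gamma$ the Koszul Density, with the property that:

$$\log p_x(\Gamma) = -\langle x, \Gamma\rangle - \log\int_{\Phi^*} e^{-\langle\Gamma, x\rangle}\,d\Gamma = -\langle x, \Gamma\rangle + \Omega(x) \tag{7.23}$$

and

$$E_\Gamma\left[-\log p_x(\Gamma)\right] = \langle x, x^*\rangle - \Omega(x) \tag{7.24}$$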
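The identities (7.22) and (7.24) can be illustrated on the positive orthant, where the Koszul density factorizes into exponential laws with rates xᵢ. The following sketch is a numerical check under that assumed cone; the grid, the sample point x and the variable names are choices made for the example.

```python
import numpy as np

x = np.array([0.7, 1.5, 2.0])     # a point of the positive orthant (illustrative)
n = x.size

step, n_points = 1e-3, 80_000     # midpoint grid on (0, 80)
g = (np.arange(n_points) + 0.5) * step

# Omega(x) = -log Delta(x), with Delta(x) = 1 / prod(x) for the orthant
omega = np.sum(np.log(x))

x_star = np.empty(n)
entropy = 0.0
for i, xi in enumerate(x):
    p = xi * np.exp(-xi * g)                    # i-th factor of the Koszul density
    x_star[i] = step * np.sum(g * p)            # barycentre, as in (7.22)
    entropy += -step * np.sum(p * np.log(p))    # -int p log p, additive over factors

legendre_value = x @ x_star - omega             # <x, x*> - Omega(x), as in (7.24)
print(x_star, entropy, legendre_value)
```

In this example the barycentre is x* = (1/x₁, …, 1/xₙ), so that ⟨x, x*⟩ = n, and the entropy of the density agrees with the Legendre value up to quadrature error.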

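We can observe that:

$$\Omega(x) = -\log\int_{\Phi^*} e^{-\langle\Gamma, x\rangle}\,d\Gamma = -\log\int_{\Phi^*} e^{-[\Omega^*(\Gamma) + \Omega(x)]}\,d\Gamma = \Omega(x) - \log\int_{\Phi^*} e^{-\Omega^*(\Gamma)}\,d\Gamma \ \Rightarrow\ \int_{\Phi^*} e^{-\Omega^*(\Gamma)}\,d\Gamma = 1 \tag{7.25}$$

To make $x^*$ appear in $\Omega^*(x^*)$, we have to write:

$$\log p_x(\Gamma) = \log e^{-\langle x, \Gamma\rangle + \Omega(x)} = \log e^{-\Omega^*(\Gamma)} = -\Omega^*(\Gamma)$$

$$\Rightarrow\ \Omega^* = -\int_{\Phi^*} p_x(\Gamma)\log p_x(\Gamma)\,d\Gamma = \int_{\Phi^*}\Omega^*(\Gamma)\,p_x(\Gamma)\,d\Gamma = \Omega^*(x^*) \tag{7.26}$$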
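The last equality is true if and only if:

$$\int_{\Phi^*}\Omega^*(\Gamma)\,p_x(\Gamma)\,d\Gamma = \Omega^*\left(\int_{\Phi^*}\Gamma\cdot p_x(\Gamma)\,d\Gamma\right) \quad\text{as}\quad x^* = \int_{\Phi^*}\Gamma\cdot p_x(\Gamma)\,d\Gamma \tag{7.27}$$

This condition could be written:

$$E\left[\Omega^*(\Gamma)\right] = \Omega^*\left(E[\Gamma]\right), \quad \Gamma \in \Phi^* \tag{7.28}$$

The meaning of this relation is that “Barycenter of Koszul Entropy is Koszul Entropy of Barycenter”.
This condition is achieved for $x^* = D_x\Omega$, taking into account the Legendre transform property:

$$\text{Legendre transform: } \Omega^*(x^*) = \sup_x\left[\langle x, x^*\rangle - \Omega(x)\right]$$

$$\Rightarrow\ \Omega^*(x^*) \geq \langle x, x^*\rangle - \Omega(x) \ \Rightarrow\ \Omega^*(x^*) \geq \int_{\Phi^*}\Omega^*(\Gamma)\,p_x(\Gamma)\,d\Gamma$$

$$\Rightarrow\ \Omega^*(x^*) \geq E\left[\Omega^*(\Gamma)\right], \quad\text{with equality for } x^* = \frac{d\Omega}{dx} \tag{7.29}$$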

7.4.3 Relation with Maximum Entropy Principle

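Classically, the density given by the Maximum Entropy Principle [188–190] is given by:

$$\max_{p_x(\cdot)}\left[-\int_{\Phi^*} p_x(\Gamma)\log p_x(\Gamma)\,d\Gamma\right] \ \text{such that}\ \begin{cases}\displaystyle\int_{\Phi^*} p_x(\Gamma)\,d\Gamma = 1 \\ \displaystyle\int_{\Phi^*}\Gamma\cdot p_x(\Gamma)\,d\Gamma = x^*\end{cases} \tag{7.30}$$

If we take $q_x(\Gamma) = e^{-\langle\Gamma, x\rangle} \big/ \int_{\Phi^*} e^{-\langle\Gamma, x\rangle}\,d\Gamma = e^{-\langle x, \Gamma\rangle - \log\int_{\Phi^*} e^{-\langle\Gamma, x\rangle}d\Gamma}$ such that:

$$\begin{cases}\displaystyle\int_{\Phi^*} q_x(\Gamma)\,d\Gamma = \int_{\Phi^*} e^{-\langle\Gamma, x\rangle}\,d\Gamma \Big/ \int_{\Phi^*} e^{-\langle\Gamma, x\rangle}\,d\Gamma = 1 \\ \displaystyle\log q_x(\Gamma) = \log e^{-\langle x, \Gamma\rangle - \log\int_{\Phi^*} e^{-\langle\Gamma, x\rangle}d\Gamma} = -\langle x, \Gamma\rangle - \log\int_{\Phi^*} e^{-\langle x, \Gamma\rangle}\,d\Gamma\end{cases} \tag{7.31}$$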
 
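Then by using the fact that $\log x \geq \left(1 - x^{-1}\right)$, with equality if and only if $x = 1$, we find the following:

$$-\int_{\Phi^*} p_x(\Gamma)\log\frac{p_x(\Gamma)}{q_x(\Gamma)}\,d\Gamma \leq -\int_{\Phi^*} p_x(\Gamma)\left(1 - \frac{q_x(\Gamma)}{p_x(\Gamma)}\right)d\Gamma \tag{7.32}$$

We can then observe that:

$$\int_{\Phi^*} p_x(\Gamma)\left(1 - \frac{q_x(\Gamma)}{p_x(\Gamma)}\right)d\Gamma = \int_{\Phi^*} p_x(\Gamma)\,d\Gamma - \int_{\Phi^*} q_x(\Gamma)\,d\Gamma = 0 \tag{7.33}$$

because $\int_{\Phi^*} p_x(\Gamma)\,d\Gamma = \int_{\Phi^*} q_x(\Gamma)\,d\Gamma = 1$.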
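We can then deduce that:

$$-\int_{\Phi^*} p_x(\Gamma)\log\frac{p_x(\Gamma)}{q_x(\Gamma)}\,d\Gamma \leq 0 \ \Rightarrow\ -\int_{\Phi^*} p_x(\Gamma)\log p_x(\Gamma)\,d\Gamma \leq -\int_{\Phi^*} p_x(\Gamma)\log q_x(\Gamma)\,d\Gamma \tag{7.34}$$

If we develop the last inequality, using the expression of $q_x(\Gamma)$:

$$-\int_{\Phi^*} p_x(\Gamma)\log p_x(\Gamma)\,d\Gamma \leq -\int_{\Phi^*} p_x(\Gamma)\left[-\langle x, \Gamma\rangle - \log\int_{\Phi^*} e^{-\langle x, \Gamma\rangle}\,d\Gamma\right]d\Gamma \tag{7.35}$$

$$-\int_{\Phi^*} p_x(\Gamma)\log p_x(\Gamma)\,d\Gamma \leq \left\langle x, \int_{\Phi^*}\Gamma\cdot p_x(\Gamma)\,d\Gamma\right\rangle + \log\int_{\Phi^*} e^{-\langle x, \Gamma\rangle}\,d\Gamma \tag{7.36}$$

If we take $x^* = \int_{\Phi^*}\Gamma\cdot p_x(\Gamma)\,d\Gamma$ and $\Omega(x) = -\log\int_{\Phi^*} e^{-\langle x, \Gamma\rangle}\,d\Gamma$, then we deduce that the Koszul density $q_x(\Gamma) = e^{-\langle\Gamma, x\rangle} \big/ \int_{\Phi^*} e^{-\langle\Gamma, x\rangle}\,d\Gamma = e^{-\langle x, \Gamma\rangle - \log\int_{\Phi^*} e^{-\langle\Gamma, x\rangle}d\Gamma}$ is the Maximum Entropy solution constrained by $\int_{\Phi^*} p_x(\Gamma)\,d\Gamma = 1$ and $\int_{\Phi^*}\Gamma\cdot p_x(\Gamma)\,d\Gamma = x^*$:

$$-\int_{\Phi^*} p_x(\Gamma)\log p_x(\Gamma)\,d\Gamma \leq \langle x, x^*\rangle - \Omega(x) \tag{7.37}$$

$$-\int_{\Phi^*} p_x(\Gamma)\log p_x(\Gamma)\,d\Gamma \leq \Omega^*(x^*) \tag{7.38}$$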
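The bound (7.38) can be illustrated on the half-line (the one-dimensional cone), where the Koszul density is the exponential law and Ω*(x*) = 1 + log x*. The sketch below compares its entropy with two other densities having the same mean; the Gamma and half-normal competitors, the value of the mean and the grid are assumptions chosen only for the comparison.

```python
import numpy as np

mean = 2.0                                    # prescribed first moment x* (illustrative)
step, n_points = 1e-3, 200_000                # midpoint grid on (0, 200)
g = (np.arange(n_points) + 0.5) * step

def entropy(p):
    """-int p log p on the grid, skipping points where p underflows to zero."""
    mask = p > 0
    return -step * np.sum(p[mask] * np.log(p[mask]))

# Koszul density on the half-line: exponential law with x = 1 / x*
koszul = np.exp(-g / mean) / mean
# Two competitors with the same mean (chosen only for comparison)
gamma2 = (4.0 * g / mean**2) * np.exp(-2.0 * g / mean)      # Gamma law, shape 2
sigma = mean * np.sqrt(np.pi / 2.0)
half_normal = np.sqrt(2.0 / np.pi) / sigma * np.exp(-g**2 / (2.0 * sigma**2))

densities = {"koszul": koszul, "gamma2": gamma2, "half_normal": half_normal}
means = {k: step * np.sum(g * p) for k, p in densities.items()}
entropies = {k: entropy(p) for k, p in densities.items()}
bound = 1.0 + np.log(mean)                    # <x, x*> - Omega(x) for n = 1

print(means, entropies, bound)
```

All three densities satisfy the two constraints of (7.30), but only the Koszul density reaches the bound; the other two stay strictly below it.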

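We have then proved that the Koszul Entropy provides the density of Maximum Entropy:

$$p_{\bar{\Gamma}}(\Gamma) = \frac{e^{-\langle\Gamma, \Psi^{-1}(\bar{\Gamma})\rangle}}{\int_{\Phi^*} e^{-\langle\Gamma, \Psi^{-1}(\bar{\Gamma})\rangle}\,d\Gamma} \quad\text{with}\quad x = \Psi^{-1}(\bar{\Gamma}) \ \text{ and }\ \bar{\Gamma} = \Psi(x) = \frac{d\Omega(x)}{dx} \tag{7.39}$$

where

$$\bar{\Gamma} = \int_{\Phi^*}\Gamma\cdot p_{\bar{\Gamma}}(\Gamma)\,d\Gamma \quad\text{and}\quad \Omega(x) = -\log\int_{\Phi^*} e^{-\langle x, \Gamma\rangle}\,d\Gamma \tag{7.40}$$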

We can then deduce the Maximum Entropy solution without solving the classical variational problem with Lagrangian hyperparameters, but only by inverting the function Ψ(x). This remark was made by Jean-Marie Souriau in the paper [97]: if we take a vector with tensor components $\Gamma = \begin{pmatrix} z \\ z\otimes z\end{pmatrix}$, the components of $\bar{\Gamma}$ will provide the moments of first and second order of the probability density $p_{\bar{\Gamma}}(\Gamma)$, which is then a Gaussian law. In this particular case, we can write:

$$\langle\Gamma, x\rangle = a^{T}z + \frac{1}{2}z^{T}Hz \tag{7.41}$$

with $a \in R^n$ and $H \in \mathrm{Sym}(n)$. By the change of variable given by $z' = H^{1/2}z + H^{-1/2}a$, we can then compute Ω(x), minus the logarithm of the Koszul characteristic function:

$$\Omega(x) = -\frac{1}{2}\left[a^{T}H^{-1}a + \log\det\left[H^{-1}\right] + n\log(2\pi)\right] \tag{7.42}$$

We can prove that the first moment is equal to $-H^{-1}a$ and that the components of the variance tensor are equal to the elements of the matrix $H^{-1}$, which induces the second moment. The Koszul Entropy, its Legendre transform, is then given by:

$$\Omega^*(\bar{\Gamma}) = \frac{1}{2}\left[\log\det\left[H^{-1}\right] + n\log(2\pi\cdot e)\right] \tag{7.43}$$
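For n = 1 the formulas (7.42) and (7.43), together with the stated first moment and variance, can be checked by direct quadrature. The values of a and h below, the integration range and the midpoint rule are assumptions made for this sketch.

```python
import numpy as np

a, h = 0.8, 2.5                       # illustrative values, h > 0 (case n = 1)
step = 1e-3
z = np.arange(-40.0, 40.0, step) + 0.5 * step          # midpoint grid

weight = np.exp(-(a * z + 0.5 * h * z**2))             # exp(-<Gamma, x>), see (7.41)
delta = step * np.sum(weight)                          # characteristic function
omega_numeric = -np.log(delta)
omega_closed = -0.5 * (a**2 / h + np.log(1.0 / h) + np.log(2.0 * np.pi))    # (7.42)

p = weight / delta                                     # Koszul density, here Gaussian
mask = p > 0                                           # avoid log(0) in the far tails
first_moment = step * np.sum(z * p)
variance = step * np.sum((z - first_moment) ** 2 * p)
entropy_numeric = -step * np.sum(p[mask] * np.log(p[mask]))
entropy_closed = 0.5 * (np.log(1.0 / h) + np.log(2.0 * np.pi * np.e))       # (7.43)

print(omega_numeric, omega_closed, first_moment, variance,
      entropy_numeric, entropy_closed)
```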

7.4.4 Crouzeix Relation and Its Consequences

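The duality between dual potential functions is recovered by this relation:

$$\Omega^*(x^*) + \Omega(x) = \langle x, x^*\rangle \ \text{ with } x^* = \frac{d\Omega}{dx} \text{ and } x = \frac{d\Omega^*}{dx^*}, \text{ where } \Omega(x) = -\log\Delta_\Phi(x)$$

$$\begin{cases}\dfrac{d\Omega}{dx} = x^* \\[2mm] \dfrac{d\Omega^*}{dx^*} = x\end{cases} \Rightarrow \begin{cases}\dfrac{d^2\Omega}{dx^2} = \dfrac{dx^*}{dx} \\[2mm] \dfrac{d^2\Omega^*}{dx^{*2}} = \dfrac{dx}{dx^*}\end{cases} \Rightarrow \frac{d^2\Omega}{dx^2}\cdot\frac{d^2\Omega^*}{dx^{*2}} = 1 \ \Rightarrow\ \frac{d^2\Omega}{dx^2} = \left[\frac{d^2\Omega^*}{dx^{*2}}\right]^{-1}$$

$$\Rightarrow\ ds^2 = -\frac{d^2\Omega}{dx^2}\,dx^2 = -\left[\frac{d^2\Omega^*}{dx^{*2}}\right]^{-1}\cdot\left[\frac{d^2\Omega^*}{dx^{*2}}\right]^{2}\cdot dx^{*2} = -\frac{d^2\Omega^*}{dx^{*2}}\cdot dx^{*2} \tag{7.44}$$

The relation $\frac{d^2\Omega}{dx^2} = \left[\frac{d^2\Omega^*}{dx^{*2}}\right]^{-1}$ was established by J.P. Crouzeix in 1977 in a short communication [176] for convex smooth functions and their Legendre transforms. This result has been extended to non-smooth functions by Seeger [177] and Hiriart-Urruty [178], using the polarity relationship between the second-order subdifferentials. This relation was mentioned in texts of variational calculus and theory of elastic materials (with work potentials) [178].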
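The Crouzeix relation can be illustrated on any smooth convex function with a known Legendre transform. The sketch below uses f(x) = eˣ, whose transform is f*(y) = y log y − y for y > 0, and compares second derivatives obtained by central differences; the test function, step size and sample points are assumptions made for the example and are unrelated to the cone setting of the text.

```python
import numpy as np

def second_derivative(f, t, eps=1e-3):
    """Central finite-difference approximation of f''(t)."""
    return (f(t + eps) - 2.0 * f(t) + f(t - eps)) / eps**2

f = np.exp                                   # convex potential (illustrative)

def f_star(y):
    """Legendre transform of exp, valid for y > 0."""
    return y * np.log(y) - y

x = np.linspace(-1.0, 1.5, 6)
x_star = np.exp(x)                           # dual coordinate x* = f'(x)

hess_primal = second_derivative(f, x)        # d^2 f / dx^2 at x
hess_dual = second_derivative(f_star, x_star)  # d^2 f* / dx*^2 at x*
product = hess_primal * hess_dual            # should be close to 1 everywhere

print(product)
```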
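This last relation has also been used in the framework of the Monge–Ampère measure associated to a convex function, to prove equality with the Lebesgue measure λ:

$$m_\Omega(\Lambda) = \int_\Lambda \varphi(x)\,dx = \lambda\left(\{\nabla\Omega(x) \,/\, x \in \Lambda\}\right) \tag{7.45}$$

$$\forall\Lambda \in B_\Phi \text{ (Borel set in } \Phi) \text{ and } \varphi(x) = \det\left[\nabla^2\Omega(x)\right]$$

That is proved using the Crouzeix relation $\nabla^2\Omega(x) = \nabla^2\Omega\left(\nabla\Omega^*(y)\right) = \left[\nabla^2\Omega^*(y)\right]^{-1}$:

$$m_\Omega(\Lambda) = \int_\Lambda \varphi(x)\,dx = \int_\Lambda \det\left[\nabla^2\Omega(x)\right]\cdot dx$$

$$m_\Omega(\Lambda) = \int_{(\nabla\Omega^*)^{-1}(\Lambda)} \det\left[\nabla^2\Omega\left(\nabla\Omega^*(y)\right)\right]\cdot\det\left[\nabla^2\Omega^*(y)\right]dy = \int_{\nabla\Omega(\Lambda)} 1\cdot dy = \lambda\left(\{\nabla\Omega(x) \,/\, x \in \Lambda\}\right) \tag{7.46}$$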

7.4.5 Koszul Metric and Fisher Metric

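To make the link with the Fisher metric given by the Fisher Information matrix I(x), we can observe that the second derivative of $\log p_x(\Gamma)$ is given by:

$$p_x(\Gamma) = e^{-\langle\Gamma, x\rangle} \Big/ \int_{\Phi^*} e^{-\langle\Gamma, x\rangle}\,d\Gamma = e^{-\langle x, \Gamma\rangle - \log\int_{\Phi^*} e^{-\langle\Gamma, x\rangle}d\Gamma}$$

$$\Rightarrow\ \log p_x(\Gamma) = -\langle x, \Gamma\rangle - \log\int_{\Phi^*} e^{-\langle\Gamma, x\rangle}\,d\Gamma \tag{7.47}$$

with $\Omega(x) = -\log\int_{\Phi^*} e^{-\langle\Gamma, x\rangle}\,d\Gamma = -\log\Delta_\Phi(x)$

$$\frac{\partial^2\log p_x(\Gamma)}{\partial x^2} = \frac{\partial^2\Omega(x)}{\partial x^2} \ \Rightarrow\ I(x) = -E_\Gamma\left[\frac{\partial^2\log p_x(\Gamma)}{\partial x^2}\right] = -\frac{\partial^2\Omega(x)}{\partial x^2} = \frac{\partial^2\log\Delta_\Phi(x)}{\partial x^2} \tag{7.48}$$

$$I(x) = -E_\Gamma\left[\frac{\partial^2\log p_x(\Gamma)}{\partial x^2}\right] = \frac{\partial^2\log\Delta_\Phi(x)}{\partial x^2} \tag{7.49}$$

We can then deduce the close interrelation between the Fisher metric and the Hessian of the logarithm of the Koszul–Vinberg characteristic function, which are totally equivalent.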
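The equality (7.49) can be illustrated numerically on the half-line, where Δ(x) = 1/x. The sketch below is an assumption-laden example: it computes the Hessian of log Δ by finite differences and, rather than differentiating log pₓ twice, uses the equivalent standard form of the Fisher information as the expected squared score; the grid, the point x and the step eps are choices made for the example.

```python
import numpy as np

step, n_points = 1e-3, 100_000
g = (np.arange(n_points) + 0.5) * step            # midpoint grid on (0, 100)

def log_delta(x):
    """log of the characteristic function (7.7) for the half-line cone."""
    return np.log(step * np.sum(np.exp(-x * g)))

x, eps = 1.3, 1e-3
hessian_log_delta = (log_delta(x + eps) - 2.0 * log_delta(x)
                     + log_delta(x - eps)) / eps**2

# Fisher information as the expected squared score d/dx log p_x(Gamma)
p = np.exp(-x * g - log_delta(x))                 # Koszul density, see (7.47)
score = -g - (log_delta(x + eps) - log_delta(x - eps)) / (2.0 * eps)
fisher = step * np.sum(score**2 * p)

print(hessian_log_delta, fisher, 1.0 / x**2)
```

Both quantities agree with the closed form 1/x² for this cone, up to discretization error.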

7.4.6 Extended Results by Koszul, Vey and Sasaki

Koszul [1] and Vey [2, 160] have developed these results with the following theorem for connected Hessian manifolds:
Koszul–Vey Theorem: Let M be a connected Hessian manifold with Hessian metric g. Suppose that M admits a closed 1-form α such that Dα = g and that there exists a group G of affine automorphisms of M preserving α:
• If M/G is quasi-compact, then the universal covering manifold of M is affinely isomorphic to a convex domain Φ of a real affine space not containing any full straight line.
• If M/G is compact, then Φ is a sharp convex cone.
On this basis, Jean-Louis Koszul has given a Lie group construction of homogeneous cones that has been developed and applied in Information Geometry by Hirohiko Shima [163, 164] in the framework of Hessian geometry.
Sasaki has developed the study of Hessian manifolds in Affine Geometry [49, 50]. He denoted by $S_c$ the level surface of $\Delta_\Phi$: $S_c = \{\Delta_\Phi(x) = c\}$, which is a non-compact submanifold in Φ, and by $\omega_c$ the metric induced by $d^2\log\Delta_\Phi$ on $S_c$. Assuming that the cone Φ is homogeneous under G(Φ), he proved that $S_c$ is a homogeneous hyperbolic affine hypersphere and that every such hypersphere can be obtained in this way. Sasaki also remarks that $\omega_c$ is identified with the affine metric and that $S_c$ is a global Riemannian symmetric space when Φ is a self-dual cone. He concludes that, if Φ is a regular convex cone and $g = d^2\log\Delta_\Phi$ is the canonical Hessian metric, then each level surface of the characteristic function $\Delta_\Phi$ is a minimal surface of the Riemannian manifold (Φ, g).

7.4.7 Geodesics for Koszul Hessian Metric

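The last contributor is Rothaus [147], who studied the construction of geodesics for this Hessian metric geometry, using the following property:

$$\Gamma^{i}_{jk} = \frac{1}{2}g^{il}\left(\frac{\partial g_{kl}}{\partial x_j} + \frac{\partial g_{jl}}{\partial x_k} - \frac{\partial g_{jk}}{\partial x_l}\right) = \frac{1}{2}g^{il}\frac{\partial^3\log\Delta_\Phi(x)}{\partial x_j\,\partial x_k\,\partial x_l} \quad\text{with}\quad g_{ij} = \frac{\partial^2\log\Delta_\Phi(x)}{\partial x_i\,\partial x_j} \tag{7.50}$$

or, expressed with the Christoffel symbol of the first kind:

$$[i, jk] = \frac{1}{2}\left(\frac{\partial g_{ki}}{\partial x_j} + \frac{\partial g_{ji}}{\partial x_k} - \frac{\partial g_{jk}}{\partial x_i}\right) = \frac{1}{2}\frac{\partial^3\log\Delta_\Phi(x)}{\partial x_j\,\partial x_k\,\partial x_i} \tag{7.51}$$

Then the geodesic is given by:

$$\frac{d^2x_k}{ds^2} + \Gamma^{k}_{ij}\frac{dx_i}{ds}\frac{dx_j}{ds} = 0, \quad\text{i.e.}\quad g_{lk}\frac{d^2x_k}{ds^2} + [l, ij]\frac{dx_i}{ds}\frac{dx_j}{ds} = 0 \tag{7.52}$$

which can be developed with the previous relation:

$$\frac{d^2x_k}{ds^2}\frac{\partial^2\log\Delta_\Phi}{\partial x_k\,\partial x_l} + \frac{1}{2}\frac{dx_i}{ds}\frac{dx_j}{ds}\frac{\partial^3\log\Delta_\Phi}{\partial x_l\,\partial x_i\,\partial x_j} = 0 \tag{7.53}$$

We can then observe that:

$$\frac{d^2}{ds^2}\left[\frac{\partial\log\Delta_\Phi}{\partial x_l}\right] = \frac{dx_i}{ds}\frac{dx_j}{ds}\frac{\partial^3\log\Delta_\Phi}{\partial x_i\,\partial x_j\,\partial x_l} + \frac{d^2x_k}{ds^2}\frac{\partial^2\log\Delta_\Phi}{\partial x_k\,\partial x_l} \tag{7.54}$$

The geodesic equation can then be rewritten:

$$\frac{d^2x_k}{ds^2}\frac{\partial^2\log\Delta_\Phi}{\partial x_k\,\partial x_l} + \frac{d^2}{ds^2}\left[\frac{\partial\log\Delta_\Phi}{\partial x_l}\right] = 0 \tag{7.55}$$

which we can put in vector form using the notations $x^* = -d\log\Delta_\Phi$ and the Fisher matrix $I(x) = d^2\log\Delta_\Phi$:

$$I(x)\frac{d^2x}{ds^2} - \frac{d^2x^*}{ds^2} = 0 \tag{7.56}$$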

7.4.8 Koszul Forms and Metric for Symmetric Positive Definite Matrices

Using Koszul results of Sect. 7.4.2, let v be the volume element of g. We define a closed 1-form α and a symmetric bilinear form β by: $D_X v = \alpha(X)v$ and $\beta = D\alpha$. The forms α and β, called respectively the first Koszul form and the second Koszul form for a Hessian structure (D; g), are given by:

$$v = \left(\det\left[g_{ij}\right]\right)^{1/2}dx^1\wedge\cdots\wedge dx^n \ \Rightarrow\ \alpha_i = \frac{\partial}{\partial x^i}\log\left(\det\left[g_{ij}\right]\right)^{1/2} \quad\text{and}\quad \beta_{ij} = \frac{\partial\alpha_i}{\partial x^j} = \frac{1}{2}\frac{\partial^2\log\det\left[g_{kl}\right]}{\partial x^i\,\partial x^j}$$

The pair (D; g) of a flat connection D and a Hessian metric g defines the Hessian structure. As seen previously, Koszul studied flat manifolds endowed with a closed 1-form α such that Dα is positive definite, whereupon Dα is a Hessian metric. The Hessian structure (D; g) is said to be of Koszul type if there exists a closed 1-form α such that g = Dα. The second Koszul form β plays a role similar to that of the Ricci tensor for a Kählerian metric.
We can then apply this Koszul geometry framework to cones of Symmetric Positive Definite Matrices.
With the inner product $\langle x, y\rangle = \mathrm{Tr}(xy)$, $\forall x, y \in \mathrm{Sym}_n(R)$, the set Φ of symmetric positive definite matrices is an open convex cone and is self-dual: $\Phi^* = \Phi$.

$$\Delta_\Phi(x) = \int_{\Phi^*} e^{-\langle\Gamma, x\rangle}\,d\Gamma = (\det x)^{-\frac{n+1}{2}}\,\Delta(I_n), \quad\text{with } \langle x, y\rangle = \mathrm{Tr}(xy) \text{ and } \Phi^* = \Phi \text{ self-dual} \tag{7.57}$$

$$g = d^2\log\Delta_\Phi = -\frac{n+1}{2}\,d^2\log\det x \quad\text{and}\quad x^* = -d\log\Delta_\Phi = \frac{n+1}{2}\,d\log\det x = \frac{n+1}{2}\,x^{-1} \tag{7.58}$$

Sasaki's approach could also be used for the regular convex cone consisting of all positive definite symmetric matrices of degree n. (D, Dd log det x) is a Hessian structure on Φ, and each level surface of det x is a minimal surface of the Riemannian manifold (Φ, g = −Dd log det x).
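The expressions in (7.58) can be checked by finite differences on a random symmetric positive definite matrix. In the sketch below the matrix x, the symmetric direction u, the step eps and the function name `log_delta` are assumptions made for the example; the constant log Δ(Iₙ) of (7.57) is dropped because it does not affect derivatives.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3
A = rng.normal(size=(n, n))
x = A @ A.T + n * np.eye(n)          # a symmetric positive definite point
B = rng.normal(size=(n, n))
u = 0.5 * (B + B.T)                  # a symmetric direction

def log_delta(x):
    """log Delta(x) from (7.57), dropping the constant log Delta(I_n)."""
    return -0.5 * (n + 1) * np.linalg.slogdet(x)[1]

eps = 1e-4
# Directional derivatives of log Delta along u by central differences
d1 = (log_delta(x + eps * u) - log_delta(x - eps * u)) / (2.0 * eps)
d2 = (log_delta(x + eps * u) - 2.0 * log_delta(x)
      + log_delta(x - eps * u)) / eps**2

x_inv = np.linalg.inv(x)
x_star = 0.5 * (n + 1) * x_inv                      # dual coordinate of (7.58)
pairing = np.trace(x_star @ u)                      # <x*, u> = Tr(x* u)
metric = 0.5 * (n + 1) * np.trace(x_inv @ u @ x_inv @ u)   # g(u, u)

print(d1, -pairing, d2, metric)
```

The first derivative matches −⟨x*, u⟩ and the second derivative matches the quadratic form ((n+1)/2) Tr(x⁻¹u x⁻¹u), which is positive for any nonzero symmetric u.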
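Koszul [27] has introduced another 1-form definition for homogeneous bounded domains, given by:

$$\alpha = -\frac{1}{4}\,d\Psi(X) \quad\text{with}\quad \Psi(X) = \mathrm{Tr}_{g/b}\left[\mathrm{ad}(JX) - J\,\mathrm{ad}(X)\right] \quad \forall X \in g \tag{7.59}$$

We can illustrate this new Koszul expression for Poincaré's Upper Half Plane $V = \{z = x + iy \,/\, y > 0\}$ (the simplest symmetric homogeneous bounded domain). Let the vector fields be $X = y\frac{d}{dx}$ and $Y = y\frac{d}{dy}$, and J the tensor of complex structure of V defined by $JX = Y$. As $[X, Y] = -Y$ and $\mathrm{ad}(Y).Z = [Y, Z]$, then

$$\begin{cases}\mathrm{Tr}\left[\mathrm{ad}(JX) - J\,\mathrm{ad}(X)\right] = 2 \\ \mathrm{Tr}\left[\mathrm{ad}(JY) - J\,\mathrm{ad}(Y)\right] = 0\end{cases} \tag{7.60}$$

The Koszul 1-form and then the Koszul/Poincaré metric are given by:

$$\Psi(X) = 2\frac{dx}{y} \ \Rightarrow\ \alpha = -\frac{1}{4}\,d\Psi = -\frac{1}{2}\frac{dx\wedge dy}{y^2} \ \Rightarrow\ ds^2 = \frac{dx^2 + dy^2}{2y^2} \tag{7.61}$$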
This can also be applied to Siegel's Upper Half Plane $V = \{Z = X + iY \,/\, Y > 0\}$ (the natural extension of Poincaré's Upper Half Plane, and a general notion of symmetric bounded homogeneous domains studied by Elie Cartan and Carl Ludwig Siegel):

$$\begin{cases} SZ = (AZ + B)D^{-1} \\ A^{T}D = I,\ B^{T}D = D^{T}B\end{cases} \quad\text{with}\quad S = \begin{pmatrix} A & B \\ 0 & D\end{pmatrix} \ \text{and}\ J = \begin{pmatrix} 0 & I \\ -I & 0\end{pmatrix} \tag{7.62}$$

$$\Psi(dX + i\,dY) = \frac{3p+1}{2}\mathrm{Tr}\left(Y^{-1}dX\right) \ \Rightarrow\ \begin{cases}\alpha = -\dfrac{1}{4}\,d\Psi = \dfrac{3p+1}{8}\mathrm{Tr}\left(Y^{-1}dZ\wedge Y^{-1}d\bar{Z}\right) \\[2mm] ds^2 = \dfrac{3p+1}{8}\mathrm{Tr}\left(Y^{-1}dZ\,Y^{-1}d\bar{Z}\right)\end{cases} \tag{7.63}$$

To recover the metric of Symmetric Positive Definite (SPD) matrices, we take $Z = iR$ (with $X = 0$), which gives the metric $ds^2 = \mathrm{Tr}\left[\left(R^{-1}dR\right)^2\right]$. In the context of Information Geometry, this is the metric for the multivariate Gaussian law of covariance matrix R and zero mean. This metric will be studied in Sect. 7.7.

7.4.9 Koszul–Vinberg Characteristic Function as Universal Barrier in Convex Programming

Convex Optimization theory has developed the notion of Universal barrier, which can also be interpreted as the Koszul–Vinberg Characteristic function. The homogeneous cones studied by Elie Cartan, Carl Ludwig Siegel, Ernest Vinberg and Jean-Louis Koszul are very specific cones because of their invariance properties under linear transformations. They were classified and algebraically constructed by Vinberg using Siegel domains. Homogeneous cones [41, 42, 53–55] have also been studied for developing interior-point algorithms in optimization, through the notion of self-concordant barriers. Nesterov and Nemirovskii [145] have shown that any cone in $R^n$ admits a logarithmically homogeneous universal barrier function, defined as a volume integral. Güler and Tunçel [37, 38, 143, 148] have used a recursive scheme to construct a barrier through Siegel domains, referring to a more general construction of Nesterov and Nemirovskii. More recently, Olena Shevchenko [142] has defined a recursive formula for optimal dual barrier functions on homogeneous cones, based on the primal construction of Güler and Tunçel by means of the dual Siegel cone construction of Rothaus [144, 147].
Vinberg and Gindikin have proved that every homogeneous cone can be obtained by a Siegel construction. To illustrate the Siegel construction, we can consider a homogeneous cone K in $R^k$ and a K-bilinear symmetric form $B(u, u) \in K$. The Siegel cone is then defined by:

$$C_{Siegel}(K, B) = \left\{(x, u, t) \in R^k\times R^p\times R \,/\, t > 0,\ tx - B(u, u) \in K\right\} \tag{7.64}$$

and is homogeneous.
Rothaus has considered the dual Siegel domain construction by means of a symmetric linear mapping U(y) (positive definite for $y \in \mathrm{int}\,K^*$) defined by $\langle U(y)u, v\rangle = \langle B(u, v), y\rangle$, from which we can define the dual Siegel cone:

$$C^{*}_{Siegel}(K, B) = \left\{(y, v, s) \in K^*\times R^p\times R \,/\, s > \left\langle U(y)^{-1}v, v\right\rangle\right\} \tag{7.65}$$

with respect to the inner product $\langle(x, u, t), (y, v, s)\rangle = \langle x, y\rangle + 2\langle u, v\rangle + st$. The condition $s > \left\langle U(y)^{-1}v, v\right\rangle$ is equivalent, by means of the Schur complement technique, to:

$$\begin{bmatrix} s & v^{T} \\ v & U(y)\end{bmatrix} > 0 \tag{7.66}$$
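The Schur complement equivalence behind (7.66) can be illustrated numerically. In the sketch below, a random positive definite matrix U stands in for U(y) and a random vector for v; both, as well as the helper name and the margin 1e-3, are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
p = 4
A = rng.normal(size=(p, p))
U = A @ A.T + np.eye(p)                       # stands in for U(y), positive definite
v = rng.normal(size=p)
threshold = v @ np.linalg.solve(U, v)         # <U(y)^{-1} v, v>

def block_is_positive_definite(s):
    """Test (7.66): the symmetric block matrix [[s, v^T], [v, U]] > 0."""
    M = np.block([[np.array([[s]]), v[None, :]],
                  [v[:, None], U]])
    return bool(np.all(np.linalg.eigvalsh(M) > 0))

inside = block_is_positive_definite(threshold + 1e-3)    # s above the threshold
outside = block_is_positive_definite(threshold - 1e-3)   # s below the threshold
print(inside, outside)
```

The block matrix is positive definite exactly when s exceeds the quadratic form, since its determinant equals det(U)·(s − vᵀU⁻¹v).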

7.4.10 Characteristic Function and Laplace Principle of Large Deviations

In probability, the “characteristic function” was introduced by Poincaré by analogy with Massieu's work in thermodynamics. For the Koszul–Vinberg characteristic function (KVCF), if we replace the Lebesgue measure by a Borel measure, then we recover the classical definition of the characteristic function in Probability, and the previous KVCF can be compared by analogy with:

$$\Delta_X(z) = \int_{-\infty}^{+\infty} e^{izx}\,dF(x) = \int_{-\infty}^{+\infty} e^{izx}p(x)\cdot dx = E\left[e^{izx}\right] \tag{7.67}$$

Let μ be a positive Borel measure on a Euclidean space V. Assume that the following integral is finite for all y in an open set $\Phi \subset V$:

$$\Delta_x(y) = \int e^{-\langle y, x\rangle}\,d\mu(x) \tag{7.68}$$

For $y \in \Phi$, consider the probability measure:

$$p(y, dx) = \frac{1}{\Delta_x(y)}\,e^{-\langle y, x\rangle}\,d\mu(x) \tag{7.69}$$

then the mean is given by:

$$m(y) = \int x\,p(y, dx) = -\nabla\log\Delta_x(y) \tag{7.70}$$

and the covariance by:

$$\langle V(y)u, v\rangle = \int\langle x - m(y), u\rangle\langle x - m(y), v\rangle\,p(y, dx) = D_uD_v\log\Delta_x(y) \tag{7.71}$$

In large deviation theory, the objective is to study the asymptotic behavior of remote tails of sequences of probability distributions. The theory was initiated by Laplace [10] and unified formally by Sanov [44] and Varadhan [43]. Large deviations theory concerns itself with the exponential decline of the probability measures of extreme, tail or rare events, as the number of observations grows arbitrarily large [45, 46].
Large deviation principle: Let $\{\Sigma_n\}$ be a sequence of random variables indexed by the positive integer n, and let $P(\Sigma_n \in d\varsigma) = P(\Sigma_n \in [\varsigma, \varsigma + d\varsigma])$ denote the probability measure associated with these random variables. We say that $\Sigma_n$ or $P(\Sigma_n \in d\varsigma)$ satisfies a large deviation principle if the limit $I(\varsigma) = \lim_{n\to\infty} -\frac{1}{n}\log P(\Sigma_n \in d\varsigma)$ exists. The function $I(\varsigma)$ defined by this limit is called the rate function; the parameter n of decay is called the speed in large deviation theory. The existence of a large deviation principle for $\Sigma_n$ means concretely that the dominant behavior of $P(\Sigma_n \in d\varsigma)$ is a decaying exponential with n, with rate $I(\varsigma)$: $P(\Sigma_n \in d\varsigma) \approx e^{-n\cdot I(\varsigma)}d\varsigma$ or $p(\Sigma_n = \varsigma) \approx e^{-n\cdot I(\varsigma)}$.
If we write the Legendre–Fenchel transform of a function h(x) as $g(k) = \sup_x\{k\cdot x - h(x)\}$, we can express the following principle and Laplace's method:
Laplace Principle [10]: Let $\{(\Phi_n, F_n, P_n), n \in N\}$ be a sequence of probability spaces, Ψ a complete separable metric space, $\{Y_n, n \in N\}$ a sequence of random variables such that $Y_n$ maps $\Phi_n$ into Ψ, and I a rate function on Ψ. Then, $Y_n$ satisfies the Laplace principle on Ψ with rate function I if for all bounded, continuous functions f mapping Ψ into R:

$$\lim_{n\to\infty}\frac{1}{n}\log E_{P_n}\left[e^{n\cdot f(Y_n)}\right] = \sup_{x\in\Psi}\{f(x) - I(x)\} \tag{7.72}$$

If $Y_n$ satisfies the large deviation principle on Ψ with rate function I, then $P_n(Y_n \in dx) \approx e^{-n\cdot I(x)}dx$:

$$\lim_{n\to\infty}\frac{1}{n}\log E_{P_n}\left[e^{n\cdot f(Y_n)}\right] = \lim_{n\to\infty}\frac{1}{n}\log\int_{\Phi_n} e^{n f(Y_n)}\cdot dP_n = \lim_{n\to\infty}\frac{1}{n}\log\int_{\Psi} e^{n\cdot f(x)}P_n(Y_n \in dx)$$

$$\approx \lim_{n\to\infty}\frac{1}{n}\log\int_{\Psi} e^{n\cdot f(x)}e^{-n\cdot I(x)}\cdot dx = \lim_{n\to\infty}\frac{1}{n}\log\int_{\Psi} e^{n\cdot[f(x) - I(x)]}\cdot dx \tag{7.73}$$

The asymptotic behavior of the last integral is determined by the largest value of the integrand [10]:

$$\lim_{n\to\infty}\frac{1}{n}\log E_{P_n}\left[e^{n\cdot f(Y_n)}\right] = \frac{1}{n}\log\left[\sup_{x\in\Psi}\left\{e^{n\cdot[f(x) - I(x)]}\right\}\right] = \sup_{x\in\Psi}\{f(x) - I(x)\} \tag{7.74}$$
The generating function of $\Sigma_n$ is defined as:

$$\Delta_n(x) = E\left[e^{nx\Sigma_n}\right] = \int e^{nx\varsigma}\,P(\Sigma_n \in d\varsigma) \tag{7.75}$$

In terms of the density $p(\Sigma_n)$, we have instead

$$\Delta_n(x) = \int e^{nx\varsigma}\,p(\Sigma_n = \varsigma)\,d\varsigma \tag{7.76}$$

In both expressions, the integral is over the domain of $\Sigma_n$. The function φ(x) defined by the limit $\varphi(x) = \lim_{n\to\infty}\frac{1}{n}\log\Delta_n(x)$ is called the scaled cumulant generating function of $\Sigma_n$. It is also called the log-generating function or free energy function of $\Sigma_n$. In the following, in thermodynamics, it will be called the Massieu Potential. The existence of this limit is equivalent to writing $\Delta_n(x) \approx e^{n\cdot\varphi(x)}$.
Gärtner–Ellis Theorem: If φ(x) is differentiable, then $\Sigma_n$ satisfies a large deviation principle with rate function $I(\varsigma)$ given by the Legendre–Moreau [158] transform of φ(x):

$$I(\varsigma) = \sup_x\{x\cdot\varsigma - \varphi(x)\} \tag{7.77}$$

Varadhan's Theorem: If $\Sigma_n$ satisfies a large deviation principle with rate function $I(\varsigma)$, then its scaled cumulant generating function φ(x) is the Legendre–Fenchel transform of $I(\varsigma)$:

$$\varphi(x) = \sup_\varsigma\{x\cdot\varsigma - I(\varsigma)\} \tag{7.78}$$

Properties of φ(x): the statistical moments of $\Sigma_n$ are given by the derivatives of φ(x):

$$\left.\frac{d\varphi(x)}{dx}\right|_{x=0} = \lim_{n\to\infty}E\left[\Sigma_n\right] \quad\text{and}\quad \left.\frac{d^2\varphi(x)}{dx^2}\right|_{x=0} = \lim_{n\to\infty}n\,\mathrm{Var}\left[\Sigma_n\right] \tag{7.79}$$

7.4.11 Legendre Mapping by Contact Geometry in Mechanics

The Legendre transform is also classically used to derive the Hamiltonian formalism of mechanics from the Lagrangian formulation. The Lagrangian is a convex function on the tangent space. The Legendre transform gives the Hamiltonian H(p, q) as a function of the coordinates (p, q) of the cotangent bundle, where the inner product used to define the Legendre transform is inherited from the pertinent canonical symplectic structure. More recently, Vladimir Arnold has demonstrated that this Legendre duality could be interpreted in the framework of Contact Geometry with the notion of Legendre mapping [96] (Fig. 7.5).

Fig. 7.5 Legendre transform between Hamiltonian and Lagrangian

Fig. 7.6 Poincaré–Cartan invariant

As described by Vladimir Arnold, in the general case, we can define the Hamiltonian H as the fiberwise Legendre transformation of the Lagrangian L:

$$H(p, q, t) = \sup_{\dot{q}}\left(p\cdot\dot{q} - L(q, \dot{q}, t)\right) \tag{7.80}$$

Due to strict convexity, the supremum $H(p, q, t) = p\cdot\dot{q} - L(q, \dot{q}, t)$ is reached at a unique point $\dot{q}$ such that $p = \partial_{\dot{q}}L(q, \dot{q}, t)$, and $\dot{q} = \partial_p H(p, q, t)$.
If we consider the total differential of the Hamiltonian:

$$\begin{cases} dH = \dot{q}\,dp + p\,d\dot{q} - \partial_q L\,dq - \partial_{\dot{q}}L\,d\dot{q} - \partial_t L\,dt = \dot{q}\,dp - \partial_q L\,dq - \partial_t L\,dt \\ dH = \partial_p H\,dp + \partial_q H\,dq + \partial_t H\,dt\end{cases} \Rightarrow \begin{cases}\dot{q} = \partial_p H \\ -\partial_q L = \partial_q H\end{cases} \tag{7.81}$$

The Euler–Lagrange equation $\partial_t\partial_{\dot{q}}L - \partial_q L = 0$, with $p = \partial_{\dot{q}}L$ and $-\partial_q L = \partial_q H$, provides the second Hamilton equation $\dot{p} = -\partial_q H$, with $\dot{q} = \partial_p H$, in Darboux coordinates.
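The passage from the Lagrangian to the Hamiltonian in (7.80) can be sketched numerically for a harmonic oscillator, L = ½ m q̇² − ½ k q², for which H = p²/(2m) + ½ k q². The oscillator, the constants, the velocity grid and the function names are assumptions chosen for this example.

```python
import numpy as np

mass, stiffness = 2.0, 3.0                     # illustrative constants

def lagrangian(q, q_dot):
    """Harmonic oscillator Lagrangian (assumed example)."""
    return 0.5 * mass * q_dot**2 - 0.5 * stiffness * q**2

def hamiltonian_numeric(p, q, velocities):
    """Discrete version of (7.80): sup over q_dot of p*q_dot - L(q, q_dot)."""
    values = p * velocities - lagrangian(q, velocities)
    best = np.argmax(values)
    return values[best], velocities[best]

velocities = np.linspace(-10.0, 10.0, 200001)  # velocity grid, step 1e-4
p, q = 1.7, -0.4
H_num, q_dot_opt = hamiltonian_numeric(p, q, velocities)
H_closed = p**2 / (2.0 * mass) + 0.5 * stiffness * q**2

print(H_num, H_closed, q_dot_opt, p / mass)
```

The maximizing velocity satisfies p = ∂L/∂q̇ = m q̇, which is the relation stated after (7.80).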
Considering the Pfaffian form $\omega = p\cdot dq - H\cdot dt$ related to the Poincaré–Cartan integral invariant [26], based on $\omega = \partial_{\dot{q}}L\cdot dq - \left(\partial_{\dot{q}}L\cdot\dot{q} - L\right)\cdot dt = L\cdot dt + \partial_{\dot{q}}L\cdot\varpi$ with $\varpi = dq - \dot{q}\cdot dt$, Dedecker [15] has observed that the property that, among all forms $\theta \equiv L\cdot dt \bmod \varpi$, the form $\omega = p\cdot dq - H\cdot dt$ is the only one satisfying $d\theta \equiv 0 \bmod \varpi$, is a particular case of the more general Lepage congruence [16] related to the transversality condition (Fig. 7.6).
Contact geometry has been used in Mechanics [11, 12] and in Thermodynamics [13, 14, 17, 18, 24, 25], where integral submanifolds of dimension n in a (2n + 1)-dimensional contact manifold are called Legendre submanifolds. A smooth fibration of a contact manifold, all of whose fibres are Legendre, is called a Legendre fibration. In the neighbourhood of each point of the total space of a Legendre fibration there exist contact Darboux coordinates (z, q, p) in which the fibration is given by the projection $(z, q, p) \mapsto (z, q)$. Indeed, the fibres (z, q) = const are Legendre subspaces of the standard contact space. A Legendre mapping is a diagram consisting of an embedding of a smooth manifold as a Legendre submanifold in the total space of a Legendre fibration, and the projection of the total space of the Legendre fibration onto the base. Let us consider the two Legendre fibrations of the standard contact space $R^{2n+1}$ of 1-jets of functions on $R^n$: $(u, p, q) \mapsto (u, q)$ and $(u, p, q) \mapsto (p\cdot q - u, p)$. The projection of the 1-graph of a function u = S(q) onto the base of the second fibration gives a Legendre mapping:

$$q \mapsto \left(q\,\frac{\partial S}{\partial q} - S(q),\ \frac{\partial S}{\partial q}\right) \tag{7.82}$$

If S is convex, the front of this mapping is the graph of a convex function, the Legendre transform of the function S:

$$\left(S^*(p),\ p\right) \tag{7.83}$$

with $S^*(p) = q\cdot p - S(q)$ where $p = \frac{\partial S}{\partial q}$.


More generally as considered by Vladimir Arnold, all Symplectic geometry of
even-dimensional phase spaces has an odd-dimensional twin contact geometry. The
relation between contact geometry and symplectic geometry is similar to the relation
between linear algebra and projective geometry. Any fact in symplectic geometry
can be formulated as a contact geometry fact and vice versa. The calculations are
simpler in the symplectic setting, but their geometric content is better seen in the
contact version. The functions and vector fields of symplectic geometry are replaced
by hypersurfaces and line fields in contact geometry.
Each contact manifold has a symplectization, which is a symplectic manifold
whose dimension exceeds that of the contact manifold by one. Inversely, Symplectic
manifolds have contactizations whose dimensions exceed their own dimensions by
one. If a manifold has a serious reason to be odd dimensional it usually carries a
natural contact structure. V. Arnold said that “symplectic geometry is all geometry,” but that he preferred to formulate it in a more geometrical form: “contact geometry is all geometry”.
The relation between contact structures and Legendre submanifolds is defined by:
• A contact structure on an odd-dimensional manifold $M^{2n+1}$ is a field of hyperplanes (of linear subspaces of codimension 1) in the tangent spaces to M at all its points.
• All the generic fields of hyperplanes of a manifold of a fixed dimension are locally equivalent. They define the (local) contact structures.
As an example, a 1-jet of a function:

$$y = f(x_1, \ldots, x_n) \tag{7.84}$$

at a point x of the manifold $V^n$ is defined by the point

$$(x, y, p) \in R^{2n+1} \quad\text{where}\quad p_i = \partial f/\partial x_i \tag{7.85}$$

The natural contact structure of this space is defined by the following condition: the 1-graphs $\{x, y = f(x), p = \partial f/\partial x\} \subset J^1(V^n, R)$ of all the functions on V should be tangent to the structure hyperplane at every point. In coordinates, this condition means that the 1-form $dy - p\cdot dx$ should vanish on the hyperplanes of the contact field (Fig. 7.7).

Fig. 7.7 Contact geometry by V. Arnold
In thermodynamics, the Gibbs 1-form dy − p · dx will be given by:

dE = T · dS − p · dV (7.86)

A contact structure on a manifold is a non-degenerate field of tangent hyper-planes:


• The manifold of contact elements in projective space coincides with the manifold
of contact elements of the dual projective space
• A contact element in projective space is a pair, consisting of a point of the space
and of a hyper-plane containing this point. The hyper-plane is a point of the dual
projective space and the point of the original space defines a contact element of
the dual space.

The manifold of contact elements of the projective space has two natural contact
structures:
• The first is the natural contact structure of the manifold of contact elements of the
original projective space.
• The second is the natural contact structure of the manifold of contact elements of
the dual projective space.
The dual of the dual hyper-surface is the initial hyper-surface (at least if both are
smooth for instance for the boundaries of convex bodies). The affine or coordinate
version of the projective duality is called the Legendre transformation. Thus contact
geometry is the geometrical base of the theory of Legendre transformation.

7.4.12 Massieu Characteristic Function and Duhem Potentials in Thermodynamics

In 1869, François Massieu, a French engineer from the Corps des Mines, presented two papers to the French Academy of Sciences on the “characteristic function” in thermodynamics [3–5]. Massieu demonstrated that some mechanical and thermal properties of physical and chemical systems could be derived from two potentials called “characteristic functions”. The infinitesimal amount of heat dQ received by a body produces external work of dilatation, internal work, and an increase of the body’s sensible heat. The last two effects could not be identified separately and are noted dE (the function E accounts for the sum of mechanical and thermal effects by equivalence between heat and work). The external work P · dV is thermally equivalent to A · P · dV (with A the conversion factor between mechanical and thermal measures). The first principle provides dQ = dE + A · P · dV. For a closed reversible cycle (Joule/Carnot principles), $\oint \frac{dQ}{T} = 0$, so that $\frac{dQ}{T}$ is the complete differential dS of a function S: $dS = \frac{dQ}{T}$.
If we select volume V and temperature T as independent variables:

T · dS = dQ ⇒ T · dS − dE = A · P · dV ⇒ d (TS) − dE = S · dT + A · P · dV (7.87)

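If we set H = TS − E, then we have
$$dH = S \cdot dT + A \cdot P \cdot dV = \frac{\partial H}{\partial T} \cdot dT + \frac{\partial H}{\partial V} \cdot dV$$ (7.88)
Massieu called H the “characteristic function” because all body characteristics can be deduced from this function: $S = \frac{\partial H}{\partial T}$, $P = \frac{1}{A}\frac{\partial H}{\partial V}$ and $E = TS - H = T\frac{\partial H}{\partial T} - H$.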

If we select pressure P and temperature T as independent variables, the Massieu characteristic function is then given by

$$H' = H - AP \cdot V$$ (7.89)

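We have

$$dH' = dH - AP \cdot dV - AV \cdot dP = S \cdot dT - AV \cdot dP = \frac{\partial H'}{\partial T} \cdot dT + \frac{\partial H'}{\partial P} \cdot dP$$ (7.90)
And we can deduce:
$$S = \frac{\partial H'}{\partial T} \quad \text{and} \quad V = -\frac{1}{A}\frac{\partial H'}{\partial P}$$ (7.91)
And inner energy:

$$E = TS - H = TS - H' - AP \cdot V \Rightarrow E = T\frac{\partial H'}{\partial T} - H' + P \cdot \frac{\partial H'}{\partial P}$$ (7.92)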
The most important result of Massieu consists in deriving all body properties dealing with thermodynamics from the characteristic function and its derivatives: “je montre, dans ce mémoire, que toutes les propriétés d’un corps peuvent se déduire d’une fonction unique, que j’appelle la fonction caractéristique de ce corps” [5] (“I show, in this memoir, that all the properties of a body can be deduced from a single function, which I call the characteristic function of this body”).
Massieu’s results were extended by Gibbs, who showed that Massieu functions play the role of potentials in the determination of the states of equilibrium in a given system, and by Duhem [6–9], who in a more general presentation introduced thermodynamic potentials in putting forward the analytic development of the mechanical theory of heat.
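In thermodynamics, the Massieu potential is the Legendre transform of the Entropy, and depends on the inverse temperature β = 1/kT:

$$\phi(\beta) = -k\beta \cdot F = S - E/T \quad \text{where } F \text{ is the Free Energy } F = E - TS$$ (7.93)

$$-\frac{\partial \phi(\beta)}{\partial (k\beta)} = \frac{\partial (k\beta F)}{\partial (k\beta)} = F + k\beta \frac{\partial F}{\partial (k\beta)} = F + k\beta \left(\frac{\partial F}{\partial T}\right)_V \frac{\partial T}{\partial (k\beta)}$$ (7.94)

$$-\frac{\partial \phi(\beta)}{\partial (k\beta)} = F + k\beta(-S)\left(-\frac{1}{(k\beta)^2}\right) = F + \frac{S}{k\beta} = (E - TS) + TS = E$$ (7.95)

with E the inner energy. The Legendre transform of the Massieu potential provides the Entropy S, up to sign:

$$L(\phi) = k\beta \cdot \frac{\partial \phi(\beta)}{\partial (k\beta)} - \phi(\beta) = k\beta \cdot (-E) + k\beta \cdot F = k\beta (F - E) = -S$$ (7.96)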
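
The relations (7.95) and (7.96) can be checked numerically on a toy model. The sketch below is only an illustration under stated assumptions: it uses a two-level system with energies 0 and `eps` (a hypothetical choice, not taken from the text) and units where k = 1, so that kβ = β and the Massieu potential is φ = −βF = log Z. The function names are ours.

```python
import math

def massieu_potential(beta, eps=1.0):
    """Massieu potential phi = -beta * F = log Z of a two-level system (k = 1)."""
    return math.log(1.0 + math.exp(-beta * eps))

def mean_energy(beta, eps=1.0):
    """Inner energy E = eps * exp(-beta * eps) / Z."""
    w = math.exp(-beta * eps)
    return eps * w / (1.0 + w)

def gibbs_entropy(beta, eps=1.0):
    """Entropy S = -sum p log p of the two Boltzmann probabilities (k = 1)."""
    w = math.exp(-beta * eps)
    z = 1.0 + w
    return -sum(p * math.log(p) for p in (1.0 / z, w / z))

def legendre_relations(beta, eps=1.0, h=1e-5):
    """Return (-dphi/dbeta, beta * dphi/dbeta - phi) by central differences."""
    dphi = (massieu_potential(beta + h, eps)
            - massieu_potential(beta - h, eps)) / (2.0 * h)
    return -dphi, beta * dphi - massieu_potential(beta, eps)

if __name__ == "__main__":
    beta = 0.7
    energy, legendre = legendre_relations(beta)
    print(energy, mean_energy(beta))        # both approximate E, as in (7.95)
    print(legendre, -gibbs_entropy(beta))   # both approximate -S, as in (7.96)
```

The derivative is taken by finite differences rather than analytically, so the agreement is only up to discretization error.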

On these bases, Duhem founded Mechanics on the principles of Thermodynamics. Duhem defined a more general potential function Φ = G(E − TS) + W, with G the mechanical equivalent of heat. In the case of constant volume, W = 0, the potential Φ becomes the Helmholtz Free Energy, and in the case of constant pressure, W = P·V, the potential Φ becomes the Gibbs–Duhem Free Enthalpy. Duhem wrote: “Nous avons fait de la Dynamique un cas particulier de la Thermodynamique, une Science qui embrasse dans des principes communs tous les changements d’état des corps, aussi bien les changements de lieu que les changements de qualités physiques” [6] (“We have made Dynamics a particular case of Thermodynamics, a Science that embraces in common principles all the changes of state of bodies, changes of place as well as changes of physical qualities”).

Fig. 7.8 Thermodynamic square of Max Born (V volume, P pressure, T temperature, S entropy, U energy, F free “Helmholtz” energy, H enthalpy, G free “Gibbs–Duhem” enthalpy)

Classically, thermodynamic potentials and relations are synthesized in the thermodynamic square introduced by Max Born. The corners represent common conjugate variables while the sides represent thermodynamic potentials. The placement and relation among the variables serve as a key to recall the relations they constitute (Fig. 7.8).
In his “Thermodynamique générale”, Duhem credited four scientists with having carried out “the most important researches on that subject”:
• F. Massieu, who had derived Thermodynamics from a “characteristic function and its partial derivatives”
• J. W. Gibbs, who showed that Massieu’s functions “could play the role of potentials in the determination of the states of equilibrium” in a given system
• H. von Helmholtz, who had put forward “similar ideas”
• A. von Oettingen, who had given “an exposition of Thermodynamics of remarkable generality” based on a general duality concept in “Die thermodynamischen Beziehungen antithetisch entwickelt”, St. Petersburg 1885.
Arthur Joachim von Oettingen developed duality in parallel in Thermodynamics and in Music [66–76]. In Music, he introduced Harmonic Dualism: categories of music-theoretical work that accept the absolute structural equality of major and minor triads as objects derived from a single, unitary process that structurally contains the potential for twofold, or binary, articulation. The tonic fundamental of the major triad is the structural parallel of the phonic overtone of the minor triad: in each case these tones are consonant with their respective chord. The phonic overtone of the major triad and the tonic fundamental of the minor triad are dissonant with their respective chord. Oettingen introduced a topographic version of the major–minor opposition: “All pure consonant triads stand in the form of right triangles, whose hypotenuses all form a diagonal minor third. In the major klang, the right angle is oriented to the top (of the diagram); in the minor klang, the right angle is oriented to the bottom”.
To conclude this chapter on the “characteristic function” in thermodynamics, we introduce the recent work of Pavlov and Sergeev [111] from the Steklov Mathematical Institute, who have used ideas of Koszol [112], Berezin [113], Poincaré [114] and Caratheodory [115] to analyze the geometry of the space of thermodynamical states and to demonstrate that the differential-geometric structure of this space selects entropy as the Lagrange function in mechanics, in the sense that determining the entropy completely characterizes a thermodynamic system (this development makes a link with our Sect. 7.4.11 related to the Legendre transform in Mechanics). They have demonstrated that thermodynamics is not separated from other branches of theoretical physics on the level of the mathematical apparatus.
Technically, both analytic mechanics and thermodynamics are based on the same
dynamical principle, which we can use to obtain the thermodynamic equations as
equations of motion of a system with non-holonomic constraints. The special feature
of thermodynamics is that the number of these constraints coincides with the number
of degrees of freedom.
Using the notation β = 1/(k · T), the laws of equilibrium thermodynamics are based on the Gibbs distribution:
$$p_{Gibbs}(H) = \frac{1}{Z} e^{-\beta H}$$ (7.97)
where H(p, q, a) is the Hamilton function depending on the external parameters $a = (a_1, \ldots, a_k)$, with the normalizing factor:
$$Z(\beta, a) = \int_N e^{-\beta H} d\omega_N$$ (7.98)

and the internal Energy:

$$E = \langle H \rangle = -\frac{\partial \log Z}{\partial \beta} = \int_N H \cdot \frac{e^{-\beta H}}{Z} d\omega_N = \int_N H \cdot p_{Gibbs}(H) \cdot d\omega_N$$ (7.99)

The use of the Gibbs distribution to derive thermodynamic states is based on the theorem formulated by Berezin [113] (the entropy functional subject to the internal energy and normalization constraints has a unique maximum at the “point” given by the Gibbs distribution).
P.V. Pavlov and V.M. Sergeev define the generalized forces by the formula:
$$A = -\int_N \frac{\partial H}{\partial a} \cdot p_{Gibbs}(H) \cdot d\omega_N = \frac{1}{\beta}\frac{\partial \log Z}{\partial a}$$ (7.100)

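The 1-form of the heat source is given by:

$$\omega_Q = dE + A \cdot da = -d\left(\frac{\partial \log Z}{\partial \beta}\right) + \frac{1}{\beta}\frac{\partial \log Z}{\partial a} \cdot da$$ (7.101)

Using the relation:

$$\beta^{-1} d\left(\beta \frac{\partial \log Z}{\partial \beta}\right) = \beta^{-1}\frac{\partial \log Z}{\partial \beta} d\beta + d\left(\frac{\partial \log Z}{\partial \beta}\right)$$ (7.102)

They obtain correspondence with the Shannon definition of the entropy

$$\beta \omega_Q = d\left(\log Z - \beta \frac{\partial \log Z}{\partial \beta}\right) = d\left(-\int_N p_{Gibbs}(H) \cdot \log p_{Gibbs}(H) \cdot d\mu_N\right) = dS$$ (7.103)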

providing:
• the second law of thermodynamics:

ωQ = T · dS (7.104)

• the Gibbs equation:

dE = T · dS − A · da or dF = −S · dT − A · da with F = E − TS (7.105)

From the last Gibbs equation, the coefficients of the 1-forms are gradients

$$S = -\frac{\partial F}{\partial T} \quad \text{and} \quad A = -\frac{\partial F}{\partial a}$$ (7.106)
and the free energy is the first integral of the system of differential equations.
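
The identities (7.99), (7.100) and (7.103) can be illustrated on a finite toy system. The sketch below is not Pavlov and Sergeev’s construction: it assumes a hypothetical four-level spectrum $H_n(a) = c_n / a^2$ with a single external parameter a (a particle-in-a-box-like scaling chosen only for illustration), replaces the phase-space integral by a finite sum, and takes the derivatives of log Z by finite differences.

```python
import math

COEFFS = (1.0, 4.0, 9.0, 16.0)  # hypothetical level coefficients c_n = n^2

def energies(a):
    """Toy spectrum H_n(a) = c_n / a^2 depending on one external parameter a."""
    return [c / a ** 2 for c in COEFFS]

def log_z(beta, a):
    """log of the normalizing factor Z(beta, a), cf. (7.98)."""
    return math.log(sum(math.exp(-beta * e) for e in energies(a)))

def gibbs(beta, a):
    """Gibbs probabilities exp(-beta * H_n) / Z, cf. (7.97)."""
    z = math.exp(log_z(beta, a))
    return [math.exp(-beta * e) / z for e in energies(a)]

def internal_energy(beta, a):
    return sum(p * e for p, e in zip(gibbs(beta, a), energies(a)))

def generalized_force(beta, a):
    """A = -<dH/da>, with dH_n/da = -2 c_n / a^3 for this toy spectrum."""
    return sum(p * 2.0 * c / a ** 3 for p, c in zip(gibbs(beta, a), COEFFS))

def shannon_entropy(beta, a):
    return -sum(p * math.log(p) for p in gibbs(beta, a))

def dlogz_dbeta(beta, a, h=1e-5):
    return (log_z(beta + h, a) - log_z(beta - h, a)) / (2.0 * h)

def dlogz_da(beta, a, h=1e-5):
    return (log_z(beta, a + h) - log_z(beta, a - h)) / (2.0 * h)

if __name__ == "__main__":
    b, a = 0.8, 1.5
    print(internal_energy(b, a), -dlogz_dbeta(b, a))               # (7.99)
    print(generalized_force(b, a), dlogz_da(b, a) / b)             # (7.100)
    print(shannon_entropy(b, a), log_z(b, a) - b * dlogz_dbeta(b, a))  # (7.103)
```

Each printed pair should agree up to finite-difference error; the entropy here is dimensionless (in units of k).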
Pavlov and Sergeev note that the matrix of the 2-form $\Phi_Q = d\omega_Q$ is degenerate and has rank 2 (the nonzero elements of the 2-form matrix $\Phi_Q$ are concentrated in the first row and first column):

$$\Phi_Q = \frac{\partial A}{\partial T} dT \wedge da = -\frac{\partial^2 F}{\partial a \partial T} dT \wedge da$$ (7.107)
They observe that from the standpoint of the analogy with mechanical dynamical
systems, the state space of a thermodynamic system (with local coordinates) is a
configuration space. By physical considerations (temperature, volume, mean particle
numbers, and other characteristics are bounded below by zero), it is a manifold with
a boundary. They obtain the requirement natural for foliation theory that the fibers
be transverse to the two-dimensional underlying manifold.
In the Pavlov and Sergeev formulation, the thermodynamic analysis of a dynamical system begins with a heuristic definition of the system entropy as a function of dynamical variables, constructed based on general symmetry properties, and the geometric structure of the space of thermodynamic states is fixed to be a foliation of codimension two.

7.4.13 Souriau’s Geometric Temperature and Covariant Definition of Thermodynamic Equilibriums

Jean-Marie Souriau in [97–106], student of Elie Cartan at ENS Ulm, has given a
covariant definition of thermodynamic equilibriums and has formulated statistical
mechanics [161, 162] and thermodynamics in the framework of Symplectic Geom-
etry by use of symplectic moments and distribution-tensor concepts, giving a geo-
metric status for temperature and entropy. This work has been extended by Vallée
and Saxcé [107, 110, 156], Iglésias [108, 109] and Dubois [157].
The first general definition of the “moment map” (constant of the motion for dynamical systems) was introduced by J. M. Souriau during the 1970s, as a geometric generalization of such earlier notions as the Hamiltonian and the invariant theorem of Emmy Noether describing the connection between symmetries and invariants (it is the moment map for a one-dimensional Lie group of symmetries). In symplectic geometry the analog of Noether’s theorem is the statement that the moment map of a Hamiltonian action which preserves a given time evolution is itself conserved by this time evolution. The conservation of the moment of a Hamiltonian action was called by Souriau the “Symplectic or Geometric Noether theorem” (considering phase space as a symplectic manifold, the cotangent fiber of configuration space with its canonical symplectic form, if the Hamiltonian has a Lie algebra of symmetries, the moment map is constant along the system’s integral curves; the Noether theorem is obtained by considering each component of the moment map independently).
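In previous approach based on Koszul work, we have defined two convex functions $\Omega(x)$ and $\Omega^*(x^*)$ with dual system of coordinates $x$ and $x^*$ on dual cones $\Phi$ and $\Phi^*$:

$$\Omega(x) = -\log \int_{\Phi^*} e^{-\langle \Gamma, x \rangle} d\Gamma \quad \forall x \in \Phi \quad \text{and} \quad \Omega^*(x^*) = \langle x, x^* \rangle - \Omega(x) = -\int_{\Phi^*} p_x(\Gamma) \log p_x(\Gamma) d\Gamma$$ (7.108)

where
$$x^* = \int_{\Phi^*} \Gamma \cdot p_x(\Gamma) d\Gamma \quad \text{and} \quad p_x(\Gamma) = \frac{e^{-\langle \Gamma, x \rangle}}{\int_{\Phi^*} e^{-\langle \Gamma, x \rangle} d\Gamma} = e^{-\langle x, \Gamma \rangle - \log \int_{\Phi^*} e^{-\langle \Gamma, x \rangle} d\Gamma} = e^{-\langle x, \Gamma \rangle + \Omega(x)}$$ (7.109)

with
$$x^* = \frac{\partial \Omega(x)}{\partial x} \quad \text{and} \quad x = \frac{\partial \Omega^*(x^*)}{\partial x^*}$$ (7.110)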
Souriau introduced these relations in the framework of variational problems to extend them with a covariant definition. Let M be a differentiable manifold with a continuous positive density dω, let E be a finite-dimensional vector space and let U(Γ) be a continuous function defined on M with values in E. A continuous positive function p(Γ) that solves the variational problem:
$$\underset{p(\Gamma)}{\mathrm{ArgMax}} \left[ s = -\int_M p(\Gamma) \log p(\Gamma) d\omega \right] \quad \text{such that} \quad \begin{cases} \int_M p(\Gamma) d\omega = 1 \\ \int_M U(\Gamma) p(\Gamma) d\omega = Q \end{cases}$$ (7.111)

is given by:

$$p(\Gamma) = e^{\Omega(\beta) - \beta \cdot U(\Gamma)} \quad \text{with} \quad \Omega(\beta) = -\log \int_M e^{-\beta \cdot U(\Gamma)} d\omega \quad \text{and} \quad Q = \frac{\int_M U(\Gamma) e^{-\beta \cdot U(\Gamma)} d\omega}{\int_M e^{-\beta \cdot U(\Gamma)} d\omega}$$ (7.112)

Entropy $s = -\int_M p(\Gamma) \log p(\Gamma) d\omega$ can be stationary only if there exist a scalar Ω and an element β belonging to the dual of E, where Ω and β are Lagrange parameters associated with the previous constraints. Entropy appears naturally as the Legendre transform of Ω:
$$s(Q) = \beta \cdot Q - \Omega(\beta)$$ (7.113)
This value is a strict maximum of s, and the equation $Q = \frac{\int_M U(\Gamma) e^{-\beta \cdot U(\Gamma)} d\omega}{\int_M e^{-\beta \cdot U(\Gamma)} d\omega}$ has at most one solution for each value of Q. The function Ω(β) is differentiable and we can write dΩ = dβ · Q and, identifying E with its bidual:

$$Q = \frac{\partial \Omega}{\partial \beta}$$ (7.114)

Uniform convergence of $\int_M U(\Gamma) \otimes U(\Gamma) e^{-\beta \cdot U(\Gamma)} d\omega$ proves that $-\frac{\partial^2 \Omega}{\partial \beta^2} > 0$ and that −Ω(β) is convex. Then, Q(β) and β(Q) are mutually inverse and differentiable, and ds = β · dQ. Identifying E with its bidual:

$$\beta = \frac{\partial s}{\partial Q}$$ (7.115)
Classically, if we take $U(\Gamma) = \begin{pmatrix} \Gamma \\ \Gamma \otimes \Gamma \end{pmatrix}$, the components of Q provide the moments of first and second order of the probability density p(Γ), which is then a Gaussian law.
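
The Gaussian case can be made concrete in one dimension. The following sketch is an illustration under our own assumptions, not Souriau’s derivation: it takes M to be the real line (truncated to a finite grid), U(x) = (x, x²) and β = (b1, b2) with b2 > 0, evaluates Ω(β) by a simple Riemann sum, and checks numerically that Q = ∂Ω/∂β and s = β · Q − Ω(β), as in (7.113) and (7.114). Grid bounds and function names are hypothetical choices.

```python
import math

def _grid(lo=-20.0, hi=20.0, n=4001):
    dx = (hi - lo) / (n - 1)
    return [lo + i * dx for i in range(n)], dx

def omega(b1, b2):
    """Omega(beta) = -log integral exp(-b1*x - b2*x^2) dx, by a Riemann sum."""
    xs, dx = _grid()
    return -math.log(sum(math.exp(-b1 * x - b2 * x * x) for x in xs) * dx)

def density(b1, b2):
    """Grid values of p(x) = exp(Omega - b1*x - b2*x^2), cf. (7.112)."""
    xs, dx = _grid()
    om = omega(b1, b2)
    return xs, [math.exp(om - b1 * x - b2 * x * x) for x in xs], dx

def moments(b1, b2):
    """Q = (E[x], E[x^2]) under p."""
    xs, p, dx = density(b1, b2)
    q1 = sum(x * pi for x, pi in zip(xs, p)) * dx
    q2 = sum(x * x * pi for x, pi in zip(xs, p)) * dx
    return q1, q2

def entropy(b1, b2):
    """s = -integral p log p dx on the grid."""
    xs, p, dx = density(b1, b2)
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0) * dx

if __name__ == "__main__":
    b1, b2, h = 0.4, 0.8, 1e-5
    q1, q2 = moments(b1, b2)
    d_omega_db1 = (omega(b1 + h, b2) - omega(b1 - h, b2)) / (2.0 * h)
    print(q1, d_omega_db1)                                 # Q_1 = dOmega/db1
    print(entropy(b1, b2), b1 * q1 + b2 * q2 - omega(b1, b2))  # s = beta.Q - Omega
```

With these parameters the density is Gaussian with mean −b1/(2·b2) and variance 1/(2·b2), so the entropy can also be compared with the closed form ½ log(2πe σ²).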
Souriau applied this approach to classical statistical mechanical systems. Considering a mechanical system with n parameters $q_1, \ldots, q_n$, its movement can be defined by its phase at an arbitrary time t on a manifold of dimension 2n: $q_1, \ldots, q_n, p_1, \ldots, p_n$.
Liouville theorem shows that coordinate changes have a Jacobian equal to unity, and a Liouville density can be defined on the manifold M: $d\omega = dq_1 \ldots dq_n \, dp_1 \ldots dp_n$, which does not depend on the choice of t.
A system state is one point on the 2n-manifold M and a statistical state is a law of probability defined on M such that $\int_M p(\Gamma) d\omega = 1$, and its time evolution is driven by $\frac{\partial p}{\partial t} = \sum_j \left( \frac{\partial p}{\partial p_j}\frac{\partial H}{\partial q_j} - \frac{\partial p}{\partial q_j}\frac{\partial H}{\partial p_j} \right)$ where H is the Hamiltonian.
A thermodynamic equilibrium is a statistical state that maximizes the entropy:

$$s = -\int_M p(\Gamma) \log p(\Gamma) d\omega$$ (7.116)

among all states giving the mean value of energy Q:

$$\int_M H(\Gamma) \cdot p(\Gamma) d\omega = Q$$ (7.117)

Applied to free particles, for an ideal gas, equilibrium is given for $\beta = \frac{1}{kT}$ (with k the Boltzmann constant) and, if we set S = k · s, the previous relation ds = β · dQ provides $dS = \frac{dQ}{T}$ and $S = \int \frac{dQ}{T}$, and Ω(β) is identified with the Massieu–Duhem potential.
We recover also the Maxwell speed law:
$$p(\Gamma) = \mathrm{const} \cdot e^{-\frac{H}{kT}}$$ (7.118)

The main discovery of Jean-Marie Souriau is that the previous thermodynamic equilibrium is not covariant from a relativistic point of view. He then proposed a covariant definition of thermodynamic equilibrium of which the previous definition is a particular case. In the previous formalization, the manifold M was the solution of the variational problem:
$$d\int_{t_0}^{t_1} l\left(t, q_j, \frac{dq_j}{dt}\right) dt = 0 \quad \text{with} \quad p_j = \frac{\partial l}{\partial \dot{q}_j}$$ (7.119)

We can then consider the time variable t like the other variables $q_j$, through an arbitrary parameter τ, and define the new variational problem by:

$$d\int_{t_0}^{t_1} L(q_J, \dot{q}_J) d\tau = 0 \quad \text{with} \quad t = q_{n+1},\ \dot{q}_J = \frac{dq_J}{d\tau} \quad \text{and} \quad J = 1, 2, \ldots, n+1$$ (7.120)

where
$$L(q_J, \dot{q}_J) = l\left(t, q_j, \frac{\dot{q}_j}{\dot{t}}\right) \dot{t}$$ (7.121)

Variables $p_j$ are not changed and we have the relation:

$$p_{n+1} = l - \sum_j p_j \cdot \frac{dq_j}{dt}$$ (7.122)

If we compare with Sect. 7.4.11 on classical mechanics, we have:

$$p_{n+1} = -H \quad \text{with} \quad H = \sum_j p_j \cdot \frac{dq_j}{dt} - l \quad (H \text{ is the Legendre transform of } l)$$ (7.123)
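
The relation $p_{n+1} = -H$ of (7.123) can be checked numerically. The sketch below is only an illustration: it assumes a one-degree-of-freedom harmonic oscillator Lagrangian (a hypothetical choice, with arbitrary mass m and stiffness k) and the standard reparametrized Lagrangian $L = l(t, q, \dot{q}/\dot{t}) \cdot \dot{t}$, where dots denote derivatives with respect to the parameter τ. The momenta are obtained by finite differences.

```python
def lagrangian(t, q, v, m=2.0, k=3.0):
    """Toy Lagrangian l(t, q, dq/dt) of a harmonic oscillator (hypothetical choice)."""
    return 0.5 * m * v * v - 0.5 * k * q * q

def homogeneous_lagrangian(q, t, q_dot, t_dot):
    """L = l(t, q, q_dot / t_dot) * t_dot, with dots meaning d/dtau and t = q_{n+1}."""
    return lagrangian(t, q, q_dot / t_dot) * t_dot

def conjugate_momenta(q, t, q_dot, t_dot, h=1e-6):
    """Return (p, p_{n+1}) = (dL/dq_dot, dL/dt_dot) by central differences."""
    p = (homogeneous_lagrangian(q, t, q_dot + h, t_dot)
         - homogeneous_lagrangian(q, t, q_dot - h, t_dot)) / (2.0 * h)
    p_time = (homogeneous_lagrangian(q, t, q_dot, t_dot + h)
              - homogeneous_lagrangian(q, t, q_dot, t_dot - h)) / (2.0 * h)
    return p, p_time

def hamiltonian(t, q, v, m=2.0, k=3.0):
    """H = p * dq/dt - l with p = dl/dv = m * v."""
    return m * v * v - lagrangian(t, q, v, m, k)

if __name__ == "__main__":
    q, t, q_dot, t_dot = 0.3, 0.0, 1.2, 1.7
    v = q_dot / t_dot                      # physical velocity dq/dt
    p, p_time = conjugate_momenta(q, t, q_dot, t_dot)
    print(p, 2.0 * v)                      # p_j is unchanged: m * dq/dt
    print(p_time, -hamiltonian(t, q, v))   # p_{n+1} = -H
```

The momentum conjugate to q is the same as in the original problem, while the momentum conjugate to t comes out as minus the energy.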

H is the energy of the system, which is conservative if the Lagrangian does not depend explicitly on time t. It is a particular case of the Noether theorem: if the Lagrangian L is invariant under an infinitesimal transform $dQ_J = F_J(Q_K)$, then $u = \sum_J p_J \, dQ_J$ is a first integral of the variational equations. As Energy is not the conjugate variable of time t, or the value provided by Noether theorem by system invariance to time translation, the thermodynamical equilibrium is not covariant. Then, J. M. Souriau proposes a new covariant definition of thermodynamic equilibrium:
Let a mechanical system have a Lagrangian invariant under a Lie group G. Equilibrium states by group G are statistical states that maximize the Entropy, while providing given mean values to all variables associated by the Noether theorem with infinitesimal transforms of group G.
The Noether theorem allows associating to every system movement Γ a value U(Γ) belonging to the dual $A^*$ of the Lie algebra A of group G. U(Γ) is called the moment of the group.
For each derivation δ of this Lie algebra, we take:

$$U(\Gamma)(\delta) = \sum_J p_J \cdot \delta Q_J$$ (7.124)

With the previous development, as E is the dual of A, the value β belongs to this Lie algebra A and is a geometric generalization of thermodynamic temperature. The value Q is a generalization of heat and belongs to the dual of A.
An equilibrium state exists having the largest entropy, with a distribution function p(Γ) that is the exponential of an affine function of U:

$$p(\Gamma) = e^{\Omega(\beta) - \beta \cdot U(\Gamma)} \quad \text{with} \quad \Omega(\beta) = -\log \int_M e^{-\beta \cdot U(\Gamma)} d\omega \quad \text{and} \quad Q = \frac{\int_M U(\Gamma) e^{-\beta \cdot U(\Gamma)} d\omega}{\int_M e^{-\beta \cdot U(\Gamma)} d\omega}$$ (7.125)

with
$$s(Q) = \beta \cdot Q - \Omega(\beta), \quad d\Omega = d\beta \cdot Q \quad \text{and} \quad ds = \beta \cdot dQ$$ (7.126)
A statistical state p(Γ) is invariant by δ if δ[p(Γ)] = 0 for all Γ (then p(Γ) is invariant by the finite transforms of G generated by δ).
J. M. Souriau gave the following theorem:
An equilibrium state allowed by a group G is invariant by an element δ of the Lie algebra A if and only if [δ, β] = 0 (with [., .] the Lie bracket), with β the generalized equilibrium temperature,
with $\beta = \frac{\partial s}{\partial Q}$ and $\int_M H(\Gamma) \cdot p(\Gamma) d\omega = Q$, where $s = -\int_M p(\Gamma) \log p(\Gamma) d\omega$ and $\int_M p(\Gamma) d\omega = 1$.
For classical thermodynamics, where G is the abelian group of translations with respect to time t, all equilibrium states are invariant under G.
For the group of transformations of space–time, elements of the Lie algebra of G can be defined as vector fields in space–time. The generalized temperature β previously defined is then also defined as a vector field. For each point of the manifold M, we can then define:
• Vector temperature:
$$\beta_M = \frac{V}{kT}$$ (7.127)
with
• Unitary mean speed:
$$V = \frac{\beta_M}{|\beta_M|} \quad \text{with} \quad |V| = 1$$ (7.128)

• Eigen absolute temperature:

$$T = \frac{1}{k \cdot |\beta_M|}$$ (7.129)

The classical formulas of thermodynamics are thus generalized, but the variables are defined with a geometrical status, like the geometrical temperature $\beta_M$, an element of the Lie algebra of the Galileo or Poincaré groups, interpreted as a field of space–time vectors. Souriau proved that in the relativistic version $\beta_M$ is a time-like vector with an orientation that characterizes the arrow of time.

7.4.14 Quantum Characteristic Function: Von Neumann Entropy and Balian Quantum Fisher Metric

The Information Geometry metric has been extended to Quantum Physics. In 1986, Roger Balian introduced a natural metric structure for the space of states D̂ in quantum mechanics, from which we can deduce the distance between a state and one of its approximations [88–90, 92–95]. Based on quantum information theory, Roger Balian built this metric on physical grounds as a Hessian metric:

$$ds^2 = -d^2 S(\hat{D})$$ (7.130)

from $S(\hat{D})$, von Neumann’s entropy:

$$S(\hat{D}) = -\mathrm{Tr}\left[\hat{D} \log(\hat{D})\right]$$ (7.131)

Quantum mechanics uses concepts of


• “observable” Ô: Random variables, elements of a non-commutative C*-algebra
of operators describing a physical quantity of the system like angular momentum.
For finite systems, “observables” Ô are represented by Hermitian matrices acting
in a Hilbert space H.
• “state” E(Ô): Quantum Expected Value considered as Information on the system
allowing to make predictions about it. A quantum state is on the same footing
as a probability distribution in classical physics and a state in classical statistical
mechanics, except for the non-commutation of the observables. (this “expectation”
cannot be given simultaneously for two non-commuting observables which cannot
be measured with the same apparatus).
When observables are represented by matrices in a Hilbert space H, the state can be implemented in terms of a matrix D̂ in the space H:
$$E(\hat{O}) = \mathrm{Tr}\left[\hat{D}\hat{O}\right]$$ (7.132)

where D̂ is the density matrix, which is a representation of the state, and is

• Hermitian: as E(Ô) is real for $\hat{O}^+ = \hat{O}$
• Normalized: as E(Î) = 1 for the unit observable Î
• Non-negative: as $E(\hat{O}^2) \geq 0$ for any Ô.
Through $E(\hat{O}) = \mathrm{Tr}[\hat{D}\hat{O}]$, each state could be considered as a linear mapping $E(\hat{O}) = \langle \hat{D}, \hat{O} \rangle$ of the vector space of observables Ô onto real numbers, characterized by D̂ that appears as an element of the dual vector space of observables Ô.
The Liouville representation of quantum mechanics uses the fact that the mathematical representation of quantum mechanics can be changed, without any incidence on physics, by performing simultaneously in both vector spaces of observables and states a linear change of coordinates that preserves the scalar product $\langle \hat{D}, \hat{O} \rangle$, and thus leaves invariant the expectation values
$$E(\hat{O}) = \langle \hat{D}, \hat{O} \rangle$$ (7.133)

Roger Balian looked for a physically meaningful metric in the space of states, one that depends solely on this space and not on the two dual spaces. He then considers the Entropy S associated with a state D̂ that gathers all the information available on an ensemble E of systems. The Entropy S measures the amount of missing information due to the probabilistic character of predictions.
Von Neumann introduced, in 1932, an extension of the Boltzmann–Gibbs Entropy of classical statistical mechanics:
$$S(\hat{D}) = -\mathrm{Tr}\left[\hat{D} \log(\hat{D})\right]$$ (7.134)

used by Jaynes to assign a state to a quantum system when only partial information
about it is available based on:
• Maximum Entropy Principle: among all possible states that are compatible with
the known expectation values, the one which yields the maximum entropy under
constraints on these data is selected.
that is derived from Laplace’s principle of insufficient reason:
• Identification of expectation values in the Bayesian sense with averages over a
large number of similar systems
• The expectation value of each observable for a given system is equal to its average
over E showing that equi-probability for E implies maximum entropy for each
system.
Then von Neumann’s entropy has been interpreted by Jaynes and Brillouin as:
• Entropy = measure of missing information: the state found by maximizing this
entropy can be regarded as the least biased one, as it contains no more information
than needed to account for the given data.
Different authors have shown that von Neumann Entropy is the only one that
is physically satisfactory, since it is characterized by the properties required for a
measure of disorder:
• Additivity for uncorrelated systems
• Subadditivity for correlated systems (suppressing correlations raises the uncer-
tainty)
• Extensivity for a macroscopic system
• Concavity (putting together two different statistical ensembles for the same system
produces a mixed ensemble which has a larger uncertainty than the average of the
uncertainties of the original ensembles).

These properties are also requested to ensure that the maximum entropy criterion follows from Laplace’s principle, since the latter principle provides an unambiguous construction of von Neumann’s entropy.
Roger Balian then developed the geometry derived from von Neumann’s entropy, using two physically meaningful scalar quantities:
• $\mathrm{Tr}[\hat{D}\hat{O}]$: expectation values across the two dual spaces of observables and states
• $S = -\mathrm{Tr}[\hat{D} \log(\hat{D})]$: entropy within the space of states.
The entropy S could be written as a scalar product $S = -\langle \hat{D}, \log(\hat{D}) \rangle$ where log(D̂) is an element of the space of observables, allowing physical geometric structure in these spaces. The second differential d²S is a negative quadratic form of the coordinates of D̂ that is induced by concavity of von Neumann entropy S. Then, Roger Balian introduced the distance ds between the state D̂ and a neighboring state D̂ + dD̂ as the square root of
$$ds^2 = -d^2 S = \mathrm{Tr}\left[d\hat{D} \cdot d\log\hat{D}\right]$$ (7.135)

where the Riemannian metric tensor is the Hessian of −S(D̂) as function of a set of independent coordinates of D̂.
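
The identity (7.135) can be checked numerically on a small example. The sketch below is an illustration under our own assumptions: a 2 × 2 full-rank density matrix `D` and a traceless Hermitian perturbation `DD` (both arbitrary values chosen for the example), with the differential of the matrix logarithm and the second variation of the entropy both approximated by finite differences. It relies only on NumPy’s Hermitian eigendecomposition.

```python
import numpy as np

def hermitian_function(mat, fun):
    """Apply a scalar function to a Hermitian matrix through its eigendecomposition."""
    w, v = np.linalg.eigh(mat)
    return (v * fun(w)) @ v.conj().T

def von_neumann_entropy(d):
    """S(D) = -Tr D log D, computed from the eigenvalues of D."""
    w = np.linalg.eigvalsh(d)
    return float(-np.sum(w * np.log(w)))

def balian_metric(d, dd, h=1e-5):
    """ds^2 = Tr(dD . dlogD), with dlogD taken as a central difference along dD."""
    dlog = (hermitian_function(d + h * dd, np.log)
            - hermitian_function(d - h * dd, np.log)) / (2.0 * h)
    return float(np.real(np.trace(dd @ dlog)))

def minus_second_variation(d, dd, h=1e-4):
    """-d^2 S along the direction dD, by a second-order central difference."""
    s = von_neumann_entropy
    return -(s(d + h * dd) - 2.0 * s(d) + s(d - h * dd)) / h ** 2

# Example values (assumptions): a positive density matrix and a traceless direction.
D = np.array([[0.7, 0.2], [0.2, 0.3]], dtype=complex)
DD = np.array([[0.1, 0.05 - 0.02j], [0.05 + 0.02j, -0.1]], dtype=complex)

if __name__ == "__main__":
    print(balian_metric(D, DD), minus_second_variation(D, DD))
```

The two printed numbers should agree to several digits and be positive, which reflects the concavity of the von Neumann entropy.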
We recover the algebraic/geometric duality between D̂ and log D̂ through a Legendre transform:
$$S(\hat{D}) = F(\hat{X}) - \langle \hat{D}, \hat{X} \rangle \quad \text{with} \quad S(\hat{D}) = -\mathrm{Tr}\,\hat{D}\log\hat{D} = -\langle \hat{D}, \log\hat{D} \rangle$$ (7.136)

with X̂ and D̂ conjugate variables in the Legendre transform and where:

$$F(\hat{X}) = \log \mathrm{Tr} \exp \hat{X}$$ (7.137)

is an operator generalization of the Massieu potential. F(X̂) characterizes the canonical thermodynamic equilibrium states, to which it reduces for $\hat{X} = -\beta \cdot \hat{H}$ where Ĥ is the Hamiltonian.
$$dF = \mathrm{Tr}\,\hat{D}\,d\hat{X} \quad \text{with} \quad \hat{D} = \frac{\exp \hat{X}}{\mathrm{Tr}\exp \hat{X}}$$ (7.138)

dF gives the partial derivatives of F(X̂) with respect to the coordinates of X̂. D̂ is Hermitian, normalized and positive and can then be interpreted as a density matrix.
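Legendre transformation appears naturally with:
$$S(\hat{D}) = -\mathrm{Tr}\,\hat{D}\log\hat{D} = -\mathrm{Tr}\,\hat{D}\left(\hat{X} - \log \mathrm{Tr}\exp\hat{X}\right) = -\mathrm{Tr}\,\hat{D}\hat{X} + \mathrm{Tr}\,\hat{D} \cdot \log\mathrm{Tr}\exp\hat{X}$$
$$\mathrm{Tr}\,\hat{D} = 1 \Rightarrow S(\hat{D}) = F(\hat{X}) - \langle \hat{D}, \hat{X} \rangle$$ (7.139)

Roger Balian can then define Hessian metric from F, $ds^2 = d^2F$ in the conjugate space X̂:
$$ds^2 = -d^2 S = \mathrm{Tr}\,d\hat{D}\,d\hat{X} = d^2 F$$ (7.140)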

Note that the normalization of D̂ implies $\mathrm{Tr}\,d\hat{D} = 0$ and $\mathrm{Tr}\,d^2\hat{D} = 0$.

For time-dependent problems, the prediction $E(\hat{O}) = \langle O \rangle = \mathrm{Tr}[\hat{D}\hat{O}]$, given at some time, is transformed at later times:
• Schrödinger picture: the observables Ô remain fixed, while the evolution of D̂(t) is governed by the Liouville–von Neumann equation of motion:

$$i\hbar \frac{d\hat{D}}{dt} = \left[\hat{H}, \hat{D}\right]$$ (7.141)

• Heisenberg picture: the density operator D̂ remains fixed while the observables Ô change in time according to Heisenberg’s equation of motion:

$$i\hbar \frac{d\hat{O}}{dt} = \left[\hat{O}, \hat{H}\right]$$ (7.142)
The Liouville–von Neumann equation describes the transfer of information from some observables to other ones generated by a completely known dynamics and can be modified to account for losses of information during a dynamical process.
The von Neumann entropy measures in dimensionless units our uncertainty when D̂ summarizes our statistical knowledge on the system; it is the quantum analogue of the Shannon entropy.
In the classical limit:
• observables Ô are replaced by commuting random variables, which are functions
of the positions and momenta of the N particles.
• density operators D̂ are replaced by probability densities D in the 6N-dimensional
phase space.
• the trace by an integration over this space.
• the evolution of D is governed by the Liouville equation.
At this step, we make a digression observing that the Liouville equation was first introduced by Henri Poincaré in 1906, in a paper entitled “Réflexions sur la théorie cinétique des gaz”, with the following first sentence:
La théorie cinétique des gaz laisse encore subsister bien des points embarassants pour ceux qui sont accoutumés à la rigueur mathématique… L’un des points qui m’embarassaient le plus était le suivant: il s’agit de démontrer que l’entropie va en diminuant, mais le raisonnement de Gibbs semble supposer qu’après avoir fait varier les conditions extérieures on attend que le régime soit établi avant de les faire varier à nouveau. Cette supposition est-elle essentielle, ou en d’autres termes, pourrait-on arriver à des résultats contraire au principe de Carnot en faisant varier les conditions extérieures trop vite pour que le régime permanent ait le temps de s’établir?
(“The kinetic theory of gases still leaves many awkward points for those who are accustomed to mathematical rigour… One of the points that embarrassed me most was the following: one has to show that entropy keeps decreasing, but Gibbs’s reasoning seems to assume that, after having varied the external conditions, one waits for the regime to be established before varying them again. Is this assumption essential, or in other words, could one arrive at results contrary to Carnot’s principle by varying the external conditions too quickly for the permanent regime to have time to be established?”)

In this paper, Poincaré introduced the Liouville equation for the time evolution of the phase-space probability density $p(x_i, t)$: $\frac{\partial p}{\partial t} + \sum_i \frac{\partial (p X_i)}{\partial x_i} = 0$, corresponding to a dynamical system defined by the ordinary differential equations $\frac{dx_i}{dt} = X_i$. For a system obeying the Liouville theorem $\sum_i \frac{\partial X_i}{\partial x_i} = 0$, Poincaré gave the form $\frac{\partial p}{\partial t} + \sum_i X_i \frac{\partial p}{\partial x_i} = 0$, which can also be written as $\frac{\partial p}{\partial t} = \{H, p\} = Lp$ in terms of the Poisson bracket {., .} of the Hamiltonian H with the probability density p, which defines the Liouvillian operator L.
Back to Balian theory, we can notice that usually, only a small set of data are
controlled.
• Ôi will be the observables that are controlled
• Oi their expectation values for the considered set of repeated experiments,
• D̂ a density operator of the system, relying on the sole knowledge of the set

Oi = Tr Ôi D̂ (7.143)

The maximum entropy criterion consists in:


• Selecting, among all the density operators subject to the constraints Oi = Tr Ôi D̂,
the one D̂R , which renders the von Neumann entropy S(D̂) = −Tr D̂ log D̂ maxi-
mum.
Roger Balian said that:
An intuitive justification is the following: for any other D̂ compatible with Oi = Tr Ôi D̂, we
have by construction S(D̂) < S(D̂R ). The choice of D̂R thus ensures that our description
involves no more information than the minimum needed to account for the only available
information Oi = Tr Ôi D̂. The difference S(D̂R ) − S(D̂) measures some extra information
included in D̂, but not in D̂R , and hence, not in the only known expectation values Oi .
Selecting D̂R rather than any other D̂ which satisfies Oi = Tr Ôi D̂ is therefore the least
biased choice, the one that allows the most reasonable predictions drawn from the known
set Oi about other arbitrary quantities.

At this stage, Roger Balian introduced the notion of Relevant Entropy as a generalization of Gibbs entropy. For any given set of relevant observables Ôi, the expectation values Oi of which are known, the maximum of the von Neumann entropy S(D̂) = −Tr D̂ log D̂ under the constraints Oi = Tr Ôi D̂ is found by introducing Lagrangian multipliers, γi associated with each equation Oi = Tr Ôi D̂ and Ω associated with the normalization of D̂:
$$\hat{D}_R = \underset{\hat{D}}{\mathrm{ArgMax}} \left[ S(\hat{D}) + \sum_i \gamma_i \left( O_i - \mathrm{Tr}\,\hat{O}_i\hat{D} \right) \right] \quad \text{such that} \quad \mathrm{Tr}\,\hat{D} = 1$$ (7.144)

Its value S(D̂R) is reached for a density operator of the form:

$$\hat{D}_R = \exp\left[\Omega - \sum_i \gamma_i \hat{O}_i\right]$$ (7.145)

where the multipliers are determined by Tr D̂R Ôi = Oi and Tr D̂R = 1.


This least biased density operator has an exponential form, which generalizes the usual Gibbs distributions, and the concavity of the von Neumann entropy ensures the uniqueness of D̂R.
The equations Tr D̂R Ôi = Oi and Tr D̂R = 1 can be written by introducing a generalized thermodynamic potential Ω(γi), defined as a function of the other multipliers γi by:
$$\Omega(\gamma_i) = -\log \mathrm{Tr} \exp\left[-\sum_i \gamma_i \hat{O}_i\right]$$ (7.146)

where Ω(γi) is identified with a Massieu thermodynamic potential.

The relations between the data Oi and the multipliers γi are then given by:

$$\frac{\partial \Omega(\gamma_i)}{\partial \gamma_i} = O_i$$ (7.147)

The corresponding entropy S(D̂R) = SR(Oi) is a function of the variables Oi (or equivalently γi), as Ω(γi):

$$S_R(O_i) = \sum_i \gamma_i O_i - \Omega(\gamma_i)$$ (7.148)

Roger Balian gave the name of Relevant Entropy to S(D̂R) = SR(Oi), associated with the set Ôi of relevant observables selected in the considered situation, measuring the amount of information which is missing when only the data Oi are available.
The relations $\frac{\partial \Omega(\gamma_i)}{\partial \gamma_i} = O_i$ between the variables γi and the variables Oi can therefore be inverted as:
$$\frac{\partial S_R(O_i)}{\partial O_i} = \gamma_i$$ (7.149)
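
The relations (7.145) to (7.148) can be illustrated on a single qubit. The sketch below is an illustration under our own assumptions, not Balian’s construction: the relevant observables are taken to be the Pauli matrices σx and σz (a hypothetical choice of two non-commuting observables), the multipliers γ are given numbers, and the gradient of Ω is evaluated by finite differences.

```python
import numpy as np

SIGMA_X = np.array([[0, 1], [1, 0]], dtype=complex)
SIGMA_Z = np.array([[1, 0], [0, -1]], dtype=complex)
OBSERVABLES = (SIGMA_X, SIGMA_Z)  # assumed set of relevant observables

def exponent(gammas):
    """-sum_i gamma_i O_i, the operator appearing in (7.145) and (7.146)."""
    return -sum(g * o for g, o in zip(gammas, OBSERVABLES))

def massieu_potential(gammas):
    """Omega(gamma) = -log Tr exp(-sum_i gamma_i O_i), cf. (7.146)."""
    w = np.linalg.eigvalsh(exponent(gammas))
    return float(-np.log(np.sum(np.exp(w))))

def reduced_state(gammas):
    """D_R = exp(Omega - sum_i gamma_i O_i), normalized to unit trace."""
    w, v = np.linalg.eigh(exponent(gammas))
    e = (v * np.exp(w)) @ v.conj().T
    return e / np.real(np.trace(e))

def expectations(gammas):
    """O_i = Tr D_R O_i for each relevant observable."""
    d = reduced_state(gammas)
    return [float(np.real(np.trace(d @ o))) for o in OBSERVABLES]

def relevant_entropy(gammas):
    """S_R = -Tr D_R log D_R."""
    w = np.linalg.eigvalsh(reduced_state(gammas))
    return float(-np.sum(w * np.log(w)))

def gradient(gammas, h=1e-6):
    """Central-difference gradient of Omega with respect to the multipliers."""
    grad = []
    for i in range(len(gammas)):
        up = list(gammas)
        dn = list(gammas)
        up[i] += h
        dn[i] -= h
        grad.append((massieu_potential(up) - massieu_potential(dn)) / (2.0 * h))
    return grad

if __name__ == "__main__":
    g = (0.3, -0.5)
    print(gradient(g), expectations(g))                     # (7.147)
    legendre = sum(gi * oi for gi, oi in zip(g, expectations(g))) - massieu_potential(g)
    print(relevant_entropy(g), legendre)                    # (7.148)
```

Inverting the map from γ to the expectation values (that is, solving for the multipliers given data Oi) would need a root-finding step that is not shown here.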

Roger Balian gave an orthogonal projection interpretation of the maximum entropy solution. In non-equilibrium statistical mechanics, a central problem consists in predicting the values at the time t of some set of variables Oi from the knowledge of their values at the initial time t0. As remarked by Roger Balian, a general solution of this inference problem is provided by the projection method of Nakajima [116] and Zwanzig [117].
The equations of motion of Oi(t) should be generated from:

$$i\hbar\frac{d\hat{D}}{dt} = \left[\hat{H}, \hat{D}\right] \quad \text{(the motion equation)}$$ (7.150)

$$O_i = \mathrm{Tr}\,\hat{O}_i \hat{D}$$ (7.151)

At initialization, we have:
• Oi(t0): the initial conditions are transformed into initial conditions on D̂(t0)
• D̂(t0): the least biased choice given by the maximum entropy criterion:
$$\hat{D}_R = \exp\left[\Omega - \sum_i \gamma_i \hat{O}_i\right] \quad \text{with} \quad \mathrm{Tr}\,\hat{D}_R\hat{O}_i = O_i \quad \text{and} \quad \mathrm{Tr}\,\hat{D}_R = 1$$ (7.152)
i

The von Neumann entropy S(D̂) = −Tr D̂ log D̂ associated with D̂(t) remains constant in time, owing to the unitarity of the evolution $i\hbar\frac{d\hat{D}}{dt} = [\hat{H}, \hat{D}]$ (no information is lost during the evolution), but in general D̂(t) does not keep the exponential form, which involves only the relevant observables Ôi.
The lack of information associated with the knowledge of the variables Oi(t) only is measured by the Relevant Entropy:

$$S_R(O_i) = \sum_i \gamma_i O_i - \Omega(\gamma_i)$$ (7.153)

where the multipliers γi(t) are time-dependent:

$$\frac{\partial \Omega(\gamma_i)}{\partial \gamma_i} = O_i \quad \text{or} \quad \frac{\partial S_R(O_i)}{\partial O_i} = \gamma_i$$ (7.154)

Generally, SR(Oi(t)) > SR(Oi(t0)) because a part of the initial information on the set Ôi has leaked at the time t towards irrelevant variables, due to dissipation in the evolution of the variables Oi(t).
A reduced density operator D̂R(t) can be built at each time with the set of relevant variables Oi(t). As regards these variables, D̂R(t) is equivalent to D̂(t):

$$\mathrm{Tr}\,\hat{D}_R(t)\hat{O}_i = \mathrm{Tr}\,\hat{D}(t)\hat{O}_i = O_i(t)$$ (7.155)

D̂R(t) has the maximum entropy form, retaining no information about the irrelevant variables, and is parameterized by the set of multipliers γi(t) (in one-to-one correspondence with the set of relevant variables Oi(t)).

Fig. 7.9 Projection of D̂(t) onto D̂R(t), which has the maximum entropy form and does not retain any information about the irrelevant variables

Regarding density operators as points in a vector space, the correspondence from D̂(t) to D̂R(t) is interpreted by Roger Balian as a projection onto the manifold R of reduced states (the space of states D̂ being endowed with the previous natural metric $ds^2 = -d^2S = \mathrm{Tr}\left(d\hat{D} \cdot d\log\hat{D}\right)$).
The correspondence $\hat{D} \to \hat{D}_R$ appears as an orthogonal projection [116, 117], where the projection from D̂ to D̂R = PD̂ is implemented by means of the projection superoperator:

$$P = \hat{D}_R \otimes I + \frac{\partial \hat{D}_R}{\partial O_i} \otimes \left(\hat{O}_i - O_i I\right) \tag{7.156}$$

where I is the unit superoperator in the Liouville space. P is defined as a sum of


elementary projections (the tensor product of a state-like vector and an observable-
like vector), lying in the conjugate Liouville spaces. The complementary projection
superoperator Q onto the irrelevant space is defined as:

Q=I −P (7.157)

with the following properties:

$$P^2 = P, \quad Q^2 = Q, \quad PQ = 0 \quad \text{and} \quad IP = I \tag{7.158}$$

The density operator D̂ can thereby be split at each time into its reduced relevant part D̂R = PD̂ and its irrelevant part D̂IR = QD̂ (Fig. 7.9):

$$\hat{D} = P\hat{D} + Q\hat{D} = P\hat{D} + (I - P)\hat{D} = \hat{D}_R + \hat{D}_{IR} \tag{7.159}$$

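In the Liouville representation, the Liouville-von Neumann equation can be written with the Liouvillian superoperator L (a tensor with 2 × 2 indices in Hilbert space):

$$\frac{d\hat{D}}{dt} = L\hat{D} \;\Rightarrow\; \frac{d\hat{D}_{\alpha\beta}}{dt} = \sum_{\gamma\delta} L_{\alpha\beta,\gamma\delta}\,\hat{D}_{\delta\gamma} = \frac{1}{i}\sum_{\gamma\delta}\left(H_{\alpha\delta}\,\delta_{\beta\gamma} - H_{\gamma\beta}\,\delta_{\alpha\delta}\right)\hat{D}_{\delta\gamma} \tag{7.160}$$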
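As an illustration of the Liouville representation, the commutator can be written as an ordinary matrix acting on a flattened density operator. The sketch below assumes NumPy's row-major flattening, for which vec(AXB) = (A ⊗ Bᵀ) vec(X); the function name `liouvillian` and the random test operators are hypothetical.

```python
import numpy as np

def liouvillian(h):
    """Matrix of the map D -> (1/i)[H, D] acting on the row-major flattening of D."""
    eye = np.eye(h.shape[0])
    return (np.kron(h, eye) - np.kron(eye, h.T)) / 1j

rng = np.random.default_rng(0)
a = rng.normal(size=(3, 3)) + 1j * rng.normal(size=(3, 3))
h = (a + a.conj().T) / 2                      # a Hermitian "Hamiltonian"
b = rng.normal(size=(3, 3)) + 1j * rng.normal(size=(3, 3))
d = b @ b.conj().T
d /= np.trace(d).real                         # a density operator: positive, unit trace

lhs = (liouvillian(h) @ d.reshape(-1)).reshape(3, 3)   # superoperator form
rhs = (h @ d - d @ h) / 1j                              # commutator form
```

Both forms should give the same time derivative, and its trace should vanish, so that the normalization Tr D̂ = 1 is preserved along the evolution.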

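The time-dependence of D̂R and D̂IR is given by their time derivatives:

$$\begin{aligned}
\frac{d\hat{D}_R}{dt} &= \frac{dP}{dt}\hat{D}_R + PL\hat{D}_R + PL\hat{D}_{IR} \\
\frac{d\hat{D}_{IR}}{dt} &= \frac{dP}{dt}\hat{D}_{IR} - QL\hat{D}_R - QL\hat{D}_{IR}
\end{aligned} \tag{7.161}$$

Roger Balian solves these equations by introducing a superoperator Green function.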

7.5 Synthesis of Analogy Between Koszul Information Geometry Model, Souriau Statistical Physics Model and Balian Quantum Physics Model

Table 7.1 synthesizes the results of Sect. 7.4: the Koszul Hessian structure of Information Geometry, the Souriau model of Statistical Physics with its general concept of geometric temperature, and the Balian model of Quantum Physics with the notion of Quantum Fisher Metric deduced from the Hessian of the von Neumann entropy. The analogies between the models concern the characteristic function, entropy, Legendre transform, density of probability, dual coordinate system, Hessian metric and Fisher metric.

7.6 Applications of Koszul Theory to Cones for Covariance Matrices of Stationary/Non-stationary Time Series and Stationary Space–Time Series

In this section, we apply the theory developed in the previous sections to define the geometry of cones associated with Symmetric/Hermitian Positive Definite Matrices, which correspond to the case of covariance matrices of stationary signals [35, 36] with very specific matrix structures [77–79]:
• Toeplitz Hermitian Positive Definite Matrix structure (covariance matrix of a stationary time series)
• Toeplitz-Block-Toeplitz Hermitian Positive Definite Matrix structure (covariance matrix of a stationary space–time series).
We will show that the “Toeplitz” structure can be captured by a complex autoregressive model parameterization. This parameterization can be naturally introduced through Trench’s theorem, Verblunsky’s theorem or the Partial Iwasawa decomposition theorem.
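Table 7.1 Synthesis of Koszul, Souriau and Balian models respectively in Hessian geometry, statistical physics and quantum physics

| | Koszul model | Souriau model | Balian model |
|---|---|---|---|
| Characteristic function | $\Omega(x) = -\log \int_{\Phi^*} e^{-\langle \Gamma, x\rangle}\, d\Gamma \quad \forall x \in \Phi$ | $\Omega(\beta) = -\log \int_M e^{-\beta \cdot U(\Gamma)}\, d\omega$ | $\Omega(\gamma_i) = -\log \mathrm{Tr}\, \exp\left[-\sum_i \gamma_i \hat{O}_i\right]$ |
| Entropy | $\Omega^*(x^*) = -\int_{\Phi^*} p_x(\Gamma) \log p_x(\Gamma)\, d\Gamma$ | $s = -\int_M p(\Gamma) \log p(\Gamma)\, d\omega$ | $S(\hat{D}_R) = S_R(O_i)$, $S(\hat{D}_R) = -\mathrm{Tr}\, \hat{D}_R \log \hat{D}_R$ |
| Legendre transform | $\Omega^*(x^*) = \langle x, x^*\rangle - \Omega(x)$ | $s(Q) = \beta \cdot Q - \Omega(\beta)$ | $S_R(O_i) = \sum_i \gamma_i O_i - \Omega(\gamma_i)$ |
| Density of probability | $p_x(\Gamma) = e^{-\langle x, \Gamma\rangle + \Omega(x)} = e^{-\langle \Gamma, x\rangle} \Big/ \int_{\Phi^*} e^{-\langle \Gamma, x\rangle}\, d\Gamma$ | $p(\Gamma) = e^{\Omega(\beta) - \beta \cdot U(\Gamma)}$ | $\hat{D}_R = \exp\left[\Omega - \sum_i \gamma_i \hat{O}_i\right]$ |
| Dual coordinate | $x^* = \int_{\Phi^*} \Gamma \cdot p_x(\Gamma)\, d\Gamma$ | $Q = \int_M U(\Gamma) e^{-\beta \cdot U(\Gamma)}\, d\omega \Big/ \int_M e^{-\beta \cdot U(\Gamma)}\, d\omega$ | $\mathrm{Tr}\, \hat{D}_R(t)\hat{O}_i = \mathrm{Tr}\, \hat{D}(t)\hat{O}_i = O_i(t)$ |
| Dual coordinate system | $x^* = \dfrac{\partial \Omega(x)}{\partial x}$ and $x = \dfrac{\partial \Omega^*(x^*)}{\partial x^*}$ | $Q = \dfrac{\partial \Omega}{\partial \beta}$ and $\beta = \dfrac{\partial s}{\partial Q}$ | $\dfrac{\partial \Omega(\gamma_i)}{\partial \gamma_i} = O_i$ and $\dfrac{\partial S_R(O_i)}{\partial O_i} = \gamma_i$ |
| Hessian metric | $ds^2 = -d^2\Omega^*(x^*)$ | $ds^2 = -d^2 s(Q)$ | $ds^2 = -d^2 S = \mathrm{Tr}\, d\hat{D}\, d\hat{X} = d^2 F$ |
| Fisher metric | $I(x) = -E_\Gamma\left[\dfrac{\partial^2 \log p_x(\Gamma)}{\partial x^2}\right] = -\dfrac{\partial^2 \Omega(x)}{\partial x^2}$ | $I(\beta) = -\dfrac{\partial^2 \Omega(\beta)}{\partial \beta^2}$ | $I(\gamma) = -\left[\dfrac{\partial^2 \Omega}{\partial \gamma_i \partial \gamma_j}\right]_{i,j}$ |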

7.6.1 Koszul Density for Symmetric Positive Definite (SPD) Matrix

(a) Classical approach based on Maximum Entropy


We can construct a probability model for symmetric positive-definite real random matrices using the entropy optimization principle [58]. We will then compare it with the result obtained by Koszul theory. Let x be a random matrix with values in Sym+(n) whose probability distribution p(x) d̃x is defined by a probability density function p(x). This probability density function is such that:

$$\int_{x>0} p(x)\, \tilde{d}x = 1 \tag{7.162}$$

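The available information for the construction of the probability model is:

$$\int_{x>0} p(x)\, \tilde{d}x = 1 \quad \text{with} \quad E(x) = \int_{x>0} x \cdot p(x)\, \tilde{d}x = \bar{x} \tag{7.163}$$

$$\text{and} \quad E(\log \det x) = \int_{x>0} \log(\det x) \cdot p(x)\, \tilde{d}x = v \quad \text{with } |v| < +\infty \tag{7.164}$$

$$\text{where} \quad E\{\log \det x\} = v \text{ with } |v| < +\infty \;\Rightarrow\; E\left\{\left\|x^{-1}\right\|_F^{\gamma}\right\} < +\infty \tag{7.165}$$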

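Probability model using the maximum entropy principle: we construct the probability density function and the characteristic function of the random matrix x using the maximum entropy principle and the available information. With one Lagrange multiplier per constraint of (7.163)–(7.164):

$$\begin{aligned}
L(p) = {} & -\int_{x>0} p(x) \log p(x)\, \tilde{d}x - (\Gamma_0 - 1)\left(\int_{x>0} p(x)\, \tilde{d}x - 1\right) \\
& - \left\langle \Gamma, \int_{x>0} x \cdot p(x)\, \tilde{d}x - \bar{x} \right\rangle - (1 - \lambda)\left(\int_{x>0} \log(\det x) \cdot p(x)\, \tilde{d}x - v\right)
\end{aligned} \tag{7.166}$$

$$p = \underset{p}{\mathrm{Arg\,Max}}\; L(p) \;\Rightarrow\; p(x) = c_0 \cdot (\det x)^{\lambda - 1}\, e^{-\langle \Gamma, x\rangle} \tag{7.167}$$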

In order to perform this calculation, we need results concerning the Siegel integral for a positive-definite symmetric real matrix, given by Siegel in his paper “Über die analytische Theorie der quadratischen Formen”:

$$J(\lambda, \Gamma) = \int_{x>0} e^{-\langle \Gamma, x\rangle}\, [\det x]^{\lambda - 1}\, \tilde{d}x \tag{7.168}$$

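This integral can be computed in closed form:

$$J(\lambda, \Gamma) = (2\pi)^{\frac{M(M-1)}{4}} \left[\prod_{l=1}^{M} \Gamma\!\left(\frac{M - l + 2\lambda}{2}\right)\right] (\det \Gamma)^{-\frac{M - 1 + 2\lambda}{2}} \tag{7.169}$$

where Γ(·) with a scalar argument denotes the Euler gamma function (not the matrix Γ):

$$\Gamma(\lambda) = \int_0^{+\infty} t^{\lambda - 1} e^{-t}\, dt$$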

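The characteristic function of x is given by:

$$\Omega_x(\theta) = E\left[e^{i\langle \theta, x\rangle}\right] = \int_{x>0} e^{i\langle \theta, x\rangle}\, p(x)\, \tilde{d}x \tag{7.170}$$

$$\Omega_x(\theta) = \left[\det\left(I_M - i\,\Gamma^{-1}\theta\right)\right]^{-\frac{M - 1 + 2\lambda}{2}} \tag{7.171}$$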

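Considering the analytic extension of the mapping $w \mapsto J(\lambda, \Gamma - w)$ by writing w = Re(w) + iθ, and taking Re(w) = 0, we deduce that:

$$\Omega_x(\theta) = c_0 \cdot J(\lambda, \Gamma - i\theta) \quad \text{and} \quad \Omega_x(0) = 1 \;\Rightarrow\; c_0 = \frac{1}{J(\lambda, \Gamma)} \tag{7.172}$$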

Finally, the characteristic function of x is given by:

$$\Omega_x(\theta) = J(\lambda, \Gamma - i\theta) \cdot J(\lambda, \Gamma)^{-1} \tag{7.173}$$

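The first and second moments are derived from the characteristic function:

$$E\left[x_{jk}\right] = \frac{-i}{2 - \delta_{jk}} \left.\frac{\partial \Omega_x(\theta)}{\partial \theta_{jk}}\right|_{\theta = 0} \quad \text{and} \quad E\left[x_{jk} x_{j'k'}\right] = \frac{-1}{\left(2 - \delta_{jk}\right)\left(2 - \delta_{j'k'}\right)} \left.\frac{\partial^2 \Omega_x(\theta)}{\partial \theta_{jk}\, \partial \theta_{j'k'}}\right|_{\theta = 0} \tag{7.174}$$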

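Using:

$$\det(I + h) = 1 + \mathrm{tr}(h) + O\left(\|h\|_F^2\right) \quad \text{and} \quad \frac{\partial \det b}{\partial b_{jk}} = (\det b) \cdot \left[b^{-1}\right]_{kj} \tag{7.175}$$

$$E\left[x_{jk}\right] = \frac{M - 1 + 2\lambda}{2} \cdot \left[\Gamma^{-1}\right]_{jk} \tag{7.176}$$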
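The first moment (7.176) can be checked by simulation. The density $c_0 (\det x)^{\lambda-1} e^{-\langle\Gamma, x\rangle}$ has the form of a Wishart density with ν = M − 1 + 2λ degrees of freedom and scale matrix Γ⁻¹/2; the sketch below relies on that identification and on an integer ν, and the numerical values of M, λ and Γ are arbitrary illustrative choices.

```python
import numpy as np

M, lam = 2, 1.0                               # matrix size and exponent parameter (illustrative)
nu = int(M - 1 + 2 * lam)                     # Wishart degrees of freedom (an integer here)
gamma = np.array([[2.0, 0.5], [0.5, 1.0]])    # positive-definite multiplier (illustrative)
sigma = np.linalg.inv(gamma) / 2.0            # Wishart scale, so that sigma^{-1} / 2 = gamma

rng = np.random.default_rng(0)
n_samples = 200_000
# Each sample x is the sum of nu outer products of N(0, sigma) vectors.
g = rng.multivariate_normal(np.zeros(M), sigma, size=(n_samples, nu))
x = np.einsum('sij,sik->sjk', g, g)

empirical_mean = x.mean(axis=0)
predicted_mean = (M - 1 + 2 * lam) / 2.0 * np.linalg.inv(gamma)   # right-hand side of (7.176)
```

Up to Monte Carlo error, `empirical_mean` should match `predicted_mean`.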
$$E\left[x_{jk} x_{j'k'}\right] = E\left[x_{jk}\right] E\left[x_{j'k'}\right] + \frac{1}{M - 1 + 2\lambda}\left\{E\left[x_{j'k}\right] E\left[x_{jk'}\right] + E\left[x_{jj'}\right] E\left[x_{kk'}\right]\right\} \tag{7.177}$$

The first moment derived from the characteristic function gives $\Gamma = \frac{M - 1 + 2\lambda}{2}\, \bar{x}^{-1}$.
Characteristic function and probability density function of the positive-definite random matrix:

$$\Omega_x(\theta) = \left[\det\left(I_M - \frac{2i}{M - 1 + 2\lambda}\, \bar{x}\,\theta\right)\right]^{-\frac{M - 1 + 2\lambda}{2}} \tag{7.178}$$

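We deduce the probability density function:

$$p(x) = c_x \cdot (\det x)^{\lambda - 1} \cdot e^{-\frac{M - 1 + 2\lambda}{2}\left\langle \bar{x}^{-1}, x\right\rangle} = c_x \cdot (\det x)^{\lambda - 1} \cdot e^{-\frac{M - 1 + 2\lambda}{2}\mathrm{Tr}\left(\bar{x}^{-1} x\right)}$$

$$\text{with} \quad c_x = \frac{(2\pi)^{-\frac{M(M-1)}{4}} \left(\frac{M - 1 + 2\lambda}{2}\right)^{\frac{M(M - 1 + 2\lambda)}{2}}}{\left[\prod_{l=1}^{M} \Gamma\!\left(\frac{M - l + 2\lambda}{2}\right)\right] (\det \bar{x})^{\frac{M - 1 + 2\lambda}{2}}} \tag{7.179}$$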
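If we write the Maximum Entropy density with λ = 1 (denoting the random matrix by Γ, its mean by Γ̄ and its size by n):

$$p_x(\Gamma) = \beta_{ME} \cdot \left[\det \bar{\Gamma}^{-1}\right]^{\frac{n+1}{2}} e^{-\frac{n+1}{2}\mathrm{Tr}\left(\bar{\Gamma}^{-1}\Gamma\right)} \quad \text{with} \quad \beta_{ME} = \frac{(2\pi)^{-\frac{n(n-1)}{4}} \left(\frac{n+1}{2}\right)^{\frac{n(n+1)}{2}}}{\prod_{l=1}^{n} \Gamma\!\left(\frac{n - l + 2}{2}\right)} \tag{7.180}$$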

(b) Method based on Koszul Density


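Another approach is to consider Koszul theory directly, with the following definition of the density:

$$p_{\bar{\Gamma}}(\Gamma) = \frac{e^{-\left\langle \Gamma, \Psi^{-1}(\bar{\Gamma})\right\rangle}}{\int_{\Phi^*} e^{-\left\langle \Gamma, \Psi^{-1}(\bar{\Gamma})\right\rangle}\, d\Gamma} \quad \text{with } x = \Psi^{-1}(\bar{\Gamma}) \text{ and } \bar{\Gamma} = \Psi(x) = \frac{d\Omega(x)}{dx} \tag{7.181}$$

$$\text{where} \quad \bar{\Gamma} = \int_{\Phi^*} \Gamma \cdot p_{\bar{\Gamma}}(\Gamma)\, d\Gamma \quad \text{and} \quad \Omega(x) = -\log \int_{\Phi^*} e^{-\langle x, \Gamma\rangle}\, d\Gamma \tag{7.182}$$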

To compute this density, we first have to define the inner product ⟨., .⟩ for our application, considering Symmetric or Hermitian Positive Definite Matrices, which form a reductive homogeneous space. A homogeneous space G/H is said to be reductive if there exists a decomposition g = m + h such that $\mathrm{Ad}_H(m) = \{hAh^{-1} : h \in H, A \in m\} \subset m$, where g and h are the Lie algebras of G and H. Given a bilinear form on m, there are an associated G-invariant metric and G-invariant affine connection on G/H. The natural metric on G/H corresponds to a restriction of the Cartan–Killing form on g (Lie algebra of G). For covariance matrices, g = gl(n) is the Lie algebra of n × n matrices, which can be written as the direct sum gl(n) = m + h, with m the symmetric or Hermitian matrices and h = so(n) the Lie subalgebra of skew-symmetric matrices, or h = u(n) that of skew-Hermitian matrices: $A = \frac{1}{2}\left(A + A^T\right) + \frac{1}{2}\left(A - A^T\right)$. The symmetric matrices m are Ad_O(n)-invariant because, for any symmetric matrix S and orthogonal matrix Q, Ad_Q(S) = QSQ⁻¹ = QSQᵀ, which is symmetric.
As the set of covariance symmetric matrices is related to the quotient space
GL(n)/O(n) with O(n) orthogonal Lie Group (because for any covariance matrix R,
there is an equivalence class R1/2 O(n)), and the set of covariance hermitian matrices
is related to the quotient space GL(n)/U(n) with U(n) Lie Group of Unitary matrices,
covariance matrices admit a GL(n)-invariant metric and connection corresponding to the following Cartan–Killing bilinear form (Elie Cartan introduced this form in his Ph.D.):

$$\langle X, Y\rangle_I = \mathrm{Tr}(XY) \quad \text{at } R = I \tag{7.183}$$

$$g_R(X, Y) = \langle X, Y\rangle_R = \mathrm{Tr}\left(XR^{-1}YR^{-1}\right) \quad \text{at arbitrary } R \tag{7.184}$$

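With this inner product given by the Cartan–Killing bilinear form, we can then develop the expression of the Koszul density:

$$\begin{cases}
\langle x, y\rangle = \mathrm{Tr}(xy), \quad \forall x, y \in \mathrm{Sym}_n(\mathbb{R}) \\
\Delta_\Phi(x) = \int_{\Phi^*} e^{-\langle \Gamma, x\rangle}\, d\Gamma = (\det x)^{-\frac{n+1}{2}}\, \Delta(I_n) \\
\Phi^* = \Phi \quad \text{(self-dual)} \\
x^* = \bar{\Gamma} = -d\log \Delta_\Phi = \frac{n+1}{2}\, d\log\det x = \frac{n+1}{2}\, x^{-1}
\end{cases} \tag{7.185}$$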
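The last relation of (7.185), which gives the dual coordinate as the gradient of −log Δ_Φ, can be checked by finite differences. In the sketch below the additive constant −log Δ(Iₙ) is dropped because it does not affect the derivative; the function names and the random test point are hypothetical.

```python
import numpy as np

def neg_log_char(x):
    """-log Delta(x) up to an additive constant: (n + 1) / 2 * log det x."""
    n = x.shape[0]
    _, logdet = np.linalg.slogdet(x)
    return 0.5 * (n + 1) * logdet

def directional_derivative(f, x, h, eps=1e-6):
    """Central finite difference of f at x along the symmetric direction h."""
    return (f(x + eps * h) - f(x - eps * h)) / (2 * eps)

rng = np.random.default_rng(1)
n = 3
a = rng.normal(size=(n, n))
x = a @ a.T + n * np.eye(n)                   # a symmetric positive-definite point
b = rng.normal(size=(n, n))
h = (b + b.T) / 2                             # a symmetric perturbation

x_star = 0.5 * (n + 1) * np.linalg.inv(x)     # closed form of the dual coordinate
numeric = directional_derivative(neg_log_char, x, h)
closed_form = np.trace(x_star @ h)            # <x*, h> = Tr(x* h)
```

The two numbers `numeric` and `closed_form` should agree to within the finite-difference error.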

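By the Koszul formula, we then have directly, with $\alpha = \frac{n+1}{2}$:

$$p_x(\Gamma) = e^{-\mathrm{Tr}(x\Gamma) + \frac{n+1}{2}\log\det x} = \left[\det\left(\alpha\,\bar{\Gamma}^{-1}\right)\right]^{\alpha} e^{-\mathrm{Tr}\left(\alpha\,\bar{\Gamma}^{-1}\Gamma\right)} \quad \text{with} \quad \bar{\Gamma} = \int_{\Phi^*} \Gamma \cdot p_x(\Gamma)\, d\Gamma \tag{7.186}$$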

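But if we use the normalization constraint:

$$\int_{\Gamma>0} p_x(\Gamma)\, \tilde{d}\Gamma = 1 \;\Rightarrow\; \det(\nu)^{\frac{n+1}{2}} \int_{\Gamma>0} e^{-\mathrm{Tr}(\nu\Gamma)}\, \tilde{d}\Gamma = c_{\bar{\Gamma}}^{-1} \quad \text{with } \nu = \frac{n+1}{2}\,\bar{\Gamma}^{-1}$$

$$c_{\bar{\Gamma}}^{-1} = (2\pi)^{\frac{n(n-1)}{4}} \prod_{l=1}^{n} \Gamma\!\left(\frac{n - l + 2}{2}\right) \tag{7.187}$$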
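This can be written as:

$$p_x(\Gamma) = c_{\bar{\Gamma}} \cdot e^{-\langle x, \Gamma\rangle + \frac{n+1}{2}\log\det x} = c_{\bar{\Gamma}} \cdot \beta_K \left[\det \bar{\Gamma}^{-1}\right]^{\frac{n+1}{2}} e^{-\frac{n+1}{2}\mathrm{Tr}\left(\bar{\Gamma}^{-1}\Gamma\right)} \quad \text{with} \quad \beta_K = \left(\frac{n+1}{2}\right)^{\frac{n(n+1)}{2}} \tag{7.188}$$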

7.7 Geometry of Toeplitz and Toeplitz-Block-Toeplitz Hermitian Positive Definite Matrices and Representation in Product Space R × D^{m−1} × SD^{n−1}

The geometry of Toeplitz and Toeplitz-Block-Toeplitz Hermitian Positive Definite Matrices (THPD and TBTHPD matrices) is addressed in this section. In order to introduce the relevant metric without arbitrariness, we will introduce it jointly through:
• Information Geometry (Fisher metric, which is invariant under all changes of parameterization)
• Cartan–Siegel Homogeneous Bounded Domains geometry [47, 48] (Siegel metric, deduced from Symplectic geometry, which is invariant under all automorphisms of these bounded domains).
Both geometries provide the same metric, with a representation of Toeplitz-Block-Toeplitz Hermitian Positive Definite Matrices in the product space R × D^{m−1} × SD^{n−1} (with D the Poincaré unit disk and SD the Siegel unit disk).
To deduce the metric of Hermitian Positive Definite Matrices with an additional Toeplitz matrix structure, we use the Trench theorem [51], which proves that there is a diffeomorphism between THPD matrices and a specific parameterization deduced from the Toeplitz matrix structure. By analogy, if we consider this THPD matrix as the covariance matrix of a stationary signal, this parameterization can be exactly identified with the Complex Auto-Regressive (CAR) model of this signal. All THPD matrices are then diffeomorphic to $(r_0, \mu_1, \ldots, \mu_n) \in \mathbb{R}^+ \times D^n$ (r0 is a real “scale” parameter; the μk are called reflection/Verblunsky coefficients of the CAR model, lie in the complex unit Poincaré disk D, and are “shape” parameters). This result had been found previously by Samuel Verblunsky in 1936 [52]. We have observed that this CAR parameterization of the THPD matrix can also be interpreted as a Partial Iwasawa decomposition of the matrix in Lie Group Theory. At this step, to introduce the “natural” metric of this model, we jointly use the Burbea/Rao results in Information Geometry and the Koszul results in Hessian geometry, where the metric is given by the Hessian of the Entropy of the CAR model. This metric has all the desired invariance properties and is conformal. To regularize the CAR inverse problem, we have used a “regularized” Burg reflection coefficient that avoids prior selection of the AR model order.
For TBTHPD matrices, we used a matrix extension of the Verblunsky theorem (giving the diffeomorphism of a TBTHPD matrix with $(R_0, M_1, \ldots, M_n) \in \mathrm{Herm}(n)^+ \times SD^n$) and a Matrix-Burg-like algorithm to compute a Matrix CAR model, where the Verblunsky coefficients Mk are no longer in the unit Poincaré disk but in the unit Siegel disk SD.

7.7.1 Metric of Toeplitz Hermitian Positive Definite Covariance Matrices

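Classically, when we consider the covariance matrix of a stationary signal, the correlation coefficients are given by $r_{n,n-k} = r_k = E\left[z_n z_{n-k}^*\right]$ and the covariance matrix Rn has a Toeplitz structure. To take this matrix structure constraint into account, the Partial Iwasawa decomposition should be considered. For a time or space signal, this is equivalent to the Complex AutoRegressive (CAR) model decomposition (see Trench Theorem):

$$\Phi_n = (\alpha_n \cdot R_n)^{-1} = W_n \cdot W_n^+ = \left(1 - |\mu_n|^2\right) \begin{bmatrix} 1 & A_{n-1}^+ \\ A_{n-1} & \Phi_{n-1} + A_{n-1} \cdot A_{n-1}^+ \end{bmatrix} \tag{7.189}$$

$$W_n = \sqrt{1 - |\mu_n|^2} \begin{bmatrix} 1 & 0 \\ A_{n-1} & \Phi_{n-1}^{1/2} \end{bmatrix} \quad \text{with } \Phi_{n-1} = \Phi_{n-1}^{1/2} \cdot \Phi_{n-1}^{1/2+} \tag{7.190}$$

where

$$\alpha_n^{-1} = \left[1 - |\mu_n|^2\right] \cdot \alpha_{n-1}^{-1}, \quad A_n = \begin{bmatrix} A_{n-1} \\ 0 \end{bmatrix} + \mu_n \begin{bmatrix} A_{n-1}^{(-)} \\ 1 \end{bmatrix} \quad \text{and} \quad V^{(-)} = J \cdot V^* \tag{7.191}$$

with J the antidiagonal matrix.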


In the framework of Information Geometry, the Information metric can be introduced as a Kählerian metric whose Kähler potential is given by the Entropy Ω̃(Rn, P0):

$$\tilde{\Omega}(R_n, P_0) = \log\left(\det R_n^{-1}\right) - \log(\pi \cdot e) = -\sum_{k=1}^{n-1} (n - k) \cdot \log\left[1 - |\mu_k|^2\right] - n \cdot \log\left[\pi \cdot e \cdot P_0\right] \tag{7.192}$$

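The Information metric is then given by the Hessian of the Entropy in the Iwasawa parameterization:

$$g_{ij} \equiv \frac{\partial^2 \tilde{\Omega}}{\partial \theta_i^{(n)}\, \partial \theta_j^{(n)*}} \quad \text{where} \quad \theta^{(n)} = \left[P_0\;\; \mu_1\; \cdots\; \mu_{n-1}\right]^T \tag{7.193}$$

with $\{\mu_k\}_{k=1}^{n-1}$ the Regularized Burg reflection coefficients [132] and $P_0 = \alpha_0^{-1}$ the mean signal power. The Kählerian metric (from Information Geometry) is finally:

$$ds_n^2 = d\theta^{(n)+} \left[g_{ij}\right] d\theta^{(n)} = n \cdot \left(\frac{dP_0}{P_0}\right)^2 + \sum_{i=1}^{n-1} (n - i)\, \frac{|d\mu_i|^2}{\left(1 - |\mu_i|^2\right)^2} \tag{7.194}$$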
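The link between a Toeplitz covariance matrix and its parameters (P0, μ1, …, μn−1) can be illustrated numerically. The sketch below does not use the regularized Burg estimator cited above; it swaps in the classical Levinson-Durbin recursion on exact autocorrelations, and then checks the identity log det Rn = n log P0 + Σ (n − k) log(1 − |μk|²), which agrees with (7.192) up to sign and an additive constant. The synthetic autocorrelation sequence and the function name are assumptions of this example.

```python
import numpy as np

def reflection_coefficients(r):
    """Levinson-Durbin recursion on autocorrelations r[0..n-1]; returns (P0, [mu_1..mu_{n-1}])."""
    a = np.array([1.0 + 0j])                  # current prediction polynomial, a[0] = 1
    err = r[0].real                           # prediction error power, starts at P0 = r_0
    mus = []
    for k in range(1, len(r)):
        mu = -np.dot(a, r[k:0:-1]) / err      # reflection (Verblunsky) coefficient
        a = np.append(a, 0) + mu * np.append(0, a[::-1].conj())
        err *= 1.0 - abs(mu) ** 2
        mus.append(mu)
    return r[0].real, np.array(mus)

n = 5
lags = np.arange(n)
# Two complex sinusoids plus white noise give a positive-definite Toeplitz covariance.
r = 1.0 * np.exp(1j * 0.7 * lags) + 0.5 * np.exp(-1j * 1.9 * lags) + 0.3 * (lags == 0)
idx = lags[:, None] - lags[None, :]
R = np.where(idx >= 0, r[np.abs(idx)], r[np.abs(idx)].conj())   # R[i, j] = r_{i-j}

p0, mu = reflection_coefficients(r)
weights = n - np.arange(1, n)                 # (n - k) for k = 1 .. n-1
logdet_from_mu = n * np.log(p0) + np.sum(weights * np.log(1 - np.abs(mu) ** 2))
_, logdet_direct = np.linalg.slogdet(R)
```

All reflection coefficients should lie in the open unit disk, and the two log-determinants should coincide.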

7.7.2 Metric of Toeplitz-Block-Toeplitz Hermitian Positive Definite Covariance Matrices

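The previous approach can be extended to Toeplitz-Block-Toeplitz Hermitian Positive Definite Covariance Matrices (TBTHPD). Based on a “matrix” generalization of the Trench Algorithm, we consider the TBTHPD matrix [82, 83, 86, 87]:

$$R_{p,n+1} = \begin{bmatrix} R_0 & R_1 & \cdots & R_n \\ R_1^+ & R_0 & \ddots & \vdots \\ \vdots & \ddots & \ddots & R_1 \\ R_n^+ & \cdots & R_1^+ & R_0 \end{bmatrix} = \begin{bmatrix} R_{p,n} & \tilde{R}_n \\ \tilde{R}_n^+ & R_0 \end{bmatrix} \tag{7.195}$$

$$\text{with} \quad \tilde{R}_n = V \begin{bmatrix} R_1 \\ \vdots \\ R_n \end{bmatrix}^* \quad \text{and} \quad V = \begin{bmatrix} 0 & \cdots & 0 & J_p \\ \vdots & & J_p & 0 \\ 0 & J_p & & \vdots \\ J_p & 0 & \cdots & 0 \end{bmatrix} \tag{7.196}$$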

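From the Burg-like parameterization, we can deduce this inversion of the Toeplitz-Block-Toeplitz matrix:

$$R_{p,n+1}^{-1} = \begin{bmatrix} \alpha_n & \alpha_n \cdot A_n^+ \\ \alpha_n \cdot A_n & R_{p,n}^{-1} + \alpha_n \cdot A_n \cdot A_n^+ \end{bmatrix} \tag{7.197}$$

and

$$R_{p,n+1} = \begin{bmatrix} \alpha_n^{-1} + A_n^+ \cdot R_{p,n} \cdot A_n & -A_n^+ \cdot R_{p,n} \\ -R_{p,n} \cdot A_n & R_{p,n} \end{bmatrix} \tag{7.198}$$

with

$$\alpha_n^{-1} = \left[1 - A_n^n A_n^{n+}\right] \cdot \alpha_{n-1}^{-1}, \quad \alpha_0^{-1} = R_0$$

and

$$A_n = \begin{bmatrix} A_1^n \\ \vdots \\ A_n^n \end{bmatrix} = \begin{bmatrix} A_{n-1} \\ 0_p \end{bmatrix} + A_n^n \cdot \begin{bmatrix} J_p A_{n-1}^{n-1*} J_p \\ \vdots \\ J_p A_1^{n-1*} J_p \\ I_p \end{bmatrix} \tag{7.199}$$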

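where we have the following Burg-like generalized forward and backward linear prediction:

$$\begin{cases}
\varepsilon_{n+1}^f(k) = \sum_{l=0}^{n+1} A_l^{n+1} Z(k - l) = \varepsilon_n^f(k) + A_{n+1}^{n+1}\, \varepsilon_n^b(k - 1) \\
\varepsilon_{n+1}^b(k) = \sum_{l=0}^{n+1} J A_l^{n+1*} J Z(k - n + l) = \varepsilon_n^b(k - 1) + J A_{n+1}^{n+1*} J\, \varepsilon_n^f(k)
\end{cases}$$

$$\text{with} \quad \varepsilon_0^f(k) = \varepsilon_0^b(k) = Z(k), \quad A_0^{n+1} = I_p$$

$$A_{n+1}^{n+1} = -2 \left[\sum_{k=1}^{N+n} \varepsilon_n^f(k)\, \varepsilon_n^b(k - 1)^+\right] \left[\sum_{k=1}^{N+n} \varepsilon_n^f(k)\, \varepsilon_n^f(k)^+ + \sum_{k=1}^{N+n} \varepsilon_n^b(k)\, \varepsilon_n^b(k)^+\right]^{-1} \tag{7.200}$$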
Using Schwarz’s inequality, it is easy to prove that the Burg-like reflection coefficient matrix $A_{n+1}^{n+1}$ lies in the Siegel disk: $A_{n+1}^{n+1} \in SD_p$.
As we have observed, the metric and geometry of Toeplitz-Block-Toeplitz matrices are thus related to the geometry of the Siegel disk, considered as a Hermitian homogeneous bounded domain [34].
The Siegel disk was introduced by Carl Ludwig Siegel [56, 84] through the Symplectic Group Sp2nR, which is one possible generalization of the group SL2R = Sp2R (group of invertible matrices with determinant 1) to higher dimensions. The generalization goes further: these groups act on a symmetric homogeneous space, the Siegel upper half plane, and this action has quite a few similarities with the action of SL2R on the Poincaré hyperbolic plane. Let F be either the real or the complex field; the Symplectic Group is the group of all matrices M ∈ GL2nF satisfying:

$$Sp(n, F) \equiv \left\{M \in GL(2n, F) \,/\, M^T J M = J\right\}, \quad J = \begin{pmatrix} 0 & I_n \\ -I_n & 0 \end{pmatrix} \in SL(2n, \mathbb{R}) \tag{7.201}$$

or

$$M = \begin{pmatrix} A & B \\ C & D \end{pmatrix} \in Sp(n, F) \;\Leftrightarrow\; A^T C \text{ and } B^T D \text{ symmetric and } A^T D - C^T B = I_n \tag{7.202}$$

The Siegel upper half plane is the set of all complex symmetric n × n matrices with positive definite imaginary part:

$$SH_n = \left\{Z = X + iY \in \mathrm{Sym}(n, \mathbb{C}) \,/\, \mathrm{Im}(Z) = Y > 0\right\} \tag{7.203}$$

The action of the Symplectic Group on the Siegel upper half plane is transitive. The group PSp(n, R) ≡ Sp(n, R)/{±I2n} is the group of biholomorphisms of SHn, acting via generalized Möbius transformations:

$$M = \begin{pmatrix} A & B \\ C & D \end{pmatrix} \;\Rightarrow\; M(Z) = (AZ + B)(CZ + D)^{-1} \tag{7.204}$$

PSp(n, R) acts as a subgroup of isometries. Siegel proved that symplectic transformations are isometries for the Siegel metric in SHn. This metric can be defined on SHn using the distance element at the point Z = X + iY:

$$ds_{Siegel}^2 = \mathrm{Tr}\left(Y^{-1}(dZ)\, Y^{-1}\left(dZ^+\right)\right) \quad \text{with } Z = X + iY \tag{7.205}$$

with associated volume form:

$$\Phi = \mathrm{Tr}\left(Y^{-1} dZ \wedge Y^{-1} dZ^+\right) \tag{7.206}$$

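C. L. Siegel proved that the distance in the Siegel upper half plane is given by:

$$d_{Siegel}^2(Z_1, Z_2) = \sum_{k=1}^{n} \log^2\left(\frac{1 + \sqrt{r_k}}{1 - \sqrt{r_k}}\right) \quad \text{with } Z_1, Z_2 \in SH_n \tag{7.207}$$

where the rk are the eigenvalues of the cross-ratio:

$$R(Z_1, Z_2) = (Z_1 - Z_2)\left(Z_1 - Z_2^+\right)^{-1}\left(Z_1^+ - Z_2^+\right)\left(Z_1^+ - Z_2\right)^{-1} \tag{7.208}$$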
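The distance (7.207) is straightforward to evaluate numerically from the cross-ratio (7.208). The following sketch is an illustration only: the function names are hypothetical, and the invariance check uses one particular family of symplectic maps, Z → AZAᵀ + S with A real invertible and S real symmetric, chosen here for simplicity.

```python
import numpy as np

def siegel_distance(z1, z2):
    """Siegel distance (7.207) from the eigenvalues r_k of the cross-ratio (7.208)."""
    z1h, z2h = z1.conj().T, z2.conj().T
    cross = (z1 - z2) @ np.linalg.inv(z1 - z2h) @ (z1h - z2h) @ np.linalg.inv(z1h - z2)
    r = np.clip(np.linalg.eigvals(cross).real, 0.0, None)   # eigenvalues lie in [0, 1)
    root = np.sqrt(r)
    return float(np.sqrt(np.sum(np.log((1 + root) / (1 - root)) ** 2)))

def random_point(rng, n):
    """Random Z = X + iY with X real symmetric and Y symmetric positive definite."""
    a = rng.normal(size=(n, n))
    b = rng.normal(size=(n, n))
    return (a + a.T) / 2 + 1j * (b @ b.T + np.eye(n))

rng = np.random.default_rng(3)
z1, z2 = random_point(rng, 3), random_point(rng, 3)
a_mat = rng.normal(size=(3, 3)) + 3 * np.eye(3)   # real matrix, assumed invertible
s_mat = rng.normal(size=(3, 3))
s_mat = (s_mat + s_mat.T) / 2                     # real symmetric translation

def move(z):
    """A symplectic action on SH_n of the form Z -> A Z A^T + S."""
    return a_mat @ z @ a_mat.T + s_mat

d_before = siegel_distance(z1, z2)
d_after = siegel_distance(move(z1), move(z2))
# For n = 1 the formula reduces to the Poincaré half-plane distance.
d_scalar = siegel_distance(np.array([[1j]]), np.array([[1j * np.exp(0.8)]]))
```

In this sketch `d_before` and `d_after` should coincide, and `d_scalar` should equal 0.8, the hyperbolic distance between i and i·e^0.8.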

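This is deduced from the second derivative of $Z \to R(Z_1, Z)$ at Z1 = Z, given by:

$$D^2 R = 2\, dZ \left(Z - Z^+\right)^{-1} dZ^+ \left(Z^+ - Z\right)^{-1} = (1/2) \cdot dZ\, Y^{-1}\, dZ^+\, Y^{-1} \tag{7.209}$$

and

$$ds^2 = \mathrm{Tr}\left(Y^{-1} dZ\, Y^{-1} dZ^+\right) = 2 \cdot \mathrm{Tr}\left(D^2 R\right) \tag{7.210}$$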

In parallel, in China in 1945, Hua Lookeng gave the equation of geodesics in the Siegel upper half plane:

$$\frac{d^2 Z}{ds^2} + i\, \frac{dZ}{ds}\, Y^{-1}\, \frac{dZ}{ds} = 0 \tag{7.211}$$

Fig. 7.10 Geometry of Siegel upper half-plane

Using the generalized Cayley transform $W = (Z - iI_n)(Z + iI_n)^{-1}$, the Siegel upper half plane SHn is transformed into the unit Siegel disk $SD_n = \{W \,/\, WW^+ < I_n\}$, where the metric is given by:

$$ds^2 = \mathrm{Tr}\left[\left(I_n - WW^+\right)^{-1} dW \left(I_n - W^+W\right)^{-1} dW^+\right] \tag{7.212}$$
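A quick numerical check of the Cayley transform can make the change of domain concrete. This is a minimal sketch with a hypothetical function name and a randomly generated point of SHn; it verifies that the image is a symmetric matrix W with I − WW⁺ positive definite.

```python
import numpy as np

def cayley(z):
    """Generalized Cayley transform W = (Z - iI)(Z + iI)^{-1}."""
    eye = np.eye(z.shape[0])
    return (z - 1j * eye) @ np.linalg.inv(z + 1j * eye)

rng = np.random.default_rng(2)
n = 3
a = rng.normal(size=(n, n))
b = rng.normal(size=(n, n))
z = (a + a.T) / 2 + 1j * (b @ b.T + np.eye(n))    # a point of SH_n: Im(Z) > 0

w = cayley(z)
gap = np.linalg.eigvalsh(np.eye(n) - w @ w.conj().T)   # all positive iff W W^+ < I
w_center = cayley(1j * np.eye(n))                      # the point iI maps to the origin
```

The eigenvalues in `gap` should all be strictly positive, and `w_center` should be the zero matrix.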

The differential equation of the geodesics in the Siegel unit disk is then given by:

$$\frac{d^2 W}{ds^2} + 2\, \frac{dW}{ds}\, W^+ \left(I - WW^+\right)^{-1} \frac{dW}{ds} = 0 \tag{7.213}$$

The contour of the Siegel disk is called its Shilov boundary, $\partial SD_n = \{W \,/\, WW^+ - I_n = 0_n\}$. We can also define horospheres. Let $U \in \partial SD_n$ and $k \in \mathbb{R}^{*+}$; the following set is called a horosphere in the Siegel disk (Fig. 7.10):

$$H(k, U) = \left\{Z \,/\, 0 < k\left(I - Z^+Z\right) - \left(I - Z^+U\right)\left(I - U^+Z\right)\right\} = \left\{Z \,/\, \left\|Z - \frac{1}{k+1}\, U\right\| < \frac{k}{k+1}\right\} \tag{7.214}$$
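Hua Lookeng and Siegel proved that the previous positive definite quadratic differential is invariant under the group of automorphisms of the Siegel disk. Considering $M = \begin{pmatrix} A & B \\ C & D \end{pmatrix}$ such that $M^* \begin{pmatrix} I_n & 0 \\ 0 & -I_n \end{pmatrix} M = \begin{pmatrix} I_n & 0 \\ 0 & -I_n \end{pmatrix}$:

$$\begin{aligned}
& V = M(W) = (AW + B)(CW + D)^{-1} \\
& \Rightarrow \left(I_n - VV^+\right)^{-1} dV \left(I_n - V^+V\right)^{-1} dV^+ \\
& \qquad = \left(BW^+ + A\right)\left(I_n - WW^+\right)^{-1} dW \left(I_n - W^+W\right)^{-1} dW^+ \left(BW^+ + A\right)^{-1} \\
& \Rightarrow ds_V^2 = ds_W^2
\end{aligned} \tag{7.215}$$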
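To study the geometry of the Siegel disk further, we now need to define the automorphisms of the Siegel disk SDn. They are all given by:

$$\forall \psi \in \mathrm{Aut}(SD_n),\; \exists U \in U(n, \mathbb{C}) \,/\, \psi(Z) = U\, \Omega_{Z_0}(Z)\, U^t \tag{7.216}$$

$$\text{with} \quad \Sigma = \Omega_{Z_0}(Z) = \left(I - Z_0 Z_0^+\right)^{-1/2} (Z - Z_0) \left(I - Z_0^+ Z\right)^{-1} \left(I - Z_0^+ Z_0\right)^{1/2} \tag{7.217}$$

and its inverse:

$$\begin{aligned}
& G = \left(I - Z_0 Z_0^+\right)^{1/2} \Sigma \left(I - Z_0^+ Z_0\right)^{-1/2} = (Z - Z_0)\left(I - Z_0^+ Z\right)^{-1} \\
& \Rightarrow Z = \Omega_{Z_0}^{-1}(\Sigma) = \left(G Z_0^+ + I\right)^{-1} (G + Z_0) \quad \text{with } G = \left(I - Z_0 Z_0^+\right)^{1/2} \Sigma \left(I - Z_0^+ Z_0\right)^{-1/2}
\end{aligned} \tag{7.218}$$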

By analogy with the Poincaré unit disk, C. L. Siegel deduced the geodesic distance in SDn:

$$\forall Z, W \in SD_n, \quad d(Z, W) = \frac{1}{2} \log\left(\frac{1 + \left|\Omega_Z(W)\right|}{1 - \left|\Omega_Z(W)\right|}\right) \tag{7.219}$$

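The Information metric is then introduced as the Hessian of a Kähler potential given by the multi-channel entropy Ω̃(Rp,n):

$$\tilde{\Omega}\left(R_{p,n}\right) = \log\left(\det R_{p,n}^{-1}\right) + \mathrm{cste} = -\mathrm{Tr}\left(\log R_{p,n}\right) + \mathrm{cste} \;\Rightarrow\; g_{i\bar{j}} = \mathrm{Hess}\left[\tilde{\Omega}\left(R_{p,n}\right)\right] \tag{7.220}$$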

Using the partitioned matrix structure of the Toeplitz-Block-Toeplitz matrix Rp,n+1, recursively parameterized by the Burg-like reflection coefficient matrices $\{A_k^k\}_{k=1}^{n-1}$ with $A_k^k \in SD_n$, we can give the multivariate entropy, a matrix extension of the previous Entropy:

$$\tilde{\Omega}\left(R_{p,n}\right) = -\sum_{k=1}^{n-1} (n - k) \cdot \log\det\left[I - A_k^k A_k^{k+}\right] - n \cdot \log\left[\pi \cdot e \cdot \det R_0\right] \tag{7.221}$$

Paul Malliavin proved that this form is a Kähler potential of an invariant Kähler metric (the Information Geometry metric in our case). The metric is given by matrix extension as the Hessian of this entropic potential, $ds^2 = d^2\tilde{\Omega}(R_{p,n})$:

$$ds^2 = n\, \mathrm{Tr}\left[\left(R_0^{-1} dR_0\right)^2\right] + \sum_{k=1}^{n-1} (n - k)\, \mathrm{Tr}\left[\left(I - A_k^k A_k^{k+}\right)^{-1} dA_k^k \left(I - A_k^{k+} A_k^k\right)^{-1} dA_k^{k+}\right] \tag{7.222}$$

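Finally, we have proved that the geometry of the Toeplitz-Block-Toeplitz covariance matrix of a space–time series is given by the following conformal metric:

$$\begin{aligned}
ds^2 = {} & n \cdot m \cdot \left(d\log(P_0)\right)^2 + n \cdot \sum_{i=1}^{m-1} (m - i)\, \frac{|d\mu_i|^2}{\left(1 - |\mu_i|^2\right)^2} \\
& + \sum_{k=1}^{n-1} (n - k)\, \mathrm{Tr}\left[\left(I - A_k^k A_k^{k+}\right)^{-1} dA_k^k \left(I - A_k^{k+} A_k^k\right)^{-1} dA_k^{k+}\right]
\end{aligned} \tag{7.223}$$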

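where the parameters lie in the Poincaré disk and the Siegel disk:

$$\begin{aligned}
& \left(R_0, A_1^1, \ldots, A_{n-1}^{n-1}\right) \in THPD_m \times SD^{n-1} \quad \text{with } SD = \left\{Z \,/\, ZZ^+ < I_m\right\} \\
& R_0 \to \left(\log(P_0), \mu_1, \ldots, \mu_{m-1}\right) \in \mathbb{R} \times D^{m-1} \quad \text{with } D = \left\{z \,/\, zz^* < 1\right\}
\end{aligned} \tag{7.224}$$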

We have finally proved that, for a stationary space–time series, the TBTHPD covariance matrix is coded in R × D^{m−1} × SD^{n−1}.
This result can be intuitively recovered and illustrated in the case m = 2 and n = 1 (Fig. 7.11):

$$R = \begin{bmatrix} h & a - ib \\ a + ib & h \end{bmatrix} > 0, \quad \det R > 0 \;\Leftrightarrow\; h^2 > a^2 + b^2 \tag{7.225}$$

where we can rewrite the matrix as:

$$R = h \cdot \begin{bmatrix} 1 & \mu^* \\ \mu & 1 \end{bmatrix} > 0, \quad h \in \mathbb{R}^{*+}, \quad \mu = \frac{a + ib}{h} \in D = \{z \,/\, |z| < 1\} \tag{7.226}$$

R is then parameterized by $h \in \mathbb{R}^{*+}$ and by μ ∈ D.
We can observe that this parameterization {log(h), μ} ∈ R × D is linked with the Hadamard compactification of R × D described in [131] (Fig. 7.12).

7.7.3 Extension to Non-stationary Signal: Geometry of THPD Matrices Time Series Considered as Paths on Structured Matrix Manifolds

In his book on Creative Evolution, Bergson has written about “Form and Becoming”:
“life is an evolution. We concentrate a period of this evolution in a stable view which

Fig. 7.11 Bounded homogeneous cone associated to a 2 × 2 symmetric positive definite matrix

Fig. 7.12 Hadamard compactification (illustration from “Géométrie des bords: compactifications
différentiables et remplissages holomorphes”, Benoît Kloeckner) [131]

we call a form, and, when the change has become considerable enough to overcome
the fortunate inertia of our perception, we say that the body has changed its form.
But in reality the body is changing form at every moment; or rather, there is no form,
since form is immobile and the reality is movement. What is real is the continual
change of form: form is only a snapshot view of a transition”.
In Sect. 7.7.1, we considered stationary signals, but the approach can be extended to non-stationary signals, defined by a continual change of covariance matrices. We therefore extend the approach of Sect. 7.7.1 to non-stationary time series. Many methods have been explored to model non-stationary time series [134–141]. We propose to extend the previous geometric approach to non-stationary signals corresponding to fast time variation of a time series. We assume that each non-stationary signal in a time series can be split into several stationary signals on a shorter time scale, represented by a time sequence of stationary covariance matrices, that is, a geodesic polygon on the covariance matrix manifold. We adapt the Fréchet distance [125, 126] between two curves in the plane, with a natural extension to more general geodesic curves in the abstract metric spaces used for the covariance matrix manifold.

7.7.3.1 General Definition of Fréchet Surface Metric

A Fréchet surface is an equivalence class of parameterized surfaces in a metric space, where surfaces are considered independently of how they are parameterized [125]. Maurice Fréchet introduced a distance for surfaces.
Let (X, d) be a metric space, M² a compact two-dimensional manifold, f a continuous mapping f : M² → X, called a parameterized surface, and σ : M² → M² a homeomorphism of M² onto itself. Two such surfaces f1 and f2 are equivalent if:

$$\inf_{\sigma}\, \max_{p \in M^2}\, d\left[f_1(p), f_2(\sigma(p))\right] = 0 \tag{7.227}$$

where the infimum is taken over all possible σ. A class f* of parameterized surfaces equivalent to f is called a Fréchet surface. It is a generalization of the notion of a surface in Euclidean space to the case of an arbitrary metric space (X, d).
The Fréchet surface metric between Fréchet surfaces f1* and f2* is:

$$\inf_{\sigma}\, \max_{p \in M^2}\, d\left[f_1(p), f_2(\sigma(p))\right] \tag{7.228}$$

where the infimum is taken over all possible homeomorphisms σ.
Let (X, d) be a metric space. Consider a set F of all continuous mappings f : A → X, g : B → X, …, where A, B, … are subsets of Rⁿ homeomorphic to [0, 1]ⁿ for a fixed dimension n.
The Fréchet semimetric on F is:

$$\inf_{\sigma}\, \sup_{x \in A}\, d\left[f(x), g(\sigma(x))\right] \tag{7.229}$$

where the infimum is taken over all orientation-preserving homeomorphisms σ : A → B. It becomes the Fréchet metric on the set of equivalence classes:

$$f^* = \left\{g \,/\, d_F(g, f) = 0\right\} \tag{7.230}$$

For n = 1 and (X, d) the Euclidean space Rⁿ, this metric is the original Fréchet distance introduced in 1906 between parametric curves f, g : [0, 1] → Rⁿ. For more details on the Fréchet surface metric, see the “Dictionary of Distances” of Elena and Michel Deza [187].

Fig. 7.13 Fréchet distance between two polygonal curves (and indexing all matching of points)

More recently, other distances between open curves have been introduced [195–199] based on diffeomorphism theory, but for our application we will only use the Fréchet distance and its extension to manifolds.

7.7.3.2 Fréchet Distance Between Geodesic Paths on Manifold

In this section, we recall the classical definition of the distance between paths in Euclidean space. Given two curves, we can define a similarity between them. Classically, the Hausdorff distance, that is, the maximum distance between a point on one curve and its nearest neighbor on the other curve, is used, but it does not take into account the flow and orientation of the curves. The Fréchet distance [125, 126] between two curves is defined as the minimum length of a leash required to connect a dog and its owner as they walk without backtracking along their respective curves from one endpoint to the other. The Fréchet metric takes the flow of the two curves into account; the pairs of points whose distance contributes to the Fréchet distance sweep continuously along their respective curves (Fig. 7.13).
Let P and Q be two given curves. The Fréchet distance between P and Q is defined as the infimum over all reparameterizations α and β of [0, 1] of the maximum over all t ∈ [0, 1] of the distance between P(α(t)) and Q(β(t)). In mathematical notation, the Fréchet distance dFréchet(P, Q) is:

$$d_{\text{Fréchet}}(P, Q) = \inf_{\alpha, \beta}\, \max_{t \in [0,1]} \left\{d\left(P(\alpha(t)), Q(\beta(t))\right)\right\}, \quad \alpha, \beta : [0, 1] \to [0, 1] \text{ non-decreasing and surjective} \tag{7.231}$$
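For sampled curves, a common surrogate of (7.231) is the discrete Fréchet (coupling) distance, computed by dynamic programming over pairs of sample points. The sketch below implements that discrete variant, not the continuous algorithm for polygonal curves; the function names are hypothetical, and the point metric `dist` is passed as a parameter so that a geodesic distance on a manifold could be substituted for the Euclidean one.

```python
import numpy as np

def discrete_frechet(p, q, dist):
    """Discrete Fréchet distance between point sequences p and q for a point metric dist."""
    n, m = len(p), len(q)
    ca = np.empty((n, m))                     # ca[i, j]: coupling distance of p[:i+1], q[:j+1]
    for i in range(n):
        for j in range(m):
            d = dist(p[i], q[j])
            if i == 0 and j == 0:
                ca[i, j] = d
            elif i == 0:
                ca[i, j] = max(ca[0, j - 1], d)
            elif j == 0:
                ca[i, j] = max(ca[i - 1, 0], d)
            else:
                # Advance on p, on q, or on both, never moving backwards.
                ca[i, j] = max(min(ca[i - 1, j], ca[i - 1, j - 1], ca[i, j - 1]), d)
    return float(ca[-1, -1])

def euclid(a, b):
    """Euclidean distance between two points given as coordinate tuples."""
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

p = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
q = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
d_parallel = discrete_frechet(p, q, euclid)   # two parallel segments at unit distance
```

For these two parallel polylines the leash never needs to be longer than the unit gap between them, so `d_parallel` equals 1.0.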

Alt and Godau [127] introduced a polynomial-time algorithm to compute the Fréchet distance between two polygonal curves in Euclidean space. For two polygonal curves with m and n segments, the computation time is O(mn log(mn)). Alt and Godau defined the free-space diagram between two curves, for a given distance threshold ε, as the two-dimensional region of the parameter space that consists of all point pairs on the two curves at distance at most ε:

$$D_\varepsilon(P, Q) = \left\{(\alpha, \beta) \in [0, 1]^2 \,/\, d\left(P(\alpha), Q(\beta)\right) \le \varepsilon\right\}$$

Fig. 7.14 Fréchet free-space diagram for two polygonal curves P and Q with monotonicity in both
directions

The Fréchet distance dFréchet(P, Q) is at most ε if and only if the free-space diagram Dε(P, Q) contains a path from the lower left corner to the upper right corner that is monotone in both the horizontal and the vertical direction.
In an n × m free-space diagram, shown in Fig. 7.14, the horizontal and vertical directions of the diagram correspond to the natural parameterizations of P and Q respectively. Therefore, if there is a monotone increasing curve from the lower left to the upper right corner of the diagram (corresponding to a monotone mapping), it generates a monotonic path that defines a matching between the point sets P and Q.

7.7.3.3 Fréchet Distance of Geodesic Paths for THPD Time Series on Structured Matrix Manifolds

We extend the previous distance between paths to the Information Geometry manifold of Toeplitz Hermitian Positive Definite matrices. When the two curves are embedded in a more complex metric space [128], the distance between two points on the curves is most naturally defined as the geodesic length of the shortest path between them. If we consider N subsets of M pulses in the burst, the burst can then be described by a poly-geodesic line on the Information Geometry manifold. The set of N covariance matrices {R(t1), R(t2), …, R(tN)} describes a discrete “polygonal” geodesic path on the Information Geometry manifold, and we can extend the previous Fréchet distance using the geodesic distance previously introduced in Sect. 7.7.2:

$$\begin{cases}
d_{\text{Fréchet}}(R_1, R_2) = \inf_{\alpha, \beta}\, \max_{t \in [0,1]} \left\{d_{geo}\left(R_1(\alpha(t)), R_2(\beta(t))\right)\right\} \\
\text{with } d_{geo}^2\left(R_1(\alpha(t)), R_2(\beta(t))\right) = \left\|\log\left(R_1^{-1/2}(\alpha(t))\, R_2(\beta(t))\, R_1^{-1/2}(\alpha(t))\right)\right\|^2
\end{cases} \tag{7.232}$$
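The point-to-point geodesic distance entering (7.232) can be evaluated from eigenvalues alone. The sketch below assumes the Frobenius norm for ‖·‖ (the norm is not specified in the text) and uses a hypothetical function name; it also checks invariance under congruence R → A R A⁺.

```python
import numpy as np

def geodesic_distance(r1, r2):
    """|| log(R1^{-1/2} R2 R1^{-1/2}) ||_F between Hermitian positive-definite matrices."""
    w, v = np.linalg.eigh(r1)
    inv_sqrt = (v / np.sqrt(w)) @ v.conj().T          # R1^{-1/2}
    lam = np.linalg.eigvalsh(inv_sqrt @ r2 @ inv_sqrt)
    return float(np.sqrt(np.sum(np.log(lam) ** 2)))

def random_hpd(rng, n):
    """Random Hermitian positive-definite matrix."""
    b = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
    return b @ b.conj().T + np.eye(n)

rng = np.random.default_rng(4)
r1, r2 = random_hpd(rng, 3), random_hpd(rng, 3)
a = rng.normal(size=(3, 3)) + 1j * rng.normal(size=(3, 3)) + 3 * np.eye(3)

d12 = geodesic_distance(r1, r2)
d21 = geodesic_distance(r2, r1)
d_congruent = geodesic_distance(a @ r1 @ a.conj().T, a @ r2 @ a.conj().T)
d_scaled = geodesic_distance(np.eye(3), np.exp(0.5) * np.eye(3))   # equals 0.5 * sqrt(3)
```

Such a function could be passed as the point metric of a discrete Fréchet computation over two sequences of covariance matrices.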
Since the classical Fréchet distance, through its Inf[Max] form, does not take into account the close dependence between neighboring points of the time series paths, we propose to define a new distance given by:

$$d_{geo\text{-}path}(R_1, R_2) = \inf_{\alpha, \beta} \left\{\int_0^1 d_{geo}\left(R_1(\alpha(t)), R_2(\beta(t))\right) dt\right\} \tag{7.233}$$

Fig. 7.15 Geodesic path on information geometry manifold where non stationary burst is decomposed on a sequence of stationary covariance matrices on THPD matrix manifold

We then have to compute the geodesic minimal path [133] on the Fréchet free-space diagram. The length of the path is not given by the Euclidean metric ds² = dt² (where $L = \int_L ds$) but by the geodesic metric weighted by the distance d(.,.) of the free-space diagram:

$$L_g = \int_L g \cdot ds = \int_L ds_g \quad \text{with } ds_g = d\left(R_1(\alpha(t)), R_2(\beta(t))\right) \cdot dt \tag{7.234}$$

This optimal shortest path can be computed by the classical “Fast Marching method” (Fig. 7.15).

7.8 New Prospects: From Characteristic Functions to Generating Function and Generating Inner Product

We have shown that “Information Geometry” should be explored as a particular
application domain of Hessian geometry through the work of Jean-Louis Koszul (the
Koszul–Vinberg metric is deduced from the associated characteristic function, whose
main property is to be invariant under all automorphisms of the convex cone).
Should we deduce that the “Eidos”, or the “essence”, of Information Geometry is
limited to the “Koszul Characteristic Function”? This notion does not seem to be the
most general one, and we would like to explore two new directions of research:

Fig. 7.16 Gromov inner product in homogeneous bounded domains and its Shilov boundary

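• Generating Inner Product: In Koszul geometry, we have two convex dual functions
$\Omega(x)$ and $\Omega^*(x^*)$ with dual systems of coordinates $x$ and $x^*$ defined on the dual
cones $\Phi$ and $\Phi^*$:

$$
\Omega(x) = -\log \int_{\Phi^*} e^{-\langle \Gamma, x \rangle}\, d\Gamma \quad \forall x \in \Phi
\qquad\text{and}\qquad
\Omega^*(x^*) = \langle x, x^* \rangle - \Omega(x).
$$

We can then remark that, as soon as an inner product $\langle \cdot, \cdot \rangle$ is defined, we are able to
build the convex characteristic function and its dual by Legendre transform, because
both depend only on the inner product. The dual coordinate is also defined by

$$
x^* = \arg\min \left\{ \Delta_\Phi(y) \,/\, y \in \Phi^*,\ \langle x, y \rangle = n \right\}
= \int_{\Phi^*} \Gamma \cdot e^{-\langle \Gamma, x \rangle}\, d\Gamma \Big/ \int_{\Phi^*} e^{-\langle \Gamma, x \rangle}\, d\Gamma
$$

where $x^*$ is also the center of gravity of the cross section $\{ y \in \Phi^*,\ \langle x, y \rangle = n \}$ of $\Phi^*$
(with the notation $\Omega(x) = -\log \Delta_\Phi(x)$).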
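As a concrete illustration of these formulas, the sketch below evaluates them on the simplest cone, under the assumption that $\Phi = \Phi^*$ is the positive orthant of $\mathbb{R}^n$ with the Euclidean inner product (a choice made here for illustration, not taken from the text). In that case the integrals factor coordinate by coordinate, and one expects $\Omega(x) = \sum_i \log x_i$, $x^*_i = 1/x_i$ and $\langle x, x^* \rangle = n$. The function names and the truncated midpoint quadrature are hypothetical choices.

```python
import numpy as np


def orthant_integrals(x_i, upper=80.0, steps=400_000):
    """Midpoint-rule estimates of the integrals of e^{-x t} and t e^{-x t} on [0, upper].

    The truncation at `upper` is an assumption that is harmless when x_i
    is not too close to zero (the tail decays like e^{-x_i * upper}).
    """
    h = upper / steps
    t = (np.arange(steps) + 0.5) * h
    weights = np.exp(-x_i * t)
    return np.sum(weights) * h, np.sum(t * weights) * h


def characteristic_potential_and_dual(x):
    """Return (Omega(x), x*, Omega*(x*)) for the positive orthant."""
    x = np.asarray(x, dtype=float)
    pairs = [orthant_integrals(x_i) for x_i in x]
    mass = np.array([p[0] for p in pairs])
    moment = np.array([p[1] for p in pairs])
    # Omega(x) = -log of the characteristic integral, which is a product
    # of one-dimensional integrals on the orthant.
    omega = -float(np.sum(np.log(mass)))
    # Centre of gravity: the other coordinates cancel in the ratio.
    x_star = moment / mass
    # Legendre-type dual: Omega*(x*) = <x, x*> - Omega(x).
    omega_star = float(x @ x_star) - omega
    return omega, x_star, omega_star


if __name__ == "__main__":
    point = np.array([0.5, 2.0, 3.0])
    omega, x_star, omega_star = characteristic_potential_and_dual(point)
    print(omega, x_star, point @ x_star, omega_star)
```

The printed inner product $\langle x, x^* \rangle$ should be close to the dimension n, which is the constraint appearing in the cross-section definition of $x^*$.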

It is not possible to define an inner product for any two elements of a Lie algebra,
but a symmetric bilinear form, called the “Cartan–Killing form”, can be introduced.
This form was first introduced by Elie Cartan in 1894 in his PhD thesis. It is
defined through the adjoint endomorphism $Ad_x$ of $\mathfrak{g}$, which is defined for
every element $x$ of $\mathfrak{g}$ with the help of the Lie bracket:

$$
Ad_x(y) = [x, y] \tag{7.235}
$$

The trace of the composition of two such endomorphisms defines a bilinear form,
the Cartan–Killing form:

$$
B(x, y) = \mathrm{Tr}\left( Ad_x\, Ad_y \right) \tag{7.236}
$$

The Cartan–Killing form is symmetric:

$$
B(x, y) = B(y, x) \tag{7.237}
$$

and has the associativity property:

$$
B([x, y], z) = B(x, [y, z]) \tag{7.238}
$$

given by:

$$
\begin{aligned}
B([x, y], z) &= \mathrm{Tr}\left( Ad_{[x,y]}\, Ad_z \right) = \mathrm{Tr}\left( [Ad_x, Ad_y]\, Ad_z \right) \\
&= \mathrm{Tr}\left( Ad_x\, [Ad_y, Ad_z] \right) = B(x, [y, z])
\end{aligned}
$$

Elie Cartan proved that if $\mathfrak{g}$ is a simple Lie algebra (the Killing form is
non-degenerate), then any invariant symmetric bilinear form on $\mathfrak{g}$ is a scalar multiple of
the Cartan–Killing form. The Cartan–Killing form is invariant under automorphisms
$\sigma \in \mathrm{Aut}(\mathfrak{g})$ of the algebra $\mathfrak{g}$:

$$
B(\sigma(x), \sigma(y)) = B(x, y) \tag{7.239}
$$

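To prove this invariance, we have to consider:

$$
\begin{cases}
\sigma[x, y] = [\sigma(x), \sigma(y)] \\
z = \sigma(y)
\end{cases}
\;\Rightarrow\; \sigma\left[x, \sigma^{-1}(z)\right] = [\sigma(x), z]
$$

rewritten

$$
Ad_{\sigma(x)} = \sigma \circ Ad_x \circ \sigma^{-1}
$$

Then

$$
\begin{aligned}
B(\sigma(x), \sigma(y)) &= \mathrm{Tr}\left( Ad_{\sigma(x)}\, Ad_{\sigma(y)} \right) = \mathrm{Tr}\left( \sigma \circ Ad_x\, Ad_y \circ \sigma^{-1} \right) \\
&= \mathrm{Tr}\left( Ad_x\, Ad_y \right) = B(x, y)
\end{aligned}
$$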

A natural G-invariant inner product can then be introduced through the Cartan–Killing
form:

$$
\langle x, y \rangle = -B(x, \theta(y)) \tag{7.240}
$$

where $\theta \in \mathrm{Aut}(\mathfrak{g})$ is a Cartan involution (an involution on $\mathfrak{g}$ is a Lie algebra
automorphism $\theta$ of $\mathfrak{g}$ whose square is equal to the identity).
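Properties (7.237) to (7.240) can be checked numerically on a small example. The sketch below assumes the Lie algebra so(3), identified with $\mathbb{R}^3$ equipped with the cross product as bracket, so that the adjoint endomorphism of $x$ is the usual skew-symmetric "hat" matrix; rotations play the role of the automorphisms $\sigma$, and, because so(3) is compact, the identity map is taken as the Cartan involution $\theta$. These choices and all function names are assumptions made for illustration.

```python
import numpy as np


def hat(x):
    """Matrix of y -> [x, y] = x × y, i.e. the adjoint endomorphism of x on so(3)."""
    return np.array([[0.0, -x[2], x[1]],
                     [x[2], 0.0, -x[0]],
                     [-x[1], x[0], 0.0]])


def bracket(x, y):
    """Lie bracket of so(3) in vector coordinates (cross product)."""
    return np.cross(x, y)


def killing_form(x, y):
    """B(x, y) = Tr(Ad_x Ad_y); equals -2 <x, y>_Euclid on so(3)."""
    return float(np.trace(hat(x) @ hat(y)))


def rotation_z(angle):
    """Rotation about the z axis, an automorphism of (R^3, ×)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def cartan_inner_product(x, y):
    """<x, y> = -B(x, theta(y)) with theta = identity (assumed, compact case)."""
    return -killing_form(x, y)


if __name__ == "__main__":
    x = np.array([1.0, -2.0, 0.5])
    y = np.array([0.3, 0.7, -1.1])
    z = np.array([-0.4, 0.2, 2.0])
    sigma = rotation_z(0.7)
    print("symmetry      :", killing_form(x, y), killing_form(y, x))
    print("associativity :", killing_form(bracket(x, y), z),
          killing_form(x, bracket(y, z)))
    print("invariance    :", killing_form(sigma @ x, sigma @ y), killing_form(x, y))
    print("inner product :", cartan_inner_product(x, x))
```

Each printed pair should agree up to rounding, and the last value should be positive, since the Killing form of so(3) is negative definite.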

As another generalization of the inner product, we can also consider the case of a
CAT(-1)-space (a generalization of simply connected Riemannian manifolds with
sectional curvature at most −1) or of homogeneous symmetric bounded domains,
and then define a “generating” Gromov inner product between three points x, y
and z (relative to x), which is defined through the distance [118]:

$$
\langle y, z \rangle_x = \frac{1}{2} \left( d(x, y) + d(x, z) - d(y, z) \right) \tag{7.241}
$$

with $d(\cdot,\cdot)$ the distance in the CAT(-1)-space. Intuitively, this inner product measures the
distance from x to the geodesic joining y and z.
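A minimal numerical sketch of (7.241), assuming the Poincaré unit disk with curvature −1 as the CAT(-1)-space and points encoded as complex numbers (both choices, and the function names, are illustrative assumptions):

```python
import math


def disk_distance(z, w):
    """Hyperbolic distance (curvature -1) between two points of the open unit disk."""
    numerator = 2.0 * abs(z - w) ** 2
    denominator = (1.0 - abs(z) ** 2) * (1.0 - abs(w) ** 2)
    return math.acosh(1.0 + numerator / denominator)


def gromov_product(y, z, x=0j):
    """Gromov product <y, z>_x = (d(x, y) + d(x, z) - d(y, z)) / 2."""
    return 0.5 * (disk_distance(x, y) + disk_distance(x, z) - disk_distance(y, z))


if __name__ == "__main__":
    # The base point 0 lies on the geodesic from -0.5 to 0.5, so the product vanishes.
    print(gromov_product(0.5, -0.5))
    # Two points in different directions seen from the base point.
    print(gromov_product(0.5, 0.5j))
    # With y = z the product reduces to the distance d(x, y).
    print(gromov_product(0.3 + 0.2j, 0.3 + 0.2j), disk_distance(0j, 0.3 + 0.2j))
```

The first case matches the intuition stated above: when the base point sits on the geodesic between y and z, its distance to that geodesic is zero and so is the product.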
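This inner product can also be defined for points on the Shilov boundary of the
domain through the Busemann distance:

$$
\langle \Gamma, \Gamma' \rangle_x = \frac{1}{2} \left( B_\Gamma(x, p) + B_{\Gamma'}(x, p) \right) \tag{7.242}
$$

independently of $p$, where $B_\Gamma(x, y) = \lim_{t \to +\infty} \left[ |x - r(t)| - |y - r(t)| \right]$ is the horospheric
distance from $x$ to $y$ relative to $\Gamma$, with $r(t)$ a geodesic ray. We have the property
that:

$$
\langle \Gamma, \Gamma' \rangle_x = \lim_{\substack{y \to \Gamma \\ y' \to \Gamma'}} \langle y, y' \rangle_x \tag{7.243}
$$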

We can then define a visual metric on the Shilov boundary by (Fig. 7.16):

$$
d_x(\Gamma, \Gamma') =
\begin{cases}
e^{-\langle \Gamma, \Gamma' \rangle_x} & \text{if } \Gamma \neq \Gamma' \\
0 & \text{otherwise}
\end{cases}
\tag{7.244}
$$
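The limit (7.243) and the visual metric (7.244) can be explored numerically. The sketch below assumes the Poincaré unit disk (curvature −1) with the unit circle as its boundary and base point 0, and approximates boundary points by interior points at radius very close to 1. For this particular model a direct computation gives the limit $e^{-\langle \Gamma, \Gamma' \rangle_0} = \sin(\theta/2)$, where $\theta$ is the angle between the two boundary points; the radius value and function names are illustrative assumptions.

```python
import cmath
import math


def disk_distance(z, w):
    """Hyperbolic distance (curvature -1) between two points of the open unit disk."""
    numerator = 2.0 * abs(z - w) ** 2
    denominator = (1.0 - abs(z) ** 2) * (1.0 - abs(w) ** 2)
    return math.acosh(1.0 + numerator / denominator)


def gromov_product(y, z, x=0j):
    """Gromov product <y, z>_x = (d(x, y) + d(x, z) - d(y, z)) / 2."""
    return 0.5 * (disk_distance(x, y) + disk_distance(x, z) - disk_distance(y, z))


def visual_metric(theta1, theta2, radius=1.0 - 1e-6):
    """Approximate d_0 between the boundary points e^{i theta1} and e^{i theta2}.

    Boundary points are replaced by interior points at the given radius,
    which stands in for the limit y -> Gamma, y' -> Gamma'.
    """
    if theta1 == theta2:
        return 0.0
    y = radius * cmath.exp(1j * theta1)
    z = radius * cmath.exp(1j * theta2)
    return math.exp(-gromov_product(y, z))


if __name__ == "__main__":
    for angle in (math.pi / 6, math.pi / 2, math.pi):
        print(angle, visual_metric(0.0, angle), math.sin(angle / 2.0))
```

The two printed columns should agree closely, with antipodal boundary points at visual distance close to 1.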

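We can then define the characteristic function according to the origin 0:

$$
\Omega(x) = -\log \int_{\Phi^*} e^{-\langle x, \gamma \rangle_0}\, d\gamma
\quad\text{or}\quad
\Omega_\Phi(x) = -\log \int_{\Phi^*} e^{-\frac{1}{2}\left( d(0,x) + d(0,\gamma) - d(x,\gamma) \right)}\, d\gamma
\tag{7.245}
$$

and

$$
\Omega^*(x^*) = \langle x, x^* \rangle_0 - \Omega(x) = \frac{1}{2} \left( d(0, x) + d(0, x^*) - d(x, x^*) \right) - \Omega(x) \tag{7.246}
$$

$$
d(x, x^*) = \left( d(0, x^*) - 2\,\Omega^*(x^*) \right) + \left( d(0, x) - 2\,\Omega(x) \right) \tag{7.247}
$$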
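Relation (7.247) follows from (7.246) by a short rearrangement. Written out as a sketch with the same symbols, the intermediate step is:

```latex
\begin{align*}
2\,\Omega^*(x^*) &= d(0,x) + d(0,x^*) - d(x,x^*) - 2\,\Omega(x) \\
\Longrightarrow\quad
d(x,x^*) &= \bigl( d(0,x^*) - 2\,\Omega^*(x^*) \bigr) + \bigl( d(0,x) - 2\,\Omega(x) \bigr)
\end{align*}
```

In this form the distance between a point and its dual splits into one term depending only on $x$ and one depending only on $x^*$.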

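with the center of gravity

$$
x^* = \int_{\Phi^*} \gamma \cdot e^{-\langle x, \gamma \rangle_0}\, d\gamma \Big/ \int_{\Phi^*} e^{-\langle x, \gamma \rangle_0}\, d\gamma \tag{7.248}
$$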

All these relations also hold on the Shilov boundary:

$$
\Omega(\Gamma) = -\log \int_{\partial\Phi^*} e^{-\langle \Gamma, \Gamma' \rangle_0}\, d\Gamma' = -\log \int_{\partial\Phi^*} d_0(\Gamma, \Gamma') \cdot d\Gamma' \tag{7.249}
$$

where

$$
\int_{\partial\Phi^*} d_0(\Gamma, \Gamma') \cdot d\Gamma' \tag{7.250}
$$

is the functional of the Busemann barycenter on the Shilov boundary $\partial\Phi^*$
(existence and uniqueness of this barycenter were proved by Elie Cartan for
Cartan–Hadamard spaces) [123, 124].
• Generating Functions: The characteristic function is a particular case of a
“generating function”. In mathematics, the first idea of representing Lagrange manifolds
through their generating functions was given by Hörmander [121] and was then
generalized by Claude Viterbo [119, 120, 122]. A generating function for the Lagrange
submanifold $L$ is a function $G$ from $E$ to $\mathbb{R}$, defined on a vector bundle $E$, such
that:

$$
L = \left\{ \left( x, \frac{\partial G}{\partial x} \right) \Big/ \frac{\partial G}{\partial \eta} = 0 \right\} \tag{7.251}
$$

where $\eta$ is in the fibre. We then have to study, in the framework of Information
Geometry, families of generating functions more general than the potential functions
deduced from the characteristic function.
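As a small worked illustration of (7.251), assume a trivial bundle $E = \mathbb{R} \times \mathbb{R}$ with base variable $x$ and a single fibre variable $\eta$, and take the hypothetical generating function $G(x, \eta) = x\eta - \eta^3/3$ (chosen here for illustration; it does not come from the text):

```latex
\begin{align*}
G(x,\eta) &= x\,\eta - \tfrac{1}{3}\,\eta^{3}, \\
\frac{\partial G}{\partial \eta} &= x - \eta^{2} = 0
  \;\Longrightarrow\; x = \eta^{2},
\qquad
\frac{\partial G}{\partial x} = \eta, \\
L &= \bigl\{ (\eta^{2}, \eta) \;:\; \eta \in \mathbb{R} \bigr\}
   = \bigl\{ (x, p) \;:\; x = p^{2} \bigr\}.
\end{align*}
```

The resulting curve is a parabola that folds over $x = 0$, so it is not the graph of the differential of a single potential function of $x$. When there is no fibre variable, $G$ reduces to a potential function and $L$ is simply the graph of its differential, which is the situation covered by potentials deduced from the characteristic function.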

In future work, it could be interesting to explore a new direction related to polarized
surfaces, in the framework developed by Donaldson, Guillemin and Abreu, in which
invariant Kähler metrics correspond to convex functions on the moment polytope of
a toric variety [170, 171, 173–175], building on the precursor work of Atiyah and Bott [172]
on the moment map and on the study of its convexity by Bruguières [181], Condevaux [182],
Delzant [183], Guillemin and Sternberg [184, 185] and Kirwan [186]. Recently, Mikhail
Kapranov has given a thermodynamical interpretation of the moment map for toric
varieties [200].
We hope that this tutorial paper and its new results will help readers to gain a synthetic
view of the different domains of science where the Koszul Hessian metric manifold is the
essence emerging from Information Geometry theory.
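Das ist fast so, als ob ich Bergson wäre
(“It is almost as if I were Bergson”)
Edmund Husserl (1917)
(after reading the first chapter of the PhD thesis of his student Roman Ingarden, devoted
to Bergson's philosophy)
Création signifie, avant tout émotion. C’est alors seulement que l’esprit se sent
ou se croit créateur. Il ne part plus d’une multiplicité d’éléments tout faits pour
aboutir à une unité composite où il y aura un nouvel arrangement de l’ancien. Il
s’est transporté tout d’un coup à quelque chose qui paraît à la fois un et unique, qui
cherchera ensuite à s’étaler tant bien que mal en concepts multiples et communs,
donnés d’avance dans des mots.
(“Creation signifies, above all, emotion. It is only then that the mind feels itself, or
believes itself, to be creative. It no longer starts from a multiplicity of ready-made
elements in order to arrive at a composite unity in which there will be a new
arrangement of the old. It has transported itself at a stroke to something that appears
both one and unique, which will then seek to spread itself out, as best it can, into
multiple and common concepts given in advance in words.”)
Henri Bergson
In “Les deux sources de la morale et de la religion”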

References

1. Koszul, J.L.: Variétés localement plates et convexité. Osaka J. Math. 2, 285–290 (1965)
2. Vey, J.: Sur les automorphismes affines des ouverts convexes saillants. Annali della Scuola
Normale Superiore di Pisa, Classe di Science, 3e série, Tome 24(4), 641–665 (1970)
3. Massieu, F.: Sur les fonctions caractéristiques des divers fluides. C. R. Acad. Sci. 69, 858–862
(1869)
4. Massieu, F.: Addition au précédent Mémoire sur les fonctions caractéristiques. C. R. Acad.
Sci. 69, 1057–1061 (1869)
5. Massieu, F.: Thermodynamique: mémoire sur les fonctions caractéristiques des divers fluides
et sur la théorie des vapeurs, 92 p. Académie des Sciences (1876)
6. Duhem, P.: Sur les équations générales de la thermodynamique. Annales Scientifiques de
l’Ecole Normale Supérieure, 3e série, Tome 8, 231 (1891)
7. Duhem, P.: Commentaire aux principes de la thermodynamique. Première partie, Journal de
Mathématiques pures et appliquées, 4e série, Tome 8, 269 (1892)
8. Duhem, P.: Commentaire aux principes de la thermodynamique—troisième partie. Journal de
Mathématiques pures et appliquées, 4e série, Tome 10, 203 (1894)
9. Duhem, P.: Les théories de la chaleur. Duhem 1992, 351–1 (1895)
10. Laplace, P.S.: Mémoire sur la probabilité des causes sur les évènements. Mémoires de Math-
ématique et de Physique, Tome Sixième (1774)
11. Arnold, V.I., Givental, A.G.: Symplectic geometry. In: Encyclopedia of Mathematical Science,
vol. 4. Springer, New York (translated from Russian) (2001)
12. Fitzpatrick, S.: On the geometric quantization of contact manifolds. http://arxiv.org/abs/0909.2023v3
(2013). Accessed Feb 2013
13. Rajeev, S.G.: Quantization of contact manifolds and thermodynamics. Ann. Phys. 323(3),
768–82 (2008)
14. Gibbs, J.W.: Graphical methods in the thermodynamics of fluids. In: Bumstead, H.A., Van
Name, R.G. (eds.) Scientific Papers of J Willard Gibbs, 2 vols. Dover, New York (1961)
15. Dedecker, P.: A property of differential forms in the calculus of variations. Pac. J. Math. 7(4),
1545–9 (1957)
16. Lepage, T.: Sur les champs géodésiques du calcul des variations. Bull. Acad. Roy. Belg. Cl.
Sci. 27, 716–729, 1036–1046 (1936)
17. Mrugala, R.: On contact and metric structures on thermodynamic spaces. RIMS Kokyuroku
1142, 167–81 (2000)
18. Ingarden R.S., Kossakowski A.: The poisson probability distribution and information ther-
modynamics. Bull. Acad. Pol. Sci. Sér. Sci. Math. Astron. Phys. 19, 83–85 (1971)
19. Ingarden, R.S.: Information geometry in functional spaces of classical and quantum finite
statistical systems. Int. J. Eng. Sci. 19(12), 1609–33 (1981)
20. Ingarden, R.S., Janyszek, H.: On the local Riemannian structure of the state space of classical
information thermodynamics. Tensor, New Ser. 39, 279–85 (1982)
21. Ingarden, R.S., Kawaguchi, M., Sato, Y.: Information geometry of classical thermodynamical
systems. Tensor, New Ser. 39, 267–78 (1982)
22. Ingarden R.S.: Information geometry of thermodynamics. In: Transactions of the Tenth Prague
Conference Czechoslovak Academy of Sciences, vol. 10A–B, pp. 421–428 (1987)
23. Ingarden, R.S.: Information geometry of thermodynamics, information theory, statistical
decision functions, random processes. In: Transactions of the 10th Prague Conference,
Prague/Czechoslovakia 1986, vol. A, pp. 421–428 (1988)
24. Ingarden, R.S., Nakagomi, T.: The second order extension of the Gibbs state. Open Syst. Inf.
Dyn. 1(2), 243–58 (1992)
25. Arnold V.I.: Contact geometry: the geometrical method of Gibbs’s thermodynamics. In: Pro-
ceedings of the Gibbs Symposium, pp. 163–179. American Mathematical Society, Providence,
RI (1990)
26. Cartan, E.: Leçons sur les Invariants Intégraux. Hermann, Paris (1922)

27. Koszul J.L.: Exposés sur les espaces homogènes symétriques. Publicação da Sociedade de
Matematica de São Paulo (1959)
28. Koszul J.L.: Sur la forme hermitienne canonique des espaces homogènes complexes. Can. J.
Math. 7(4), 562–576 (1955)
29. Koszul, J.L.: Lectures on Groups of Transformations. Tata Institute of Fundamental Research,
Bombay (1965)
30. Koszul, J.L.: Domaines bornées homogènes et orbites de groupes de transformations affines.
Bull. Soc. Math. Fr. 89, 515–33 (1961)
31. Koszul, J.L.: Ouverts convexes homogènes des espaces affines. Math. Z. 79, 254–9 (1962)
32. Koszul, J.L.: Déformations des variétés localement plates. Ann. Inst. Fourier 18, 103–14
(1968)
33. Vinberg, E.: Homogeneous convex cones. Trans. Moscow Math. Soc. 12, 340–363 (1963)
34. Vesentini E.: Geometry of Homogeneous Bounded Domains. Springer, Berlin (2011). Reprint
of the 1st Edn. C.I.M.E., Ed. Cremonese, Roma (1968)
35. Barbaresco F.: Information geometry of covariance matrix: Cartan-Siegel homogeneous
bounded domains, Mostow/Berger fibration and Fréchet median. In: Bhatia, R., Nielsen,
F. (eds.) Matrix Information Geometry, pp. 199–256. Springer, New York (2012)
36. Arnaudon M., Barbaresco F., Le, Y.: Riemannian medians and means with applications to
radar signal processing. IEEE J. Sel. Top. Sig. Process. 7(4), 595–604 (2013)
37. Dorfmeister, J.: Inductive construction of homogeneous cones. Trans. Am. Math. Soc. 252,
321–49 (1979)
38. Dorfmeister, J.: Homogeneous siegel domains. Nagoya Math. J. 86, 39–83 (1982)
39. Poincaré, H.: Thermodynamique, Cours de Physique Mathématique. G. Carré, Paris (1892)
40. Poincaré, H.: Calcul des Probabilités. Gauthier-Villars, Paris (1896)
41. Faraut, J., Koranyi, A.: Analysis on Symmetric Cones. The Clarendon Press, New York (1994)
42. Faraut, J., Koranyi, A.: Oxford Mathematical Monographs. Oxford University Press, New
York (1994)
43. Varadhan, S.R.S.: Asymptotic probability and differential equations. Commun. Pure Appl.
Math. 19, 261–86 (1966)
44. Sanov, I.N.: On the probability of large deviations of random magnitudes. Mat. Sb. 42(84),
11–44 (1957)
45. Ellis, R.S.: The Theory of Large Deviations and Applications to Statistical Mechanics. Lecture
Notes for Ecole de Physique Les Houches, France (2009)
46. Touchette, H.: The large deviation approach to statistical mechanics. Phys. Rep. 478(1–3),
1–69 (2009)
47. Cartan, E.: Sur les domaines bornés de l’espace de n variables complexes. Abh. Math. Semin.
Hamburg 1, 116–62 (1935)
48. Lichnerowicz, A.: Espaces homogènes Kähleriens. In: Collection Géométrie Différentielle,
pp. 171–84, Strasbourg (1953)
49. Sasaki T.: A note on characteristic functions and projectively invariant metrics on a bounded
convex domain. Tokyo J. Math. 8(1), 49–79 (1985)
50. Sasaki, T.: Hyperbolic affine hyperspheres. Nagoya Math. J. 77, 107–23 (1980)
51. Trench W.F.: An algorithm for the inversion of finite Toeplitz matrices. J. Soc. Ind. Appl.
Math. 12, 515–522 (1964)
52. Verblunsky, S.: On positive harmonic functions. Proc. London Math. Soc. 38, 125–57 (1935)
53. Verblunsky, S.: On positive harmonic functions. Proc. London Math. Soc. 40, 290–20 (1936)
54. Hauser, R.A., Güler, O.: Self-scaled barrier functions on symmetric cones and their classifi-
cation. Found. Comput. Math. 2(2), 121–43 (2002)
55. Vinberg, E.B.: Structure of the group of automorphisms of a homogeneous convex cone. Tr.
Mosk. Mat. O-va 13, 56–83 (1965)
56. Siegel, C.L.: Über der analytische theorie der quadratischen Formen. Ann. Math. 36, 527–606
(1935)
57. Duan, X., Sun, H., Peng, L.: Riemannian means on special euclidean group and unipotent
matrices group. Sci. World J. 2013, ID 292787 (2013)

58. Soize C.: A nonparametric model of random uncertainties for reduced matrix models in
structural dynamics. Probab. Eng. Mech. 15(3), 277–294 (2000)
59. Bennequin, D.: Dualités de champs et de cordes. Séminaire N. Bourbaki, exp. no. 899, pp.
117–148 (2001–2002)
60. Bennequin D.: Dualité Physique-Géométrie et Arithmétique, Brasilia (2012)
61. Chasles M.: Aperçu historique sur l’origine et le développement des méthodes en géométrie
(1837)
62. Gergonne, J.D.: Polémique mathématique. Réclamation de M. le capitaine Poncelet (extraite
du bulletin universel des annonces et nouvelles scientifiques); avec des notes. Annales de
Gergonne, vol. 18, pp. 125–125. http://www.numdam.org (1827–1828)
63. Poncelet, J.V.: Traité des propriétés projectives des figures (1822)
64. André, Y.: Dualités. Sixième séance, ENS, Mai (2008)
65. Atiyah, M.F.: Duality in mathematics and physics, lecture Riemann’s influence in geometry.
Analysis and Number Theory at the Facultat de Matematiques i Estadıstica of the Universitat
Politecnica de Catalunya (2007)
66. Von Oettingen, A.J.: Harmoniesystem in dualer Entwicklung. Studien zur Theorie der Musik,
Dorpat und Leipzig (1866)
67. Von Oettingen, A.J.: Das duale system der harmonie. In: Annalen der Naturphilosophie, vol.
1 (1902)
68. Von Oettingen, A.J.: Das duale system der harmonie. In: Annalen der Naturphilosophie, vol.
2, pp. 62–75 (1903/1904)
69. Von Oettingen, A.J.: Das duale system der harmonie. In: Annalen der Naturphilosophie, vol.
3, pp. 375–403 (1904)
70. Von Oettingen, A.J.: Das duale system der harmonie. In: Annalen der Naturphilosophie, vol.
4, pp. 241–269 (1905)
71. Von Oettingen, A.J.: Das duale system der harmonie. In: Annalen der Naturphilosophie, vol.
5, pp. 116–152, 301–338, 449–503 (1906)
72. Von Oettingen, A.J.: Das duale Harmoniesystem, Leipzig (1913)
73. Von Oettingen, A.J.: Die Grundlagen der musikwissenschaft und das duale reinistrument. In:
Abhand-lungen der mathematisch-physikalischen Klasse der Königlich Sächsischen Gesell-
schaft der Wissenschaften, vol. 34, pp. S.I–XVI, 155–361 (1917)
74. D’Alembert, J.R.: Éléments de musique, théorique et pratique, suivant les principes de M.
Rameau, Paris (1752)
75. Rameau, J.P.: Traité de l’harmonie Réduite à ses Principes Naturels. Ballard, Paris (1722)
76. Rameau, J.P.: Nouveau système de Musique Théorique. Ballard, Paris (1726)
77. Yang, L.: Médianes de mesures de probabilité dans les variétés riemanniennes et applications
à la détection de cibles radar. Thèse de l’Université de Poitiers, tel-00664188, 2011, Thales
PhD Award (2012)
78. Barbaresco, F.: Algorithme de Burg Régularisé FSDS. Comparaison avec l’algorithme de
Burg MFE, pp. 29–32 GRETSI conference (1995)
79. Barbaresco, F.: Information geometry of covariance matrix. In: Nielsen, F., Bhatia, R. (eds.)
Matrix Information Geometry Book. Springer, Berlin (2012)
80. Émery, M., Mokobodzki, G.: Sur le barycentre d’une probabilité dans une variété. Séminaire
de probabilité Strasbourg 25, 220–233 (1991)
81. Friedrich, T.: Die Fisher-information und symplektische strukturen. Math. Nachr. 153, 273–96
(1991)
82. Bingham N.H.: Szegö’s Theorem and Its Probabilistic Descendants. http://arxiv.org/abs/1108.0368v2
(2012)
83. Landau, H.J.: Maximum entropy and the moment problem. Bull. Am. Math. Soc. 16(1), 47–77
(1987)
84. Siegel, C.L.: Symplectic geometry. Am. J. Math. 65, 1–86 (1943)
85. Libermann, P., Marle, C.M.: Symplectic Geometry and Analytical Mechanics. Reidel, Dor-
drecht (1987)

86. Delsarte, P., Genin, Y.V.: Orthogonal polynomial matrices on the unit circle. IEEE Trans.
Comput. Soc. 25(3), 149–160 (1978)
87. Kanhouche, R.: A modified Burg algorithm equivalent in results to Levinson algorithm.
http://hal.archives-ouvertes.fr/ccsd-00000624
88. Douady, C.J., Earle, C.J.: Conformally natural extension of homeomorphisms of circle. Acta
Math. 157, 23–48 (1986)
89. Balian, R.: A metric for quantum states issued from von Neumann’s entropy. In: Nielsen, F.,
Barbaresco, F. (Eds.) Geometric Science of Information. Lecture Notes in Computer Science,
vol. 8085, pp. 513–518
90. Balian, R.: Incomplete descriptions and relevant entropies. Am. J. Phys. 67, 1078–90 (1989)
91. Balian, R.: Information in statistical physics. Stud. Hist. Philos. Mod. Phys. 36, 323–353
(2005)
92. Allahverdyan, A., Balian, R., Nieuwenhuizen, T.: Understanding quantum measurement from
the solution of dynamical models. Phys. Rep. 525, 1–166 (2013) (ArXiv: 1107, 2138)
93. Balian, R.: From Microphysics to Macrophysics: Methods and Applications of Statistical
Physics, vol. 1–2. Springer (2007)
94. Balian, R., Balazs, N.: Equiprobability, information and entropy in quantum theory. Ann.
Phys. (NY) 179, 97–144 (1987)
95. Balian, R., Alhassid, Y., Reinhardt, H.: Dissipation in many-body systems: a geometric
approach based on information theory. Phys. Rep. 131, 1–146 (1986)
96. Barbaresco F.: Information/contact geometries and Koszul entropy. In: Nielsen, F., Bar-
baresco, F. (eds.) Geometric Science of Information. Lecture Notes in Computer Science,
vol. 8085, pp. 604–611. Springer, Berlin (2013)
97. Souriau J.M.: Définition covariante des équilibres thermodynamiques. Suppl. Nuovo Cimento
1, 203–216 (1966)
98. Souriau, J.M.: Thermodynamique et Géométrie 676, 369–397 (1978)
99. Souriau, J.M.: Géométrie de l’espace de phases. Commun. Math. Phys. 1, 374 (1966)
100. Souriau, J.M.: On geometric mechanics. Discrete Continuous Dyn. Syst. 19(3), 595–607
(2007)
101. Souriau, J.M.: Structure des Systèmes Dynamiques. Dunod, Paris (1970)
102. Souriau, J.M.: Structure of dynamical systems. Progress in Mathematics, vol. 149. Birkhäuser
Boston Inc., Boston. A symplectic view of physics (translated from the French by Cushman-de
Vries, C.H.) (1997)
103. Souriau, J.M.: Thermodynamique relativiste des fluides. Rend. Sem. Mat. Univ. e Politec.
Torino, 35:21–34 (1978), 1976/77
104. Souriau, J.M., Iglesias, P.: Heat cold and geometry. In: Cahen, M., et al. (eds.) Differential
Geometry and Mathematical Physics, pp. 37–68 (1983)
105. Souriau, J.M.: Thermodynamique et géométrie. In: Differential Geometrical Methods in Math-
ematical Physics, vol. 2 (Proceedings of the International Conference, University of Bonn,
Bonn, 1977). Lecture Notes in Mathematics, vol. 676, pp. 369–397. Springer, Berlin (1978)
106. Souriau, J.M.,: Dynamic systems structure (Chap. 16 Convexité, Chap. 17 Mesures, Chap.
18 Etats Statistiques, Chap. 19 Thermodynamique), unpublished technical notes, available in
Souriau archive (document sent by Vallée, C.)
107. Vallée, C.: Lois de comportement des milieux continus dissipatifs compatibles avec la
physique relativiste, thèse, Poitiers University (1978)
108. Iglésias P., Equilibre statistiques et géométrie symplectique en relativité générale. Ann. l’Inst.
Henri Poincaré, Sect. A, Tome 36(3), 257–270 (1982)
109. Iglésias, P.: Essai de thermodynamique rationnelle des milieux continus. Ann. l’Inst. Henri
Poincaré, 34, 1–24 (1981)
110. Vallée, C.: Relativistic thermodynamics of continua. Int. J. Eng. Sci. 19(5), 589–601 (1981)
111. Pavlov, V.P., Sergeev, V.M.: Thermodynamics from the differential geometry standpoint.
Theor. Math. Phys. 157(1), 1484–1490 (2008)
112. Kozlov, V.V.: Heat Equilibrium by Gibbs and Poincaré. RKhD, Moscow (2002)

113. Berezin, F.A.: Lectures on Statistical Physics. Nauka, Moscow (2007) (English trans., World
Scientific, Singapore, 2007)
114. Poincaré, H.: Réflexions sur la théorie cinétique des gaz. J. Phys. Theor. Appl. 5, 369–403
(1906)
115. Carathéodory, C.: Math. Ann. 67, 355–386 (1909)
116. Nakajima, S.: On quantum theory of transport phenomena. Prog. Theor. Phys. 20(6), 948–959
(1958)
117. Zwanzig, R.: Ensemble method in the theory of irreversibility. J. Chem. Phys. 33(5), 1338–
1341 (1960)
118. Bourdon, M.: Structure conforme au bord et flot géodésique d’un CAT(-1)-espace.
L’Enseignement Math. 41, 63–102 (1995)
119. Viterbo, C.: Generating functions, symplectic geometry and applications. In: Proceedings of
the International Congress Mathematics, Zurich (1994)
120. Viterbo, C.: Symplectic topology as the geometry of generating functions. Math. Ann. 292,
685–710 (1992)
121. Hörmander, L.: Fourier integral operators I. Acta Math. 127, 79–183 (1971)
122. Théret, D.: A complete proof of Viterbo’s uniqueness theorem on generating functions. Topol-
ogy Appl. 96, 249–266 (1999)
123. Pansu, P.: Volume, courbure et entropie. Séminaire Bourbaki 823, 83–103 (1996)
124. Besson, G., Courtois, G., Gallot, S.: Entropies et rigidités des espaces localement symétriques
de courbure strictement négative. Geom. Funct. Anal. 5, 731–799 (1995)
125. Fréchet, M.: Sur quelques points du calcul fonctionnel. Rend. Circolo Math. Palermo 22, 1–74
(1906)
126. Fréchet, M.: L’espace des courbes n’est qu’un semi-espace de Banach. General Topology and
Its Relation to Modern Analysis and Algebra, pp. 155–156, Prague (1962)
127. Alt, H., Godau, M.: Computing the Fréchet distance between two polygonal curves. Int. J.
Comput. Geom. Appl. 5, 75–91 (1995)
128. Fréchet, M.R.: Les éléments aléatoires de nature quelconque dans un espace distancié. Ann.
l’Inst. Henri Poincaré 10(4), 215–310 (1948)
129. Marle, C.M.: On mechanical systems with a Lie group as configuration space. In: M. de Gosson
(ed.) Jean Leray ’99 Conference Proceedings: the Karlskrona Conference in the Honor of Jean
Leray, Kluwer, Dordrecht, pp. 183–203 (2003)
130. Marle, C.M.: On Henri Poincaré’s note “Sur une forme nouvelle des équations de la
mécanique”. JGSP 29, 1–38 (2013)
131. Kloeckner, B.: Géométrie des bords: compactifications différentiables et remplissages
holomorphes. Thèse Ecole Normale Supérieure de Lyon. http://tel.archives-ouvertes.fr/tel-00120345
(2006). Accessed Dec 2006
132. Barbaresco, F.: Super Resolution Spectrum Analysis Regularization: Burg, Capon and Ago-
antagonistic Algorithms, EUSIPCO-96, pp. 2005–2008, Trieste (1996)
133. Barbaresco, F.: Computation of most threatening radar trajectories areas and corridors based
on fast-marching and level sets. In: IEEE CISDA Symposium, Paris (2011)
134. Michor, P.W., Mumford, D.: An overview of the Riemannian metrics on spaces of curves
using the Hamiltonnian approach. Appl. Comput. Harm. Anal. 23(1), 74–113 (2007)
135. Chouakria-Douzal, A., Nagabhusha, P.N.: Improved Fréchet distance for time series. In: Data
Sciences and Classification, pp. 13–20. Springer, Berlin (2006)
136. Bauer, M., et al.: Constructing reparametrization invariant metrics on spaces of plane curves.
Preprint http://arxiv.org/abs/1207.5965
137. Fréchet, M.: L’espace dont chaque élément est une courbe n’est qu’un semi-espace de Banach.
Ann. Sci. l’ENS, 3ème série. Tome 78(3), 241–272 (1961)
138. Fréchet, M.: L’espace dont chaque élément est une courbe n’est qu’un semi-espace de Banach
II. Ann.Sci. l’ENS, 3ème série. Tome 80(2), pp. 135–137 (1963)
139. Chazal, F., et al.: Gromov-Hausdorff stable signatures for shapes using persistence. In: Euro-
graphics Symposium on Geometry Processing 2009, Marc Alexa and Michael Kazhdan (Guest
editors), vol. 28, no. 5 (2009)

140. Cagliari, F., Di Fabio B., Landi, C.: The natural pseudo-distance as a quotient pseudo metric,
and applications. Preprint http://amsacta.unibo.it/3499/1/Forum_Submission.pdf
141. Frosini, P., Landi, C.: No embedding of the automorphisms of a topological space into a
compact metric space endows them with a composition that passes to the limit. Appl. Math.
Lett. 24(10), 1654–1657 (2011)
142. Shevchenko, O.: Recursive construction of optimal self-concordant barriers for homogeneous
cones. J. Optim. Theor. Appl. 140(2), 339–354 (2009)
143. Güler, O., Tunçel, L.: Characterization of the barrier parameter of homogeneous convex cones.
Math. Program. 81(1), Ser. A, 55–76 (1998)
144. Rothaus, O.S.: Domains of positivity. Bull. Am. Math. Soc. 64, 85–86 (1958)
145. Nesterov, Y., Nemirovskii, A.: Interior-point polynomial algorithms. In: Convex Program-
ming, SIAM Studies in Applied Mathematics, vol. 13 (1994)
146. Vinberg, E.B.: The theory of homogeneous convex cones. Tr. Mosk. Mat. O-va. 12, 303–358
(1963)
147. Rothaus, O.S.: The construction of homogeneous convex cones. Ann. Math. Ser. 2, 83, 358–
376 (1966)
148. Güler, O.: Barrier functions in interior point methods. Math. Oper. Res. 21(4), 860–885 (1996)
149. Uehlein, F.A.: Eidos and Eidetic Variation in Husserl’s Phenomenology. In: Language and
Schizophrenia, Phenomenology, pp. 88–102. Springer, New York (1992)
150. Bergson, H.: L’évolution créatrice. Les Presses universitaires de France, Paris (1907).
http://classiques.uqac.ca/classiques/bergson_henri/evolution_creatrice/evolution_creatrice.pdf
151. Riquier, C.: Bergson lecteur de Platon: le temps et l’eidos, dans interprétations des idées
platoniciennes dans la philosophie contemporaine (1850–1950), coll. Tradition de la pensée
classique, Paris, Vrin (2011)
152. Worms, F.: Bergson entre Russel et Husserl: un troisième terme? In: Rue Descartes, no. 29,
Sens et phénomène, philosophie analytique et phénoménologie, pp. 79–96, Presses Univer-
sitaires de France, Sept. 2000
153. Worms, F.: Le moment 1900 en philosophie. Presses Universitaires du Septentrion, premier
trimestre, Etudes réunies sous la direction de Frédéric Worms (2004)
154. Worms, F.: Bergson ou Les deux sens de la vie: étude inédite, Paris, Presses universitaires de
France, Quadrige. Essais, débats (2004)
155. Bergson, H., Poincaré, H.: Le matérialisme actuel. Bibliothèque de Philosophie Scientifique,
Paris, Flammarion (1920)
156. de Saxcé G., Vallée C.: Bargmann group, momentum tensor and Galilean invariance of
Clausius-Duhem inequality. Int. J. Eng. Sci. 50, 216–232 (2012)
157. Dubois, F.: Conservation laws invariants for Galileo group. CEMRACS preliminary results.
ESAIM Proc. 10, 233–266 (2001)
158. Moreau, J.J.: Fonctions convexes duales et points proximaux dans un espace hilbertien. C. R.
l’Acad. des Sci. Série A, Tome 255, 2897–2899 (1962)
159. Nielsen, F.: Hypothesis testing, information divergence and computational geometry. In:
GSI’13 Conference, Paris, pp. 241–248 (2013)
160. Vey, J.: Sur une notion d’hyperbolicité des variables localement plates. Faculté des sciences
de l’université de Grenoble, Thèse de troisième cycle de mathématiques pures (1969)
161. Ruelle, D.: Statistical mechanics. In: Rigorous Results (Reprint of the 1989 edition). World
Scientific Publishing Co., Inc, River Edge. Imperial College Press, London (1999)
162. Ruelle, D.: Hasard et Chaos. Editions Odile Jacob, Aout (1991)
163. Shima, H.: Geometry of Hessian Structures. In: Nielsen, F., Barbaresco, F. (eds.) Lecture
Notes in Computer Science, vol. 8085, pp. 37–55. Springer, Berlin (2013)
164. Shima, H.: The Geometry of Hessian Structures. World Scientific, London (2007)
165. Zia, R.K.P., Redish Edward F., McKay Susan, R.: Making Sense of the Legendre Transform
(2009), arXiv:0806.1147, June 2008
166. Fréchet, M.: Sur l’écart de deux courbes et sur les courbes limites. Trans. Am. Math. Soc.
6(4), 435–449 (1905)

167. Taylor, A.E., Dugac, P.: Quatre lettres de Lebesgue à Fréchet. Rev. d’Hist. Sci. Tome 34(2),
149–169 (1981)
168. Jensen, J.L.W.: Sur les fonctions convexes et les inégalités entre les valeurs moyennes. Acta
Math. 30(1), 175–193 (1906)
169. Needham, T.: A visual explanation of Jensen’s inequality. Am. Math. Mon. 8, 768–77 (1993)
170. Donaldson, S.K.: Scalar curvature and stability of toric variety. J. Differ. Geom. 62, 289–349
(2002)
171. Abreu, M.: Kähler geometry of toric varieties and extremal metrics. Int. J. Math. 9, 641–651
(1998)
172. Atiyah, M., Bott, R.: The moment map and equivariant cohomology. Topology 23, 1–28
(1984)
173. Guan, D.: On modified Mabuchi functional and Mabuchi moduli space of kahler metrics on
toric bundles. Math. Res. Lett. 6, 547–555 (1999)
174. Guillemin, V.: Kaehler structures on toric varieties. J. Differ. Geom. 40, 285–309 (1994)
175. Guillemin, V.: Moment maps and combinatorial invariants of Hamiltonian Tn -spaces,
Birkhauser (1994)
176. Crouzeix, J.P.: A relationship between the second derivatives of a convex function and of its
conjugate. Math. Program. 3, 364–365 (1977) (North-Holland)
177. Seeger, A.: Second derivative of a convex function and of its Legendre-Fenchel transformate.
SIAM J. Optim. 2(3), 405–424 (1992)
178. Hiriart-Urruty, J.B.: A new set-valued second-order derivative for convex functions. Mathe-
matics for Optimization, Mathematical Studies, vol. 129. North Holland, Amsterdam (1986)
179. Berezin, F.: Lectures on Statistical Physics (Preprint 157). Max-Plank-Institut für Mathematik,
Bonn (2006)
180. Hill, R., Rice, J.R.: Elastic potentials and the structure of inelastic constitutive laws. SIAM J.
Appl. Math. 25(3), 448–461 (1973)
181. Bruguières, A.: Propriétés de convexité de l’application moment, séminaire N. Bourbaki, exp.
no. 654, pp. 63–87 (1985–1986)
182. Condevaux, M., Dazord, P., Molino, P.: Géométrie du moment. Trav. Sémin. Sud-Rhodanien
Géom. Univ. Lyon 1, 131–160 (1988)
183. Delzant, T.: Hamiltoniens périodiques et images convexes de l’application moment. Bull. Soc.
Math. Fr. 116, 315–339 (1988)
184. Guillemin, V., Sternberg, S.: Convexity properties of the moment mapping. Inv. Math. 67,
491–513 (1982)
185. Guillemin, V., Sternberg, S.: Convexity properties of the moment mapping. Inv. Math. 77,
533–546 (1984)
186. Kirwan, F.: Convexity properties of the moment mapping. Inv. Math. 77, 547–552 (1984)
187. Deza, E., Deza, M.M.: Dictionary of Distances. Elsevier, Amsterdam (2006)
188. Jaynes, E.T.: Information theory and statistical mechanics. Phys. Rev. II 106(4), 620–630
(1957)
189. Jaynes, E.T.: Information theory and statistical mechanics II. Phys. Rev. II 108(2), 171–190
(1957)
190. Jaynes, E.T.: Prior probabilities. IEEE Trans. Syst. Sci. Cybern. 4(3), 227–241 (1968)
191. Amari, S.I., Nagaoka, H.: Methods of Information Geometry (Translation of Mathematical
Monographs), vol. 191. AMS, Oxford University Press, Oxford (2000)
192. Amari, S.I.: Differential Geometrical Methods in Statistics. Lecture Notes in Statistics, vol.
28. Springer, Berlin (1985)
193. Rao, C.R.: Information and the accuracy attainable in the estimation of statistical parameters.
Bull. Calcutta Math. Soc. 37, 81–89 (1945)
194. Chentsov, N.N.: Statistical decision rules and optimal inferences. In: Transactions of Mathe-
matics Monograph, vol. 53. American Mathematical Society, Providence (1982) (Published
in Russian in 1972)
195. Trouvé, A., Younes, L.: Diffeomorphic matching in 1d: designing and minimizing matching
functionals. In: Vernon, D. (ed.) Proceedings of ECCV (2000)
7 Eidetic Reduction of Information Geometry 217

196. Trouvé, A., Younes, L.: On a class of optimal matching problems in 1 dimension. SIAM J.
Control Opt. 39(4), 1112–1135 (2001)
197. Younes, L.: Computable elastic distances between shapes. SIAM J. Appl. Math 58, 565–586
(1998)
198. Younes, L.: Optimal matching between shapes via elastic deformations. Image Vis. Comput.
17, 381–389 (1999)
199. Younes, L., Michor, P.W., Shah, J., Mumford, D.: A metric on shape space with explicit
geodesics. Rend. Lincei Mat. Appl. 9, 25–57 (2008)
200. Kapranov, M.: Thermodynamics and the Moment Map (preprint), arXiv:1108.3472, Aug 2011
Chapter 8
Distances on Spaces of High-Dimensional
Linear Stochastic Processes: A Survey

Bijan Afsari and René Vidal

Abstract In this paper we study the geometrization of certain spaces of stochastic
processes. Our main motivation comes from the problem of pattern recognition in
high-dimensional time-series data (e.g., video sequence classification and clustering).
In the first part of the paper, we provide a rather extensive review of some existing
approaches to defining distances on spaces of stochastic processes. The majority
of these distances are, in one way or another, based on comparing power spectral
densities of the processes. In the second part, we focus on the space of processes
generated by (stochastic) linear dynamical systems (LDSs) of fixed size and order,
for which we recently introduced a class of group action induced distances called
the alignment distances. This space is a natural choice in some pattern recognition
applications and is also of great interest in control theory, where it is often convenient
to represent LDSs in state-space form. In this case the space (more precisely manifold)
of LDSs can be considered as the base space of a principal fiber bundle comprised
of state-space realizations. This is due to a Lie group action symmetry present in the
state-space representation of LDSs. The basic idea behind the alignment distance is
to compare two LDSs by first aligning a pair of their realizations along the respective
fibers. Upon a standardization (or bundle reduction) step this alignment process can
be expressed as a minimization problem over orthogonal matrices, which can be
solved efficiently. The alignment distance differs from most existing distances in
that it is a structural or generative distance, since in some sense it compares how two
processes are generated. We also briefly discuss averaging LDSs using the alignment
distance via minimizing a sum of the squares of distances (namely, the so-called
Fréchet mean).

B. Afsari (B) · R. Vidal


Center for Imaging Science, Johns Hopkins University, Baltimore MD 21218, USA
e-mail: [email protected]
R. Vidal
e-mail: [email protected]

F. Nielsen (ed.), Geometric Theory of Information, 219


Signals and Communication Technology, DOI: 10.1007/978-3-319-05317-2_8,
© Springer International Publishing Switzerland 2014

Keywords Stochastic processes · Pattern recognition · Linear dynamical systems ·
Extrinsic and intrinsic geometries · Principal fiber bundle · Generalized dynamic
factor model · Minimum phase · Spectral factorization · All-pass filter · Hellinger
distance · Itakura-Saito divergence · Fréchet mean

8.1 Introduction and Motivation

Pattern recognition (e.g., classification and clustering) of time-series data is important
in many real world data analysis problems. Early applications include the analysis of
one-dimensional data such as speech and seismic signals (see, e.g., [48] for a review).
More recently, applications in the analysis of video data (e.g., activity recognition
[1]), robotic surgery data (e.g., surgical skill assessment [12]), or biomedical data
(e.g., analysis of multichannel EEG signals) have motivated the development of
statistical techniques for the analysis of high-dimensional (or vectorial) time-series
data.
The problem of pattern recognition for time-series data, in its full generality, needs
tools from the theory of statistics on stochastic processes or function spaces. Thus
it bears relations with the general problem of inference on (infinite dimensional)
spaces of stochastic processes, which requires a quite sophisticated mathematical
theory [30, 59]. However, at the same time, the pattern recognition problem is more
complicated since, in general, it involves not only inference but also learning. Learn-
ing and inference on infinite dimensional spaces obviously can be daunting tasks. In
practice, there have been different grand strategies proposed to deal with this prob-
lem (e.g., see [48] for a review). In certain cases it is reasonable and advantageous
from both theoretical and computational points of view to simplify the problem
by assuming that the observed processes are generated by models from a specific
finite-dimensional class of models. In other words, one could follow a parametric
approach based on modeling the observed time series and then performing statisti-
cal analysis and inference on a finite dimensional space of models (instead of the
space of the observed raw data). In fact, in many real-world instances (e.g., video
sequences [1, 12, 22, 60] or econometrics [7, 20, 24]), one could model the observed
high-dimensional time series with low-order Linear Dynamical Systems (LDSs). In
such instances the mentioned strategy could prove beneficial, e.g., in terms of imple-
mentation (due to significant compression achieved in high dimensions), statistical
inference, and synthesis of time series. For 1-dimensional time-series data the success
of Linear Predictive Coding (i.e., auto-regressive (AR) modeling) and
its derivatives in modeling speech signals is a prime example [26, 49, 58]. These
motivations lead us to state the following prototype problem:
Problem 1 (Statistical analysis on spaces of LDSs) Let $\{y^i\}_{i=1}^{N}$ be a collection of
p-dimensional time series indexed by time t. Assume that each time series
$y^i = \{y^i_t\}_{t=1}^{\infty}$ can be approximately modeled by a (stochastic) LDS $M_i$ of
output-input size (p, m) and order n (see footnote 1) realized as

\[
\begin{aligned}
x^i_t &= A_i x^i_{t-1} + B_i v_t, \\
y^i_t &= C_i x^i_t + D_i v_t,
\end{aligned}
\qquad
(A_i, B_i, C_i, D_i) \in \widetilde{\mathcal{SL}}_{m,n,p} = \mathbb{R}^{n \times n} \times \mathbb{R}^{n \times m} \times \mathbb{R}^{p \times n} \times \mathbb{R}^{p \times m},
\tag{8.1}
\]

where $v_t$ is a common stimulus process (e.g., white Gaussian noise with identity
covariance)2 and where the realization $R_i = (A_i, B_i, C_i, D_i)$ is learnt and assumed
to be known. The problem is to: (1) Choose an appropriate space S of LDSs containing
the learnt models $\{M_i\}_{i=1}^{N}$, (2) geometrize S, i.e., equip it with an appropriate
geometry (e.g., define a distance on S), (3) develop tools (e.g., probability distributions,
averages or means, variance, PCA) to perform statistical analysis (e.g., classification
and clustering) in a computationally efficient manner.

1 Typically in video analysis: $p \approx 1000$–$10000$ and $m, n \approx 10$ (see e.g., [1, 12, 60]).
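To make the recursion in (8.1) concrete, the following is a minimal simulation sketch. The function name `simulate_lds`, the zero initial state, the sizes, and the randomly drawn matrices are illustrative assumptions and do not come from the chapter.

```python
import numpy as np

def simulate_lds(A, B, C, D, T, rng):
    """Simulate (8.1): x_t = A x_{t-1} + B v_t, y_t = C x_t + D v_t, from x_0 = 0 (assumed)."""
    m = B.shape[1]
    x = np.zeros(A.shape[0])
    Y = np.empty((T, C.shape[0]))
    for t in range(T):
        v = rng.standard_normal(m)   # stimulus: white Gaussian noise, identity covariance
        x = A @ x + B @ v            # state update
        Y[t] = C @ x + D @ v         # observation
    return Y

# Hypothetical sizes, much smaller than the video setting quoted in footnote 1.
p, m, n, T = 50, 2, 3, 200
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = 0.9 * Q                          # scaled orthogonal matrix: every eigenvalue has modulus 0.9
B = rng.standard_normal((n, m))
C = rng.standard_normal((p, n))
D = np.zeros((p, m))                 # D = 0, the simplification mentioned in footnote 2
Y = simulate_lds(A, B, C, D, T, rng)
print(Y.shape)
```

With D = 0 every observation lies in the n-dimensional column space of C, which is one way to see why a tall, thin C encodes strong cross-sectional correlation.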
The first question to ask is: why model the processes using the state-space model
(representation) (8.1)? Recall that processes have equivalent ARMA and state-space
representations. Moreover, model (8.1) is quite general and with n large enough it
can approximate a large class of processes. More importantly, state-space represen-
tations (especially in high dimensions) are often more suitable for parameter learning
or system identification. In important practical cases of interest such models con-
veniently yield more parsimonious parametrization than vectorial ARMA models
which suffer from the curse of dimensionality [24]. The curse of dimensionality in
ARMA models stems from the fact that for p-dimensional time series, if p is very
large, the number of parameters of an ARMA model is roughly proportional to $p^2$,
which could be much larger than the number of data samples available, $pT$, where T
is the observation time period (note that the autoregressive coefficient matrices are
very large p × p matrices). However, in many situations encountered in real world
examples, state-space models are more effective in overcoming the curse of dimen-
sionality [20, 24]. The intuitive reason, as already alluded to, is that often (very)
high-dimensional time series can be well approximated as being generated by a low
order but high-dimensional dynamical system (which implies small n despite large
p in the model (8.1)). This can be attributed to the fact that the components of the
observed time series exhibit correlations (cross sectional correlation). Moreover, the
contaminating noises also show correlation across different components (see [20, 24]
for examples of exact and detailed assumptions and conditions to formalize these
intuitive facts). Therefore, overall the number of parameters in the state-space model
is small compared with $p^2$ and this is readily reflected in (or encoded by) the small
size of the dynamics matrix Ai and the thinness of the observation matrix Ci in (8.1).3

2 Note that in a different or more general setting the noise at the output could be a process $w_t$
different from (and independent of) the input noise $v_t$. This does not cause major changes in our
developments. Since the output noise usually represents a perturbation which cannot be modeled,
as far as Problem 1 is concerned, one could usually assume that $D_i = 0$.
3 Note that we are not implying that ARMA models are incapable of modeling such time series.
Rather the issue is that general or unrestricted ARMA models suffer from the curse of dimensionality
in the identification problem, and the parametrization of a restricted class of ARMA models with a
small number of parameters is complicated [20]. However, at the same time, by using state-space
models it is easier to overcome the curse of dimensionality and this approach naturally leads to
simple and effective identification algorithms [20, 22].

Also, in general, state-space models are more convenient for computational purposes
than vectorial ARMA models. For example, in the case of high-dimensional time
series most effective estimation methods are based on state-space domain system
identification rooted in control theory [7, 41, 51]. Nevertheless, it should be noted
that, in general, the identification of multi-input multi-output (MIMO) systems is a
subtle problem (see Sect. 8.4 and e.g., [11, 31, 32]). However, for the case where
p > n, there are efficient system identification algorithms available for finding the
state-space parameters [20, 22].
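The parameter-count argument above can be checked with simple arithmetic. The sketch below counts the raw entries of (A, B, C, D) in (8.1) and compares them with the entries of the p × p coefficient matrices of a vector AR model. Both helper names are hypothetical, and the AR count is a simplifying assumption (it ignores the noise covariance and any MA part).

```python
def lds_param_count(p, m, n):
    """Raw number of entries in (A, B, C, D) of (8.1): n*n + n*m + p*n + p*m."""
    return n * n + n * m + p * n + p * m

def ar_param_count(p, order):
    """Entries of the p x p coefficient matrices of a vector AR model of the given order
    (assumed count: noise covariance and MA terms are ignored)."""
    return order * p * p

p, m, n = 1000, 10, 10           # sizes in the range quoted for video analysis
print(lds_param_count(p, m, n))  # 20200
print(ar_param_count(p, 1))      # 1000000
```

The raw state-space count still overstates the true number of free parameters, since different realizations can represent the same LDS, but it already grows linearly rather than quadratically in p.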
Notice that in Problem 1 we are assuming that all the LDSs have the same order
n (more precisely the minimal order, see Sect. 8.3.3.1). Such an assumption might
seem rather restrictive and a more realistic assumption might be that all systems be
of order not larger than n (see Sect. 8.5.1). Note that since in practice real data can be
only approximately modeled by an LDS of fixed order, if n is not chosen too large,
then gross over-fitting of n is less likely to happen. From a practical point of view,
fixing the order for all systems greatly simplifies implementation.
Moreover, in classification or clustering problems one might need
to combine (e.g., average) such LDSs for the goal of replacing a class of LDSs
with a representative LDS. Ideally one would like to define an average in such a
way that LDSs of the same order have an average of the same order and not higher,
otherwise the problem can become intractable. In fact, most existing approaches tend
to dramatically increase the order of the average LDS, which is certainly undesirable.
Therefore, intuitively, we would like to consider a space S in which the order of the
LDSs is fixed or limited. From a theoretical point of view also this assumption allows
us to work with nicer mathematical spaces namely smooth manifolds (see Sect. 8.4).
Amongst the most widely used classification and clustering algorithms for static
data are the k-nearest neighbor and k-means algorithms, both of which rely
on a notion of distance (in a feature space) [21]. These algorithms enjoy certain
universality properties with respect to the probability distributions of the data; and
hence in many practical situations where one has little prior knowledge about the
nature of the data, they prove to be very effective [21, 35]. In view of this fact,
in this paper we focus on the notion of distance between LDSs and the stochastic
processes they generate. Hence, a natural question is what space we should use and
what type of distance we should define on it. In Problem 1, obviously, the first two
steps (which are the focus of this paper) have significant impacts on the third one.
One has different choices for the space S, as well as, for geometries on that space.
The gamut ranges from an infinite dimensional linear space to a finite dimensional
(non-Euclidean) manifold, and the geometry can be either intrinsic or extrinsic. By
an intrinsic geometry we mean one in which a shortest path between two points in
a space stays in the space, and by an extrinsic geometry we mean one where the
distance between the two points is measured in an ambient space. In the second part
of this paper, we study our recently developed approach, which is somewhere in
between: to design an easy-to-compute extrinsic distance, while keeping the ambient
space not too large.
This paper is organized as follows: In Sect. 8.2, we review some existing
approaches in geometrization of spaces of stochastic processes. In Sect. 8.3, we focus
on processes generated by LDSs of fixed order, and in Sect. 8.4, we study smooth
fiber bundle structures over spaces of LDSs generating such processes. Finally, in
Sect. 8.5, we introduce our class of group action induced distances namely the align-
ment distances. The paper is concluded in Sect. 8.6. To avoid certain technicalities
and just to convey the main ideas the proofs are omitted and will appear elsewhere.
We should stress that the theory of alignment distances on spaces of LDSs is still
under development; however, its basics have appeared in earlier papers [1–3]. This
paper is, for the most part, an extended version of [3].

8.2 A Review of Existing Approaches to Geometrization of Spaces of Stochastic Processes

Since the subject appears in a range of disciplines, this review is necessarily
non-exhaustive. Our emphasis is on the core ideas in defining distances on spaces of
stochastic processes rather than enumerating all such distances. Other sources to
consult may include [9, 10, 25]. In view of Problem 1, our main interest is in the
finite dimensional spaces of LDSs of fixed order and the processes they generate.
However, since such a space can be embedded in the larger infinite dimensional space
of “virtually all processes,” first we consider the latter.

Remark 1 We shall discuss several “distance-like” measures, some of which are
known as “distance” in the literature. We will try to use the term distance exclusively
for a true distance, namely one which is symmetric, positive definite and obeys the
triangle inequality. Due to convention or convenience, we still may use the term distance
for something which is not a true distance, but the context will make this clear. A
distance-like measure is called a divergence if it is only positive definite, and it is called
a pseudo-distance if it is symmetric and obeys the triangle inequality but is only positive
semi-definite (i.e., a zero distance between two processes does not imply that they
are the same). As mentioned above, our review is mainly to show different schools of
thought and theoretical approaches in defining distances. Obviously, when it comes
to comparing these distances and their effectiveness (e.g., in terms of recognition
rate in a pattern recognition problem) ultimately things very much depend on the
specific application at hand. We should mention, though, that for certain 1D spectral
distances there has been some research about their relative discriminative properties;
especially for applications in speech processing, the relation between such distances
and the human auditory perception system has been studied (see e.g., [9, 25, 26,
29, 49, 54]). Perhaps one aspect that one can judge rather comfortably and independently
of the specific problem is the associated computational cost of calculating the
distance and other related calculations (e.g., calculating a notion of average). In that
regard, for Problem 1, when the time-series dimension p is very large (e.g., in video
classification problems) our introduced alignment distance (see Sect. 8.5) is cheaper
to calculate relative to most other distances and also lends itself well to
defining a notion of average [1].

Remark 2 Throughout the paper, unless otherwise stated, by a process we mean
a (real-valued) discrete-time wide-sense (or second order) stationary zero mean
Gaussian regular stochastic process (i.e., one with no deterministic component).
Some of the language used in this paper is borrowed from the statistical signal
processing and control literature for which standard references include [40, 56]. Since
we use the Fourier and z-transforms often and there are some disparities between the
definitions (or notations) in the literature, we review some terminology and establish
some notation. The z-transform of a matrix sequence $\{h_t\}_{-\infty}^{+\infty}$
($h_t \in \mathbb{R}^{p \times m}$) is defined as $H(z) = \sum_{t=-\infty}^{+\infty} h_t z^{-t}$
for z in the complex plane $\mathbb{C}$. By evaluating $H(z)$ on the unit circle in the complex
plane (i.e., by setting $z = e^{i\omega}$, $\omega \in [0, 2\pi]$) we get $H(e^{i\omega})$, the
Fourier transform of $\{h_t\}_{-\infty}^{+\infty}$, which sometimes we denote by $H(\omega)$.
Note that the z-transform of $\{h_{-t}\}_{-\infty}^{+\infty}$ is $H(z^{-1})$ and its Fourier
transform is $H(e^{-i\omega})$, and since we deal with real sequences it is the same as
$\overline{H(e^{i\omega})}$, the complex conjugate of $H(e^{i\omega})$. Also any matrix sequence
$\{h_t\}_{0}^{+\infty}$ defines a (causal) linear filter via the convolution operation
$y_t = h_t * \epsilon_t = \sum_{\tau=0}^{\infty} h_\tau \epsilon_{t-\tau}$ on the m-dimensional
sequence $\epsilon_t$. In this case, we call $H(\omega)$ or $H(z)$ the transfer function of
the filter and $\{h_t\}_{0}^{+\infty}$ the impulse response of the filter. We also say that
$\epsilon_t$ is filtered by H to generate $y_t$. If $H(z)$ is an analytic function of z outside the unit disk in the
complex plane, then the filter is called asymptotically stable. If the transfer function
H (z) is a rational matrix function of z (meaning that each entry of H (z) is a rational
function of z), then the filter has a finite order state-space (LDS) realization in the
form (8.1). The smallest (minimal) order of such an LDS can be determined as the
sum of the orders of the denominator polynomials (in z) in the entries appearing in a
specific representation (factorization) of H (z), known as the Smith-McMillan form
[40]. For a square transfer function this number (known as the McMillan degree) is,
generically, equal to the order of the denominator polynomial in the determinant of
H (z). The roots of these denominators are the eigenvalues of the A matrix in the
minimal state-space realization of H (z) and the system is asymptotically stable if all
these eigenvalues are inside the unit disk in C.
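The relation between a state-space realization, its impulse response, and its transfer function can be checked numerically. The sketch below assumes the recursion of (8.1), for which the impulse response is h_0 = CB + D and h_t = C A^t B for t ≥ 1, so that H(z) = D + C (I − z^{-1} A)^{-1} B. The function names and the small example matrices are illustrative assumptions.

```python
import numpy as np

def is_asymptotically_stable(A):
    """True if all eigenvalues of A lie strictly inside the unit disk."""
    return bool(np.max(np.abs(np.linalg.eigvals(A))) < 1.0)

def impulse_response(A, B, C, D, T):
    """First T terms of the impulse response: h_0 = C B + D, h_t = C A^t B for t >= 1."""
    h = np.empty((T, C.shape[0], B.shape[1]))
    AtB = B.astype(float)            # holds A^t B, starting with t = 0
    for t in range(T):
        h[t] = C @ AtB
        AtB = A @ AtB
    h[0] += D
    return h

def transfer_function(A, B, C, D, omega):
    """H(e^{i omega}) = D + C (I - e^{-i omega} A)^{-1} B."""
    n = A.shape[0]
    return D + C @ np.linalg.solve(np.eye(n) - np.exp(-1j * omega) * A, B)

# Hypothetical system with p = 3 outputs, m = 1 input, order n = 2.
A = np.array([[0.5, 0.1], [0.0, -0.3]])
B = np.array([[1.0], [0.5]])
C = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
D = np.zeros((3, 1))
omega = 0.7
h = impulse_response(A, B, C, D, T=200)
H_series = sum(h[t] * np.exp(-1j * omega * t) for t in range(200))  # truncated Fourier sum
H_closed = transfer_function(A, B, C, D, omega)
print(is_asymptotically_stable(A), np.allclose(H_series, H_closed))
```

Because the eigenvalues of A are inside the unit disk, the impulse response decays geometrically and the truncated Fourier sum agrees with the closed-form expression.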

8.2.1 Geometrizing the Space of Power Spectral Densities

A p-dimensional process $\{y_t\}$ can be identified with its p × p covariance sequence
$C_y(\tau) = E\{y_t y_{t-\tau}^{\top}\}$ ($\tau \in \mathbb{Z}$), where $\top$ denotes matrix transpose and
$E\{\cdot\}$ denotes the expectation operation under the associated probability measure.
Equivalently, the process can be identified by the Fourier (or z) transform of its
covariance sequence, namely the power spectral density (PSD) $P_y(\omega)$, which is a
p × p Hermitian positive semi-definite matrix for every $\omega \in [0, 2\pi]$.4 We denote
the space of all p × p PSD matrices by $\mathcal{P}_p$ and its subspace consisting of elements
that are full-rank for almost every $\omega \in [0, 2\pi]$ by $\mathcal{P}_p^{+}$. Most of the literature
prior to 2000 is devoted to geometrization of $\mathcal{P}_1^{+}$.

4 Strictly speaking, in order to be the PSD matrix of a regular stationary process, a matrix function
on [0, 2π] must satisfy other mild technical conditions (see [62] for details).

Remark 3 It is worth mentioning that the distances we discuss below here are blind
to correlations, meaning that two processes might be correlated but their distance can
be large or they can be uncorrelated but their distance can be zero. For us the starting
point is the identification of a zero-mean (Gaussian) process with its probability
distribution and hence its PSD. Consider the 1D case for convenience. Then in the
Hilbert space geometry a distance between processes $y^1_t$ and $y^2_t$ can be defined
as $E\{(y^1_t - y^2_t)^2\}$, in which case the correlation appears in the distance and a zero
distance means almost surely equal sample paths, whereas in PSD-induced distances
$y_t$ and $-y_t$, which have completely different sample paths, have zero distance. In a
more technical language, the topology induced by the PSD-induced distances on
stochastic processes is coarser than the Hilbert space topology. Hence, perhaps to be
more accurate we should further qualify the distances in this paper by the qualifier
“PSD-induced”. Obviously, the Hilbert space topology may be too restrictive in some
practical applications. Interestingly, in the derivation of the Hellinger distance (see
below) based on the optimal transport principle the issue of correlation shows up
and there optimality is achieved when the two processes are uncorrelated (hence
the distance is computed as if the processes were uncorrelated, see [27, p. 292]
for details). In fact, this idea is also present in our approach (and most of the other
approaches), where in order to compare two LDSs we assume that they are stimulated
with the same input process, meaning uncorrelated input processes with identical
probability distributions (see Sect. 8.3).
The space $\mathcal{P}_p$ is an infinite dimensional cone which also has a convex linear
structure coming from matrix addition and multiplication by nonnegative reals. The
most immediate distance on this space is the standard Euclidean distance:

\[
d_E^2(y^1, y^2) = \int \| P_{y^1}(\omega) - P_{y^2}(\omega) \|^2 \, d\omega, \tag{8.2}
\]

where $\|\cdot\|$ is a matrix norm (e.g., the Frobenius norm $\|\cdot\|_F$). In the 1-dimensional
case (i.e., $\mathcal{P}_1$) one could also define a distance based on the principle of optimal
decoupling or optimal (mass) transport between the probability distributions of the
two processes [27, p. 292]. This approach results in the formula:

\[
d_H^2(y^1, y^2) = \int \left| \sqrt{P_{y^1}(\omega)} - \sqrt{P_{y^2}(\omega)} \right|^2 d\omega. \tag{8.3}
\]

This distance is derived in [28] and is also called the $\bar{d}_2$-distance (see also [27, p. 292]).
In view of the Hellinger distance between probability measures [9], the above
distance, in the literature, is also called the Hellinger distance [23]. Interestingly,
$d_H$ remains valid as the optimal transport-based distance for certain non-Gaussian
processes, as well [27, p. 292]. The extension of the optimal transport-based definition
to higher dimensions is not straightforward. However, note that in $\mathcal{P}_1$, $d_H$ can be
thought of as a square root version of $d_E$. In fact, the square root based definition can
be easily extended to higher dimensions, e.g., in (8.3) one could simply replace the
scalar square roots with the (matrix) Hermitian square roots of $P_{y^i}(\omega)$, i = 1, 2 (at
each frequency ω) and use a matrix norm. Recall that the Hermitian square root of
a Hermitian positive semi-definite matrix Y is the unique Hermitian positive semi-definite
solution X of the equation $Y = X X^H$, where $H$ denotes conjugate transpose. We denote
the Hermitian square root of Y as $Y^{1/2}$. Therefore, we could define the Hellinger
distance in higher dimensions as

\[
d_H^2(y^1, y^2) = \int \| P_{y^1}^{1/2}(\omega) - P_{y^2}^{1/2}(\omega) \|_F^2 \, d\omega. \tag{8.4}
\]

However, note that for any unitary matrix U, $X = Y^{1/2} U$ is also a solution to
$Y = X X^H$ (but in general not Hermitian if U differs from the identity). This suggests
that one may be able to do better by finding the best unitary matrix $U(\omega)$ to minimize
$\| P_{y^1}^{1/2}(\omega) - P_{y^2}^{1/2}(\omega) U(\omega) \|_F$ (at each frequency ω). In [23] this idea
has been used to define the (improved) Hellinger distance on $\mathcal{P}_p$, which can be written
in closed-form as

\[
d_{H'}^2(y^1, y^2) = \int \left\| P_{y^1}^{1/2} - P_{y^2}^{1/2} \left( P_{y^2}^{1/2} P_{y^1} P_{y^2}^{1/2} \right)^{-1/2} P_{y^2}^{1/2} P_{y^1}^{1/2} \right\|_F^2 d\omega, \tag{8.5}
\]

where dependence of the terms on ω has been dropped. Notice that the matrix
$U(\omega) = \left( P_{y^2}^{1/2} P_{y^1} P_{y^2}^{1/2} \right)^{-1/2} P_{y^2}^{1/2} P_{y^1}^{1/2}$ is unitary for every ω and
in fact it is a transfer function of an all-pass possibly infinite dimensional linear filter [23].
Here, by an all-pass transfer function or filter $U(\omega)$ we mean one for which
$U(\omega) U(\omega)^H = I_p$. Also note
that (8.5) seemingly breaks down if either of the PSDs is not full-rank. However,
solving the related optimization shows that by continuity the expression remains
valid. We should point out that recently a class of distances on $\mathcal{P}_1$ has been introduced
by Georgiou et al. based on the notion of optimal mass transport or morphism between
PSDs (rather than probability distributions, as above) [25]. Such distances enjoy some
nice properties, e.g., in terms of robustness with respect to multiplicative and additive
noise [25]. An extension to $\mathcal{P}_p$ also has been proposed [53]; however, the extension
is no longer a distance and it is not clear if it inherits the robustness property.
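The integrands of (8.4) and (8.5) at a single frequency can be sketched as follows. This is an illustration under stated assumptions rather than the implementation of [23]: the helper names are hypothetical, the test matrices are random, and the optimal unitary factor is computed as the unitary polar factor of $P_{y^2}^{1/2} P_{y^1}^{1/2}$ via an SVD, which coincides with the closed form in (8.5) when that expression is well defined.

```python
import numpy as np

def herm_sqrt(Y):
    """Hermitian positive semi-definite square root via an eigendecomposition."""
    w, V = np.linalg.eigh(Y)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.conj().T

def hellinger_sq(P1, P2):
    """Integrand of (8.4) at one frequency."""
    return np.linalg.norm(herm_sqrt(P1) - herm_sqrt(P2), "fro") ** 2

def improved_hellinger_sq(P1, P2):
    """Integrand of (8.5) at one frequency, using the best unitary alignment U."""
    R1, R2 = herm_sqrt(P1), herm_sqrt(P2)
    W, _, Vh = np.linalg.svd(R2 @ R1)
    U = W @ Vh                       # unitary polar factor of R2 R1
    return np.linalg.norm(R1 - R2 @ U, "fro") ** 2

def random_psd(p, rng):
    """Random Hermitian positive semi-definite test matrix (illustrative only)."""
    X = rng.standard_normal((p, p)) + 1j * rng.standard_normal((p, p))
    return X @ X.conj().T

rng = np.random.default_rng(1)
P1, P2 = random_psd(3, rng), random_psd(3, rng)
print(improved_hellinger_sq(P1, P2) <= hellinger_sq(P1, P2) + 1e-9)
```

Since U equal to the identity is one admissible choice, the aligned value can never exceed the unaligned one. A full distance would be obtained by summing these integrands over a frequency grid.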
Another (possibly deeper) aspect of working with the square root of the PSD
is related to the ideas of spectral factorization and the innovations process. We
review some basics, which can be found, e.g., in [6, 31, 32, 38, 62, 65]. The
important fact is that the PSD $P_y(\omega)$ of a regular process $y_t$ in $\mathcal{P}_p$ is of constant
rank $m \le p$ almost everywhere in $[0, 2\pi]$. Moreover, it admits a factorization
of the form $P_y(\omega) = P_{ly}(\omega) P_{ly}(\omega)^H$, where $P_{ly}(\omega)$ is p × m-dimensional and
uniquely determines its analytic extension $P_{ly}(z)$ outside the unit disk in $\mathbb{C}$. In
this factorization, $P_{ly}(\omega)$ itself is not determined uniquely and any two such factors
are related by an m × m-dimensional all-pass filter. However, if we require
the extension $P_{ly}(z)$ to be in the class of minimum phase filters, then the choice
of the factor $P_{ly}(\omega)$ becomes unique up to a constant unitary matrix. A p × m
($m \le p$) transfer function matrix $H(z)$ is called minimum phase if it is analytic
outside the unit disk and of constant rank m there (including at $z = \infty$). Such a
filter has an inverse filter, which is asymptotically stable. We denote this particular
factor of $P_y$ by $P_{+y}$ and call it the canonical spectral factor. The canonical factor is
still not unique, but the ambiguity is only in a constant m × m unitary matrix. The
consequence is that $y_t$ can be written as $y_t = \sum_{\tau=0}^{\infty} p_{+\tau} \epsilon_{t-\tau}$, where the p × m
matrix sequence $\{p_{+t}\}_{t=0}^{\infty}$ is the inverse Fourier transform of $P_{+y}(\omega)$ and $\epsilon_t$ is an
m-dimensional white noise process with covariance equal to the identity matrix $I_m$.
This means that $y_t$ is the output of a linear filter (i.e., an LDS of possibly infinite
order) excited by a white noise process with standard covariance. The process $\epsilon_t$
is called the innovations process or fundamental process of $y_t$. Under the Gaussian
assumption the innovation process is determined uniquely, otherwise it is determined
up to an m × m unitary factor. The important case is when $P_y(z)$ is full-rank outside
the unit disk, in which case the inverse filter $P_{+y}^{-1}$ is well-defined and asymptotically
stable, and one could recover the innovations process by filtering $y_t$ by its whitening
filter $P_{+y}^{-1}$.
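A scalar example may help fix these notions. The following worked case is a standard textbook illustration, not taken from the chapter; the process, the real coefficient b, and the symbol $\tilde{P}$ are assumptions introduced here.

```latex
% Scalar MA(1) process with real coefficient b, 0 < |b| < 1, and unit-variance white noise:
\[
y_t = \epsilon_t + b\,\epsilon_{t-1},
\qquad
P_y(\omega) = \left| 1 + b e^{-i\omega} \right|^2 = 1 + b^2 + 2b\cos\omega .
\]
% Two spectral factors with the same modulus on the unit circle:
\[
P_{+y}(z) = 1 + b z^{-1},
\qquad
\tilde{P}(z) = b + z^{-1},
\qquad
\left| P_{+y}(e^{i\omega}) \right|^2 = \left| \tilde{P}(e^{i\omega}) \right|^2 = P_y(\omega).
\]
% P_{+y} has its only zero at z = -b, inside the unit disk, and equals 1 at z = \infty,
% so it is minimum phase. \tilde{P} vanishes at z = -1/b, outside the unit disk, so it is not.
% The two factors differ by an all-pass filter:
\[
\tilde{P}(z) = P_{+y}(z)\,U(z),
\qquad
U(z) = \frac{b + z^{-1}}{1 + b z^{-1}},
\qquad
\left| U(e^{i\omega}) \right| = 1 .
\]
% The whitening filter is stable and recovers the innovations:
\[
P_{+y}^{-1}(z) = \frac{1}{1 + b z^{-1}} = \sum_{k=0}^{\infty} (-b)^k z^{-k},
\qquad
\epsilon_t = \sum_{k=0}^{\infty} (-b)^k y_{t-k} .
\]
```

In this scalar case the remaining unitary ambiguity of the canonical factor reduces to a sign.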
Now, to compare two processes, one could somehow compare their canonical
spectral factors5 or, if they are in $\mathcal{P}_p^{+}$, their whitening filters. In [38] a large class
of divergences based on the idea of comparing associated whitening filters (in the
frequency domain) has been proposed. For example, let $P_{+y^i}$ be the canonical
factor of $P_{y^i}$, i = 1, 2. If one filters $y^i_t$, i = 1, 2, with $P_{+y^j}^{-1}$, j = 1, 2, then the
output PSD is $P_{+y^j}^{-1} P_{y^i} P_{+y^j}^{-H}$. Note that when i = j the output PSD is $I_p$ across
every frequency. It can be shown that

\[
d_I(y^1, y^2) = \int \left[ \mathrm{tr}\!\left( P_{+y^1}^{-1} P_{y^2} P_{+y^1}^{-H} - I_p \right) + \mathrm{tr}\!\left( P_{+y^2}^{-1} P_{y^1} P_{+y^2}^{-H} - I_p \right) \right] d\omega
\]

is a symmetric divergence [38]. Note that $d_I(y^1, y^2)$ is
independent of the unitary ambiguity in the canonical factor and in fact

\[
d_I(y^1, y^2) = \int \mathrm{tr}\!\left( P_{y^1}^{-1} P_{y^2} + P_{y^2}^{-1} P_{y^1} - 2 I_p \right) d\omega. \tag{8.6}
\]

Such divergences enjoy certain invariance properties, e.g., if we filter both processes
with a common minimum phase filter, then the divergence remains unchanged. In
particular, it is scale-invariant. Such properties are shared by the distances or diver-
gences that are based on the ratios of PSDs (see below for more examples). Scale
invariance in the case of 1D PSDs has been advocated as a desirable property, since in
many cases the shape of the PSDs rather than their relative scale is the discriminative
feature (see e.g., [9, 26]).
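The invariance properties just described can be checked numerically on the integrand of (8.6) at one frequency. In this sketch the helper names and the random matrices are illustrative assumptions, and the square matrix G stands for the value of a common invertible filter at that frequency.

```python
import numpy as np

def d_I_integrand(P1, P2):
    """Integrand of (8.6) at one frequency: tr(P1^{-1} P2 + P2^{-1} P1 - 2 I_p)."""
    p = P1.shape[0]
    val = np.trace(np.linalg.solve(P1, P2) + np.linalg.solve(P2, P1)) - 2 * p
    return float(val.real)

def random_pd(p, rng):
    """Random Hermitian positive definite test matrix (illustrative only)."""
    X = rng.standard_normal((p, p)) + 1j * rng.standard_normal((p, p))
    return X @ X.conj().T + np.eye(p)

rng = np.random.default_rng(2)
P1, P2 = random_pd(3, rng), random_pd(3, rng)

# Value of a hypothetical common filter at this frequency (well-conditioned by construction).
G = np.eye(3) + 0.1 * (rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3)))
Q1, Q2 = G @ P1 @ G.conj().T, G @ P2 @ G.conj().T

print(np.isclose(d_I_integrand(P1, P2), d_I_integrand(Q1, Q2)))          # common filtering
print(np.isclose(d_I_integrand(P1, P2), d_I_integrand(5 * P1, 5 * P2)))  # common scaling
```

Both checks follow from the cyclic property of the trace: the common factor G cancels inside $P_{y^1}^{-1} P_{y^2}$ up to a similarity transformation.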
5 In fact, our approach (in Sects. 8.3–8.5) is also based on the idea of comparing the minimum phase
(i.e., canonical) filters or factors in the case of processes with rational spectra. However, instead of
comparing the associated transfer functions or impulse responses, we try to compare the associated
state-space realizations (in a specific sense). This approach, therefore, is in some sense structural or
generative, since it tries to compare how the processes are generated (according to the state-space
representation) and the model order plays an explicit role in it.

One can arrive at similar distances from other geometric or probabilistic paths.
One example is the famous Itakura-Saito divergence (sometimes called distance)
between PSDs in $\mathcal{P}_1^{+}$, which is defined as

\[
d_{IS}(y^1, y^2) = \int \left( \frac{P_{y^1}}{P_{y^2}} - \log \frac{P_{y^1}}{P_{y^2}} - 1 \right) d\omega. \tag{8.7}
\]

This divergence has been used in practice, at least, since the 1970s (see [48] for
references). The Itakura-Saito divergence can be derived from the Kullback-Leibler
divergence between (infinite dimensional) probability densities of the two processes
(the definition is a time-domain based definition; however, the final result is readily
expressible in the frequency domain).6 On the other hand, Amari’s information
geometry-based approach [5, Chap. 5] allows one to geometrize $\mathcal{P}_1^{+}$ in various ways and
yields different distances including the Itakura-Saito distance (8.7) or a Riemannian
distance such as

\[
d_R^2(y^1, y^2) = \int \left( \log \frac{P_{y^1}}{P_{y^2}} \right)^2 d\omega. \tag{8.8}
\]

Furthermore, in this framework one can define geodesics between two processes
under various Riemannian or non-Riemannian connections. The high-dimensional
version of the Itakura-Saito distance has also been known since the 1980s [42] but
is less used in practice:

\[
d_{IS}(y_1, y_2) = \int \left( \operatorname{trace}\left(P_{y_2}^{-1} P_{y_1}\right) - \log\det\left(P_{y_2}^{-1} P_{y_1}\right) - p \right) d\omega. \tag{8.9}
\]
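The scalar distances (8.7) and (8.8) are easy to evaluate numerically once the PSDs are sampled on a frequency grid. The following is a minimal sketch, not the authors' code: it assumes the integrals run over [-pi, pi) without a 1/(2*pi) normalization (the text does not fix this convention), and the helper names and the AR(1) test spectra are illustrative choices.

```python
import numpy as np

def spectral_grid(num_points=4096):
    """Uniform frequency grid on [-pi, pi); assumed integration domain."""
    return np.linspace(-np.pi, np.pi, num_points, endpoint=False)

def integrate(values, omega):
    """Rectangle rule for a periodic integrand sampled on a uniform grid."""
    return float(np.sum(values) * (omega[1] - omega[0]))

def itakura_saito(p1, p2, omega):
    """Itakura-Saito divergence (8.7) between two sampled scalar PSDs."""
    ratio = p1 / p2
    return integrate(ratio - np.log(ratio) - 1.0, omega)

def log_spectral_sq(p1, p2, omega):
    """Squared Riemannian (log-spectral) distance (8.8)."""
    return integrate(np.log(p1 / p2) ** 2, omega)

def ar1_psd(a, omega):
    """PSD of the AR(1) process y_t = a*y_{t-1} + v_t (unit-variance white v_t)."""
    return 1.0 / np.abs(1.0 - a * np.exp(-1j * omega)) ** 2

omega = spectral_grid()
p1, p2 = ar1_psd(0.5, omega), ar1_psd(-0.3, omega)

# The divergence depends on the order of its arguments; (8.8) does not.
print(itakura_saito(p1, p2, omega), itakura_saito(p2, p1, omega))
print(log_spectral_sq(p1, p2, omega), log_spectral_sq(p2, p1, omega))
```

Note that neither quantity is scale-invariant: replacing one PSD by a constant multiple of itself yields a nonzero value in both cases.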

Recently, in [38] a Riemannian framework for geometrization of $\mathcal{P}_p^+$ for $p \ge 1$ has
been proposed, which yields Riemannian distances such as:

\[
d_R^2(y_1, y_2) = \int \left\| \log\left( P_{y_1}^{-1/2} P_{y_2} P_{y_1}^{-1/2} \right) \right\|_F^2 d\omega, \tag{8.10}
\]

where log is the standard matrix logarithm. In general, such approaches are not suited
for large p due to computational costs and the full-rankness requirement. We should
stress that in (very) high dimensions the assumption of full-rankness of PSDs is not
a viable one, in particular because usually not only the actual time series are highly
correlated but also the contaminating noises are correlated as well. In fact, this has
led to the search for models capturing this quality. One example is the class of
generalized linear dynamic factor models, which are closely related to the tall, full
rank LDS models (see Sect. 8.3.3 and [20, 24]).
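To make the full-rankness requirement concrete, the integrands of (8.9) and (8.10) at a single frequency can be computed from the eigenvalues of $P_{y_1}^{-1/2} P_{y_2} P_{y_1}^{-1/2}$. The sketch below is illustrative only: the function names are hypothetical, the random Hermitian matrices stand in for PSD values at one frequency, and the Cholesky step fails exactly when a PSD value is rank deficient.

```python
import numpy as np

def relative_eigenvalues(p1, p2):
    """Eigenvalues of P1^{-1/2} P2 P1^{-1/2}, via a Cholesky factor of P1.

    L^{-1} P2 L^{-H} (with P1 = L L^H) is similar to P1^{-1} P2, so it has
    the same eigenvalues. Raises LinAlgError if P1 is not positive definite.
    """
    chol = np.linalg.cholesky(p1)
    tmp = np.linalg.solve(chol, p2)                       # L^{-1} P2
    whitened = np.linalg.solve(chol, tmp.conj().T).conj().T  # L^{-1} P2 L^{-H}
    return np.linalg.eigvalsh((whitened + whitened.conj().T) / 2)

def riemannian_integrand(p1, p2):
    """|| log(P1^{-1/2} P2 P1^{-1/2}) ||_F^2 at one frequency, cf. (8.10)."""
    return float(np.sum(np.log(relative_eigenvalues(p1, p2)) ** 2))

def itakura_saito_integrand(p1, p2):
    """trace(P2^{-1} P1) - log det(P2^{-1} P1) - p at one frequency, cf. (8.9)."""
    lam = relative_eigenvalues(p2, p1)    # eigenvalues of P2^{-1} P1
    return float(np.sum(lam - np.log(lam) - 1.0))

rng = np.random.default_rng(0)

def random_psd_value(p):
    """Random Hermitian positive definite matrix (stand-in for a PSD value)."""
    g = rng.standard_normal((p, p)) + 1j * rng.standard_normal((p, p))
    return g @ g.conj().T + 0.1 * np.eye(p)

p1, p2 = random_psd_value(3), random_psd_value(3)
mix = np.array([[2.0, 1.0, 0.0], [0.0, 1.0, 1.0], [1.0, 0.0, 3.0]])  # invertible

print(riemannian_integrand(p1, p2), itakura_saito_integrand(p1, p2))
# Congruence by an invertible matrix leaves the Riemannian integrand unchanged.
print(riemannian_integrand(mix @ p1 @ mix.T, mix @ p2 @ mix.T))
```

Each evaluation costs a Cholesky factorization and an eigenvalue decomposition of a p × p matrix per frequency sample, which illustrates why such distances become expensive for large p.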
Setting the above-mentioned issues aside, for the purposes of Problem 1, the space
$\mathcal{P}_p$ (or even $\mathcal{P}_p^+$) is too large. The reason is that it includes, e.g., ARMA processes of
arbitrarily large orders, and it is not clear, e.g., how an average of some ARMA models
or processes of equal order might turn out. As mentioned before, it is convenient or
reasonable to require the average to be of the same order.7

6 Notice that defining distances between probability densities in the time domain is a more general
approach than the PSD-based approaches, and it can be employed in the case of nonstationary as
well as non-Gaussian processes. However, such an approach, in general, is computationally difficult.

8.2.2 Geometrizing the Spaces of Models

Any distance on $\mathcal{P}_p$ (or $\mathcal{P}_p^+$) induces a distance, e.g., on a subspace corresponding
to AR or ARMA models of a fixed order. This is an example of an extrinsic distance
induced from an infinite dimensional ambient space to a finite dimensional subspace.
In general, this framework is not ideal and we might try to, e.g., define an intrinsic
distance on the finite dimensional subspace. In fact, Amari’s original paper [4] lays
down a framework for this approach, but lacks actual computations. In [61], based on
Amari’s approach, distances between models in the space of one-dimensional ARMA
models of fixed order are derived. For high order models or
in high dimensions, such calculations are, in general, computationally difficult [61].
The main reason is that the dependence of PSD-based distances on state-space or
ARMA parameters is, in general, highly nonlinear (the important exception is for
parameters of AR models, especially in 1D).
Alternative approaches have also been pursued. For example, in [57] the main
idea is to compare (based on the $\ell^2$ norm) the coefficients of the infinite order AR
models of two processes. This is essentially the same as comparing (in the time
domain) the whitening filters of the two processes. This approach is limited to $\mathcal{P}_p^+$
and computationally demanding for large p. See [19] for examples of classification
and clustering of 1D time-series using this approach. In [8], the space of 1D AR
processes of a fixed order is geometrized using the geometry of positive-definite
Toeplitz matrices (via the reflection coefficients parameterization), and, moreover,
$L^p$ averaging on that space is studied. In [50] a (pseudo)-distance between two
processes is defined through a weighted $\ell^2$ distance between the (infinite) sequences
of the cepstrum coefficients of the two processes. Recall that the cepstrum of a 1D
signal is the inverse Fourier transform of the logarithm of the magnitude of the Fourier
transform of the signal. In the frequency domain this distance (known as the Martin
distance) can be written as (up to a multiplicative constant)

\[
d_M^2(y_1, y_2) = \int \left( D^{1/2} \log \frac{P_{y_1}}{P_{y_2}} \right)^2 d\omega, \tag{8.11}
\]

where $D^{\lambda}$ is the fractional derivative operator in the frequency domain, interpreted
as multiplication of the corresponding Fourier coefficients in the time domain by
$e^{\pi i \lambda/2} n^{\lambda}$ for $n \ge 0$ and by $e^{-\pi i \lambda/2} (-n)^{\lambda}$ for $n < 0$. Notice that $d_M$ is scale-invariant
in the sense described earlier and also it is a pseudo-distance since it is zero if the
PSDs are multiples of each other (this is a true scale-invariance property, which in
certain applications is highly desirable).8 Interestingly, in the case of 1D ARMA
models, $d_M$ can be expressed conveniently in closed form in terms of the poles and
zeros of the models [50]. Moreover, in [18] it is shown that $d_M$ can be calculated
quite efficiently in terms of the parameters of the state-space representation of the
ARMA processes. In fact, the Martin distance has a simple interpretation in terms
of the subspace angles between the extended observability matrices (cf. Sect. 8.4.3)
of the state-space representations [18]. This brings about important computational
advantages and has made it possible to extend a form of the Martin distance to higher
dimensions (see e.g., [16]). However, it should be noted that the extension of the Martin distance
to higher dimensions in such a way that all its desirable properties carry over has
proven to be difficult [13].9 Nevertheless, some extensions have been quite effective
in certain high-dimensional applications, e.g., video classification [16]. In [16], the
approach of [18] is shown to be a special case of the family of Binet-Cauchy kernels
introduced in [64], and this might explain the effectiveness of the extensions of the
Martin distance to higher dimensions.

7 Interestingly, for an average defined based on the Itakura-Saito divergence in the space of 1D AR
models this property holds [26], see also [5, Sect. 5.3].
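In the 1D case, the cepstral form of the Martin distance (a weighted $\ell^2$ distance between cepstrum sequences, with weight n on the n-th coefficient) can be approximated directly with an FFT. The sketch below is an illustration under assumptions: the PSDs are sampled uniformly on [0, 2*pi), the cepstrum is truncated to the available FFT length, the multiplicative constant mentioned in the text is dropped, and the function names are hypothetical.

```python
import numpy as np

def cepstrum(psd_samples):
    """Cepstrum coefficients c_n of a PSD sampled uniformly on [0, 2*pi)."""
    return np.real(np.fft.ifft(np.log(psd_samples)))

def martin_distance_sq(psd1, psd2):
    """Truncated weighted l2 distance sum_{n>=1} n * (c_n - c'_n)^2."""
    diff = cepstrum(psd1) - cepstrum(psd2)
    half = len(diff) // 2
    n = np.arange(1, half)
    return float(np.sum(n * diff[1:half] ** 2))

omega = 2 * np.pi * np.arange(4096) / 4096

def ar1_psd(a):
    """PSD of the AR(1) process y_t = a*y_{t-1} + v_t on the grid omega."""
    return 1.0 / np.abs(1.0 - a * np.exp(-1j * omega)) ** 2

white = np.ones_like(omega)

# For an AR(1) model with pole a versus white noise, c_n = a^n / n, so the
# weighted sum equals -log(1 - a^2).
print(martin_distance_sq(ar1_psd(0.5), white), -np.log(1 - 0.5 ** 2))
# Rescaling a PSD only changes c_0, which carries zero weight.
print(martin_distance_sq(3.0 * ar1_psd(0.5), ar1_psd(0.5)))
```

The second printed value illustrates the scale invariance discussed above: a constant factor shifts the log-spectrum by a constant and therefore affects only the zeroth cepstrum coefficient.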
In summary, we should say that the extensions of the geometrical methods dis-
cussed in this section to P p for p > 1 do not seem obvious or otherwise they are
computationally very expensive. Moreover, these approaches often yield extrinsic
distances induced from infinite dimensional ambient spaces, which, e.g., in the case
of averaging LDSs of fixed order can be problematic.

8.2.3 Control-Theoretic Approaches

More relevant to us are [33, 46], where (intrinsic) state-space based Riemannian dis-
tances between LDSs of fixed size and fixed order have been studied. Such approaches
ideally suit Problem 1, but they are computationally demanding. More recently, in
[1] and subsequently in [2, 3], we introduced group action induced distances on
certain spaces of LDSs of fixed size and order. As it will become clear in the next
section, an important feature of this approach is that the LDS order is explicit in the
construction of the distance, and the state-space parameters appear in the distance
in a simple form. These features make certain related calculations (e.g., optimiza-
tion) much more convenient (compared with other methods). Another aspect of our
approach is that, contrary to most of the distances discussed so far, which compare
the PSDs or the canonical factors directly, our approach amounts to comparing the
generative or the structural models of the processes or how they are generated. This
feature also could be useful in designing more application-specific or structure-aware
distances.

8 It is interesting to note that by a simple modification some of the spectral-ratio based distances
can attain this property, e.g., by modifying $d_R$ in (8.8) as
\[
d_{RI}^2(y_1, y_2) = \int \left( \log \frac{P_{y_1}}{P_{y_2}} \right)^2 d\omega - \left( \int \log \frac{P_{y_1}}{P_{y_2}} \, d\omega \right)^2
\]
(see also [9, 25, 49]).
9 This and the results in [53] underline the fact that defining distances on $\mathcal{P}_p$ for p > 1 may
be challenging, not only from a computational point of view but also from a theoretical one. In
particular, certain nice properties in 1D do not automatically carry over to higher dimensions by a
simple extension of the definitions in 1D.

8.3 Processes Generated by LDSs of Fixed Order

Consider an LDS, M, of the form (8.1) with a realization $R = (A, B, C, D) \in \widetilde{SL}_{m,n,p}$.10
In the sequel, for various reasons, we will restrict ourselves to increasingly
smaller submanifolds of $\widetilde{SL}_{m,n,p}$, which will be denoted by additional superscripts.
Recall that the p × m matrix transfer function is $T(z) = D + C(I_n - z^{-1}A)^{-1}B$,
where $z \in \mathbb{C}$ and $I_n$ is the n-dimensional identity matrix. We assume that all LDSs are
excited by the standard white Gaussian process. Hence, the output PSD matrix (in the
z-domain) is the p × p matrix function $P(z) = T(z)T^{\top}(z^{-1})$. The PSD is a rational
matrix function of z whose rank (a.k.a. normal rank) is constant almost everywhere
in $\mathbb{C}$. Stationarity of the output process is guaranteed if M is asymptotically stable.
We denote the submanifold of such realizations by $\widetilde{SL}^{a}_{m,n,p} \subset \widetilde{SL}_{m,n,p}$.

8.3.1 Embedding Stochastic Processes in LDS Spaces

Two (stochastic) LDSs are indistinguishable if their output PSDs are equal. Using this
equivalence on the entire set of LDSs is not useful, because, as mentioned earlier two
transfer functions which differ by an all-pass filter result in the same PSD. Therefore,
the equivalence relation could induce a complicated many-to-one correspondence
between the LDSs and the subspace of stochastic processes they generate. However, if
we restrict ourselves to the subspace of minimum phase LDSs the situation improves.
Let us denote the subspace of minimum-phase realizations by $\widetilde{SL}^{a,mp}_{m,n,p} \subset \widetilde{SL}^{a}_{m,n,p}$.
This is clearly an open submanifold of $\widetilde{SL}^{a}_{m,n,p}$. In $\widetilde{SL}^{a,mp}_{m,n,p}$, the canonical spectral
factorization of the output PSD is unique up to an orthogonal matrix [6, 62, 65]: let
$T_1(z)$ and $T_2(z)$ have realizations in $\widetilde{SL}^{a,mp}_{m,n,p}$ and let $T_1(z)T_1^{\top}(z^{-1}) = T_2(z)T_2^{\top}(z^{-1})$,
then $T_1(z) = T_2(z)\Phi$ for a unique $\Phi \in O(m)$, where O(m) is the Lie group of m × m
orthogonal matrices. Therefore, any p-dimensional process with PSD of normal
rank m can be identified with a simple equivalence class of stable and minimum-phase
transfer functions and the corresponding LDSs.11

10 It is crucial to have in mind that we explicitly distinguish between the LDS, M, and its
realization R, which is not unique. As it will become clear soon, an LDS has an equivalence class of
realizations.
11 These rank conditions, interestingly, have differential geometric significance in yielding nice
quotient spaces, see Sect. 8.4.



8.3.2 Equivalent Realizations Under Internal and External Symmetries

A fundamental fact is that there are symmetries or invariances due to certain Lie group
actions in the model (8.1). Let GL(n) denote the Lie group of n × n non-singular
(real) matrices. We say that the Lie group GL(n) × O(m) acts on the realization
space $\widetilde{SL}_{m,n,p}$ (or its subspaces) via the action • defined as12

\[
(P, \Phi) \bullet (A, B, C, D) = (P^{-1} A P, \; P^{-1} B \Phi, \; C P, \; D \Phi). \tag{8.12}
\]

One can easily verify that under this action the output covariance sequence (or PSD)
remains invariant. In general, the converse is not true. That is, two output covariance
sequences might be equal while their corresponding realizations are not related via
• (due to non-minimum phase and the action not being free [47], also see below).
Recall that the action of a group on a set is called free if every element of the set is
fixed only by the identity element of the group. For the converse to hold we need to
impose further rank conditions, as we will see next.
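The invariance of the PSD under the action (8.12) can be checked numerically. The sketch below is illustrative: it uses the transfer function $T(z) = D + C(I_n - z^{-1}A)^{-1}B$ and $P(z) = T(z)T^{\top}(z^{-1})$ from Sect. 8.3, while the function names, dimensions, and the particular matrices are arbitrary choices made here.

```python
import numpy as np

def transfer_function(realization, z):
    """T(z) = D + C (I_n - z^{-1} A)^{-1} B for R = (A, B, C, D)."""
    a, b, c, d = realization
    n = a.shape[0]
    return d + c @ np.linalg.solve(np.eye(n) - a / z, b)

def psd(realization, omega):
    """Output PSD P(z) = T(z) T^T(z^{-1}) evaluated at z = exp(i*omega)."""
    z = np.exp(1j * omega)
    return transfer_function(realization, z) @ transfer_function(realization, 1 / z).T

def act(p, phi, realization):
    """Group action (8.12): (P, Phi) . (A, B, C, D)."""
    a, b, c, d = realization
    p_inv = np.linalg.inv(p)
    return (p_inv @ a @ p, p_inv @ b @ phi, c @ p, d @ phi)

rng = np.random.default_rng(1)
m, n, p_dim = 2, 3, 4
a = rng.standard_normal((n, n))
a *= 0.9 / np.linalg.norm(a, 2)                     # ||A||_2 < 1, hence stable
r1 = (a,
      rng.standard_normal((n, m)),
      rng.standard_normal((p_dim, n)),
      rng.standard_normal((p_dim, m)))

p = np.array([[2.0, 1.0, 0.0], [0.0, 1.0, 1.0], [1.0, 0.0, 3.0]])  # in GL(3)
phi, _ = np.linalg.qr(rng.standard_normal((m, m)))                 # in O(2)
r2 = act(p, phi, r1)

# The realizations differ, but the PSDs agree at every frequency.
for w in (0.3, 1.7, -2.1):
    print(np.allclose(psd(r1, w), psd(r2, w)))
```

The check works because the action changes the transfer function only by a right orthogonal factor, $T(z) \mapsto T(z)\Phi$, which cancels in $T(z)T^{\top}(z^{-1})$.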

8.3.3 From Processes to Realizations (The Rank Conditions)

Now, we study some rank conditions (i.e., submanifolds of $\widetilde{SL}_{m,n,p}$) under which
• is a free action.

8.3.3.1 Observable, Controllable, and Minimal Realizations

Recall that the controllability and observability matrices of order k associated with
a realization R = (A, B, C, D) are defined as $\mathcal{C}_k = [B, AB, \ldots, A^{k-1}B]$ and
$\mathcal{O}_k = [C^{\top}, (CA)^{\top}, \ldots, (CA^{k-1})^{\top}]^{\top}$, respectively. A realization is called
controllable (resp. observable) if $\mathcal{C}_k$ (resp. $\mathcal{O}_k$) is of rank n for k = n. We denote the
subspace of controllable (resp. observable) realizations by $\widetilde{SL}^{co}_{m,n,p}$ (resp. $\widetilde{SL}^{ob}_{m,n,p}$).
The space $\widetilde{SL}^{min}_{m,n,p} = \widetilde{SL}^{co}_{m,n,p} \cap \widetilde{SL}^{ob}_{m,n,p}$ is called the space of minimal realizations.
An important fact is that we cannot reduce the order (i.e., the size of A) of a minimal
realization without changing its input-output behavior.
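These rank conditions translate directly into a numerical test. The following is a minimal sketch with hypothetical function names; it relies on a numerical rank (via `numpy.linalg.matrix_rank`), so for nearly non-minimal realizations the outcome depends on the rank tolerance.

```python
import numpy as np

def controllability_matrix(a, b, k):
    """C_k = [B, AB, ..., A^{k-1} B]."""
    blocks = [b]
    for _ in range(k - 1):
        blocks.append(a @ blocks[-1])
    return np.hstack(blocks)

def observability_matrix(a, c, k):
    """O_k = [C^T, (CA)^T, ..., (C A^{k-1})^T]^T."""
    blocks = [c]
    for _ in range(k - 1):
        blocks.append(blocks[-1] @ a)
    return np.vstack(blocks)

def is_minimal(a, b, c):
    """True if (A, B, C) is both controllable and observable (k = n)."""
    n = a.shape[0]
    controllable = np.linalg.matrix_rank(controllability_matrix(a, b, n)) == n
    observable = np.linalg.matrix_rank(observability_matrix(a, c, n)) == n
    return bool(controllable and observable)

a = np.array([[0.5, 1.0], [0.0, 0.3]])
c = np.array([[1.0, 0.0]])
b_good = np.array([[0.0], [1.0]])   # input reaches both states
b_bad = np.array([[1.0], [0.0]])    # second state is never excited

print(is_minimal(a, b_good, c))     # controllable and observable
print(is_minimal(a, b_bad, c))      # not controllable, hence not minimal
```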

8.3.3.2 Tall, Full Rank LDSs

Another (less studied) rank condition is when C is of rank n (here $p \ge n$ is required).
Denote by $\widetilde{SL}^{tC}_{m,n,p} \subset \widetilde{SL}^{ob}_{m,n,p}$ the subspace of such realizations and call a
corresponding LDS tall and full-rank. Such LDSs are closely related to generalized
linear dynamic factor models for (very) high-dimensional time series [20] and also
appear in video sequence modeling [1, 12, 60]. It is easy to verify that all the above
realization spaces are smooth open submanifolds of $\widetilde{SL}_{m,n,p}$. Their corresponding
submanifolds of stable or minimum-phase LDSs (e.g., $\widetilde{SL}^{a,mp,co}_{m,n,p}$) are defined in an
obvious way.

12 Strictly speaking • is a right action; however, it is notationally convenient to write it as a left
action in (8.12).
The following proposition forms the basis of our approach to defining distances
between processes: any distance on the space of LDSs with realizations in the above
submanifolds (with rank conditions) can be used to define a distance on the space of
processes generated by those LDSs.

Proposition 1 Let $\tilde{\Gamma}_{m,n,p}$ be $\widetilde{SL}^{a,mp,co}_{m,n,p}$, $\widetilde{SL}^{a,mp,ob}_{m,n,p}$, $\widetilde{SL}^{a,mp,min}_{m,n,p}$, or $\widetilde{SL}^{a,mp,tC}_{m,n,p}$.
Consider two realizations $R_1, R_2 \in \tilde{\Gamma}_{m,n,p}$ excited by the standard white Gaussian
process. Then we have:
1. If $(P, \Phi) \bullet R_1 = R_2$ for some $(P, \Phi) \in GL(n) \times O(m)$, then the two realizations
generate the same (stationary) output process (i.e., outputs have the same PSD
matrices).
2. Conversely, if the outputs of the two realizations are equal (i.e., they have the same
PSD matrices), then there exists a unique $(P, \Phi) \in GL(n) \times O(m)$ such that
$(P, \Phi) \bullet R_1 = R_2$.

8.4 Principal Fiber Bundle Structures over Spaces of LDSs

As explained above, an LDS, M, has an equivalence class of realizations related by
the action •. Hence, M sits naturally in a quotient space, namely $\widetilde{SL}_{m,n,p}/(GL(n) \times O(m))$.
However, this quotient space is not smooth or even Hausdorff. Recall that if
a Lie group G acts on a manifold smoothly, properly, and freely, then the quotient
space has the structure of a smooth manifold [47]. Smoothness of • is obvious. In
general, the action of a non-compact group such as GL(n) × O(m) is not proper.
However, one can verify that the rank conditions we imposed in Proposition 1 are
enough to make • both a proper and free action on the realization submanifolds
(see [2] for a proof). The resulting quotient manifolds are denoted by dropping the
superscript ∼, e.g., $SL^{a,mp,min}_{m,n,p}$. The next theorem, which is an extension of existing
results, e.g., in [33], shows that, in fact, we have a principal fiber bundle structure.
Theorem 1 Let $\tilde{\Gamma}_{m,n,p}$ be as in Proposition 1 and $\Gamma_{m,n,p} = \tilde{\Gamma}_{m,n,p}/(GL(n) \times O(m))$
be the corresponding quotient LDS space. The realization-system pair
$(\tilde{\Gamma}_{m,n,p}, \Gamma_{m,n,p})$ has the structure of a smooth principal fiber bundle with structure
group GL(n) × O(m). In the case of $SL^{a,mp,tC}_{m,n,p}$ the bundle is trivial (i.e., diffeomorphic
to a product), otherwise it is trivial only when m = 1 or n = 1.
The last part of the theorem has an important consequence. Recall that a principal
bundle is trivial if it is diffeomorphic to the global product of its base space and its structure
group. Equivalently, this means that a trivial bundle admits a global smooth cross
section or what is known as a smooth canonical form in the case of LDSs, i.e., a
globally smooth mapping $s : \Gamma_{m,n,p} \to \tilde{\Gamma}_{m,n,p}$ which assigns to every system a
unique realization. This theorem implies that the minimality condition is a complicated
nonlinear constraint, in the sense that it makes the bundle twisted and nontrivial,
so that no continuous canonical form exists. Establishing this obstruction put an
end to control theorists’ search for canonical forms for MIMO LDSs in the 1970s
and explained why system identification for MIMO LDSs is a challenging task [11,
15, 36].
On the other hand, one can verify that $(\widetilde{SL}^{a,mp,tC}_{m,n,p}, SL^{a,mp,tC}_{m,n,p})$ is a trivial bundle.
Therefore, for such systems global canonical forms exist and they can be used to
define distances, i.e., if $s : SL^{a,mp,tC}_{m,n,p} \to \widetilde{SL}^{a,mp,tC}_{m,n,p}$ is such a canonical form then

\[
d_{SL^{a,mp,tC}_{m,n,p}}(M_1, M_2) = \tilde{d}_{\widetilde{SL}^{a,mp,tC}_{m,n,p}}\big(s(M_1), s(M_2)\big)
\]

defines a distance on $SL^{a,mp,tC}_{m,n,p}$ for any distance $\tilde{d}_{\widetilde{SL}^{a,mp,tC}_{m,n,p}}$ on the realization space.
In general, unless one has some specific knowledge there is no preferred choice for a
section or canonical form. If one has a group-invariant distance on the realization space,
then the distance induced from using a cross section might be inferior to the group action
induced distance, in the sense that it may result in an artificially larger distance. In the next
section we review the basic idea behind group action induced distances in our application.

8.4.1 Group Action Induced Distances

Figure 8.1a schematically shows a realization bundle $\tilde{\Gamma}$ and its base LDS space
Γ. Systems $M_1, M_2 \in \Gamma$ have realizations $R_1$ and $R_2$ in $\tilde{\Gamma}$, respectively. Let us
assume that a G = GL(n) × O(m)-invariant distance $\tilde{d}_{\tilde{\Gamma}}$ on the realization bundle is
given. The realizations, $R_1$ and $R_2$, in general, are not aligned with each other, i.e.,
$\tilde{d}_{\tilde{\Gamma}}(R_1, R_2)$ can be still reduced by sliding one realization along its fiber as depicted
in Fig. 8.1b. This leads to the definition of the group action induced distance:13

\[
d_{\Gamma}(M_1, M_2) = \inf_{(P, \Phi) \in G} \tilde{d}_{\tilde{\Gamma}}\big((P, \Phi) \bullet R_1, R_2\big). \tag{8.13}
\]

In fact, one can show that $d_{\Gamma}(\cdot, \cdot)$ is a true distance on Γ, i.e., it is symmetric and
positive definite and obeys the triangle inequality (see e.g., [66]).14
The main challenge in the above approach is the fact that, due to non-compactness
of GL(n), constructing a GL(n) × O(m)-invariant distance is computationally difficult.
The construction of such a distance can essentially be accomplished by
defining a GL(n) × O(m)-invariant Riemannian metric on the realization space and
solving the corresponding geodesic equation, as well as searching for global
minimizers.15 Such a Riemannian metric for deterministic LDSs was proposed in [45,
46]. One could also start from (an already invariant) distance on a large ambient
space such as $\mathcal{P}_p$ and specialize it to the desired submanifold Γ of LDSs to get a
Riemannian manifold on Γ and then thereon solve geodesic equations, etc. to get an
intrinsic distance (e.g., as reported in [33, 34]). Both of these approaches seem very
complicated to implement for the case of very high-dimensional LDSs. Instead, our
approach is to use extrinsic group action induced distances, which are induced from
unitary-invariant distances on the realization space. For that we recall the notion of
reduction of structure group on a principal fiber bundle.

Fig. 8.1 Over each LDS in Γ sits a realization fiber. The fibers together form the realization space
(bundle) $\tilde{\Gamma}$. If given a G-invariant distance on the realization bundle, then one can define a distance
on the LDS space by aligning any realizations $R_1, R_2$ of the two LDSs $M_1, M_2$ as in (8.13)

13 We may call this an alignment distance. However, based on the same principle in Sect. 8.5 we
define another group action induced distance, which we explicitly call the alignment distance. Since
our main object of interest is that distance, we prefer not to call the distance in (8.13) an alignment
distance.
14 It is interesting to note that some of the good properties of the k-nearest neighborhood algorithms
on a general metric space depend on the triangle inequality [21].

8.4.2 Standardization: Reduction of the Structure Group

Next, we recall the notion of reducing a bundle with non-compact structure group
to one with a compact structure group. This will be useful in our geometrization
approach in the next section. Interestingly, bundle reduction also appears in statistical
analysis of shapes under the name of standardization [43]. The basic fact is that any
principal fiber G-bundle $(\tilde{\Gamma}, \Gamma)$ can be reduced to an OG-subbundle $\widetilde{O\Gamma} \subset \tilde{\Gamma}$,
where OG is the maximal compact subgroup of G [44]. This reduction means that
Γ is diffeomorphic to $\widetilde{O\Gamma}/OG$ (i.e., no topological information is lost by going to
the subbundle and the subgroup). Therefore, in our cases of interest we can reduce
a GL(n) × O(m)-bundle to an OG(n, m) = O(n) × O(m)-subbundle. We call
such a subbundle a standardized realization space or (sub)bundle. One can perform
reduction to various standardized subbundles and there is no canonical reduction.
However, in each application one can choose an interesting one. A reduction is in
spirit similar to the Gram-Schmidt orthonormalization [44, Chap. 1]. Figure 8.2a
shows a standardized subbundle $\widetilde{O\Gamma}$ in the realization bundle $\tilde{\Gamma}$.

Fig. 8.2 A standardized subbundle $\widetilde{O\Gamma}_{m,n,p}$ of $\tilde{\Gamma}_{m,n,p}$ is a subbundle on which G acts via its
compact subgroup OG. The quotient space $\widetilde{O\Gamma}_{m,n,p}/OG$ still is diffeomorphic to the base space
$\Gamma_{m,n,p}$. One can define an alignment distance on the base space by aligning realizations
$R_1, R_2 \in \widetilde{O\Gamma}_{m,n,p}$ of $M_1, M_2 \in \Gamma_{m,n,p}$ as in (8.15)

15 This problem, in general, is difficult, among other things, because it is a non-convex
(infinite-dimensional) variational problem. Recall that in Riemannian geometry the non-convexity of the arc
length variational problem can be related to the non-trivial topology of the manifold (see e.g., [17]).

8.4.3 Examples of Realization Standardization

As an example consider $R = (A, B, C, D) \in \widetilde{SL}^{a,mp,tC}_{m,n,p}$, and let C = U P be
an orthonormalization of C, where $U^{\top} U = I_n$ and $P \in GL(n)$. Now the new
realization $\hat{R} = (P^{-1}, I_m) \bullet R$ belongs to the O(n)-subbundle
$\widetilde{OSL}^{a,mp,tC}_{m,n,p} = \{R \in \widetilde{SL}^{a,mp,tC}_{m,n,p} \mid C^{\top} C = I_n\}$.
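This standardization step is straightforward to carry out numerically. The sketch below uses a QR factorization as the orthonormalization C = U P (one possible choice; the text does not prescribe a particular factorization) and applies $(P^{-1}, I_m)$ through (8.12). The function name and the random test realization are illustrative.

```python
import numpy as np

def standardize_tall(realization):
    """Map a tall, full-rank realization to one with C^T C = I_n.

    With C = U P (reduced QR), the action (P^{-1}, I_m) . (A, B, C, D)
    from (8.12) gives (P A P^{-1}, P B, C P^{-1}, D) = (P A P^{-1}, P B, U, D).
    """
    a, b, c, d = realization
    u, p = np.linalg.qr(c)          # u: p x n with orthonormal columns, p: n x n
    p_inv = np.linalg.inv(p)        # invertible because C has rank n
    return (p @ a @ p_inv, p @ b, u, d)

rng = np.random.default_rng(2)
m, n, p_dim = 2, 3, 5
a = rng.standard_normal((n, n))
a *= 0.9 / np.linalg.norm(a, 2)
r = (a,
     rng.standard_normal((n, m)),
     rng.standard_normal((p_dim, n)),
     rng.standard_normal((p_dim, m)))
r_hat = standardize_tall(r)

print(np.allclose(r_hat[2].T @ r_hat[2], np.eye(n)))   # C^T C = I_n
# The Markov parameters C A^k B (hence the transfer function) are unchanged.
print(np.allclose(r[2] @ r[0] @ r[1], r_hat[2] @ r_hat[0] @ r_hat[1]))
```

The factor U is determined only up to right multiplication by an orthogonal matrix, which is exactly the residual O(n) freedom of the standardized subbundle.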
Other forms of bundle reduction, e.g., in the case of the nontrivial bundle
$\widetilde{SL}^{a,mp,min}_{m,n,p}$, are possible. In particular, via a process known as realization
balancing (see [2, 37]), we can construct a large family of standardized subbundles. For
example, a more sophisticated one is in the case of $\widetilde{SL}^{a,mp,min}_{m,n,p}$ via the notion of
(internal) balancing. Consider the symmetric n × n matrices $W_c = \mathcal{C}_{\infty} \mathcal{C}_{\infty}^{\top}$ and
$W_o = \mathcal{O}_{\infty}^{\top} \mathcal{O}_{\infty}$, which are called controllability and observability Gramians,
respectively, and where $\mathcal{C}_{\infty}$ and $\mathcal{O}_{\infty}$ are called extended controllability and observability
matrices, respectively (see the definitions in Sect. 8.3.3.1 with $k = \infty$). Due to the
minimality assumption, both $W_o$ and $W_c$ are positive definite. Notice that under the
action •, $W_c$ transforms to $P^{-1} W_c P^{-\top}$ and $W_o$ to $P^{\top} W_o P$. Consider the function
$h : GL(n) \to \mathbb{R}$ defined as $h(P) = \operatorname{trace}(P^{-1} W_c P^{-\top} + P^{\top} W_o P)$. It is easy to
see that h is constant on O(n) (more generally, h(PQ) = h(P) for every $Q \in O(n)$).
More importantly, it can be shown that any critical
point $P_1$ of h is a global minimizer and if $P_2$ is any other minimizer then $P_1 = P_2 Q$
for some $Q \in O(n)$ [37]. Minimizing h is called balancing (in the sense of Helmke
[37]). One can show that balancing is, in fact, a standardization in the sense that we
defined (a proof of this fact will appear elsewhere). Note that a more specific form
of balancing called diagonal balancing (due to Moore [52]) is more common in the
control literature; however, that cannot be considered as a form of reduction of the
structure group. The interesting intuitive reason is that it tries to reduce the structure
group beyond the orthogonal group to the identity element, i.e., to get a canonical
form (see also [55]). However, it fails in the sense that, as mentioned above, it cannot
give a smooth canonical form, i.e., a section which is diffeomorphic to $SL^{a,mp,min}_{m,n,p}$.
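The balancing function h can be explored numerically. The sketch below is an illustration under assumptions, not the construction used in [2, 37]: the Gramians are obtained by solving the discrete Lyapunov equations $W_c = A W_c A^{\top} + B B^{\top}$ and $W_o = A^{\top} W_o A + C^{\top} C$ with a Kronecker-product linear solve (adequate only for small n), and one minimizer of h is built with the standard Cholesky/eigendecomposition recipe that makes both transformed Gramians equal to the same diagonal matrix. All function names are hypothetical.

```python
import numpy as np

def gramian(a, q):
    """Solve W = A W A^T + Q for a stable A (row-major vectorization)."""
    n = a.shape[0]
    w = np.linalg.solve(np.eye(n * n) - np.kron(a, a), q.reshape(-1))
    return w.reshape(n, n)

def h(p, wc, wo):
    """h(P) = trace(P^{-1} Wc P^{-T} + P^T Wo P)."""
    p_inv = np.linalg.inv(p)
    return float(np.trace(p_inv @ wc @ p_inv.T + p.T @ wo @ p))

def balancing_transform(wc, wo):
    """Return one P with P^{-1} Wc P^{-T} = P^T Wo P = diag(sigma)."""
    chol = np.linalg.cholesky(wc)                   # Wc = L L^T
    s2, u = np.linalg.eigh(chol.T @ wo @ chol)      # L^T Wo L = U diag(sigma^2) U^T
    sigma = np.sqrt(s2)
    return chol @ u / np.sqrt(sigma), sigma         # P = L U diag(sigma)^{-1/2}

a = np.array([[0.5, 0.2], [0.0, 0.3]])              # stable, minimal example
b = np.array([[1.0], [1.0]])
c = np.array([[1.0, 0.0]])

wc = gramian(a, b @ b.T)
wo = gramian(a.T, c.T @ c)
p_bal, sigma = balancing_transform(wc, wo)

print(h(np.eye(2), wc, wo), h(p_bal, wc, wo), 2 * sigma.sum())
```

At the computed P the value of h equals twice the sum of the entries of `sigma`, and multiplying P on the right by any orthogonal matrix leaves h unchanged, in line with the statement that minimizers are unique only up to O(n).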

8.5 Extrinsic Quotient Geometry and the Alignment Distance

In this section, we propose to use the large class of extrinsic unitary invariant distances
on a standardized realization subbundle to build distances on the LDS base space.
The main benefits are that such distances are abundant, the ambient space is not
too large (e.g., not infinite dimensional), and calculating the distance in the base
space boils down to a static optimization problem (albeit non-convex). Specifically,
let $\tilde{d}_{\widetilde{O\Gamma}_{m,n,p}}$ be a unitary invariant distance on a standardized realization subbundle
$\widetilde{O\Gamma}_{m,n,p}$ with the base $\Gamma_{m,n,p}$ (as in Theorem 1). One example of such a distance is

\[
\tilde{d}^{2}_{\widetilde{O\Gamma}_{m,n,p}}(R_1, R_2) = \lambda_A \|A_1 - A_2\|_F^2 + \lambda_B \|B_1 - B_2\|_F^2 + \lambda_C \|C_1 - C_2\|_F^2 + \lambda_D \|D_1 - D_2\|_F^2, \tag{8.14}
\]

where $\lambda_A, \lambda_B, \lambda_C, \lambda_D > 0$ are constants and $\|\cdot\|_F$ is the matrix Frobenius norm.
A group action induced distance (called the alignment distance) between two LDSs
$M_1, M_2 \in \Gamma_{m,n,p}$ with realizations $R_1, R_2 \in \widetilde{O\Gamma}_{m,n,p}$ is found by solving the
realization alignment problem (see Fig. 8.2b)

\[
d^{2}_{\Gamma_{m,n,p}}(M_1, M_2) = \min_{(Q, \Phi) \in O(n) \times O(m)} \tilde{d}^{2}_{\widetilde{O\Gamma}_{m,n,p}}\big((Q, \Phi) \bullet R_1, R_2\big). \tag{8.15}
\]

In [39] a fast algorithm is developed which (with little modification) can be used to
compute this distance.
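To give a feel for the alignment problem (8.15), the sketch below computes an upper bound on the alignment distance for the (A, C) part only (i.e., (8.14) with the B and D terms dropped). It is not the Jacobi algorithm of [39]: instead of minimizing the full cost, it aligns the C matrices alone by solving an orthogonal Procrustes problem with an SVD and then evaluates the cost at that Q. The function names are hypothetical, and the resulting Q is in general only a candidate (or a starting point for a local search), not the global minimizer.

```python
import numpy as np

def ambient_dist_sq(r1, r2, lam_a=1.0, lam_c=1.0):
    """Cost (8.14) restricted to the (A, C) part of two realizations."""
    (a1, c1), (a2, c2) = r1, r2
    return float(lam_a * np.linalg.norm(a1 - a2, "fro") ** 2
                 + lam_c * np.linalg.norm(c1 - c2, "fro") ** 2)

def act(q, r):
    """Action of Q in O(n) on the (A, C) part: (Q^T A Q, C Q)."""
    a, c = r
    return (q.T @ a @ q, c @ q)

def procrustes_alignment(r1, r2):
    """Q in O(n) minimizing ||C1 Q - C2||_F (orthogonal Procrustes)."""
    u, _, vt = np.linalg.svd(r1[1].T @ r2[1])
    return u @ vt

def alignment_upper_bound(r1, r2):
    """Cost at the Procrustes candidate; an upper bound on (8.15)."""
    q = procrustes_alignment(r1, r2)
    return ambient_dist_sq(act(q, r1), r2)

rng = np.random.default_rng(3)
n, p_dim = 3, 6
c1, _ = np.linalg.qr(rng.standard_normal((p_dim, n)))   # standardized: C^T C = I_n
r1 = (0.5 * rng.standard_normal((n, n)), c1)
q_true, _ = np.linalg.qr(rng.standard_normal((n, n)))
r2 = act(q_true, r1)                                    # same LDS, other realization

print(ambient_dist_sq(r1, r2))         # unaligned cost is positive
print(alignment_upper_bound(r1, r2))   # essentially zero after alignment
```

When the two realizations lie on the same fiber and $C^{\top} C = I_n$, the Procrustes step recovers the aligning Q exactly, so the bound is tight; for two genuinely different systems it only bounds the alignment distance from above.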
Remark 4 We stress that, via the identification of a process with its canonical spectral
factors (Proposition 1 and Theorem 1), $d_{\Gamma_{m,n,p}}(\cdot, \cdot)$ is (or induces) a distance on
the space of processes generated by the LDSs in $\Gamma_{m,n,p}$. Therefore, in the spirit
of distances studied in Sect. 8.2 we could have written $d_{\Gamma_{m,n,p}}(y_1, y_2)$ instead of
$d_{\Gamma_{m,n,p}}(M_1, M_2)$, where $y_1$ and $y_2$ are the processes generated by $M_1$ and $M_2$ when
excited by the standard Gaussian process. However, the chosen notation seems more
convenient.
Remark 5 Calling the static global minimization problem (8.15) “easy” in
absolute terms is an oversimplification. However, even this global minimization
over orthogonal matrices is definitely simpler than solving the nonlinear geodesic
ODEs and finding shortest geodesics globally (an infinite-dimensional dynamic
programming problem). It is our ongoing research to develop fast and reliable
algorithms to solve (8.15). Our experiments indicate that the Jacobi algorithm in [39] is
quite effective in finding global minimizers.
In [1], this distance was first introduced on $SL^{a,mp,tC}_{m,n,p}$ with the standardized
subbundle $\widetilde{OSL}^{a,mp,tC}_{m,n,p}$. The distance was used for efficient video sequence
classification (using 1-nearest neighbor and nearest mean methods) and clustering
(e.g., via defining averages or a k-means like algorithm). However, it should be
mentioned that in video applications (for reasons which are not completely understood)
the comparison of LDSs based on the (A, C) part in (8.1) has proven quite effective
(in fact, such distances are more commonly used than distances based on
comparing the full model). Therefore, in [1], the alignment distance (8.15) with parameters
$\lambda_B = \lambda_D = 0$ was used, see (8.14). An algorithm called align and average is
developed to do averaging on $SL^{a,mp,tC}_{m,n,p}$ (see also [2]). One defines the average $\bar{M}$ of
LDSs $\{M_i\}_{i=1}^{N} \subset SL^{a,mp,tC}_{m,n,p}$ (the so-called Fréchet mean or average) as a minimizer
of the sum of the squares of distances:

\[
\bar{M} = \operatorname*{argmin}_{M} \sum_{i=1}^{N} d^{2}_{SL^{a,mp,tC}_{m,n,p}}(M, M_i). \tag{8.16}
\]

The align and average algorithm is essentially an alternating minimization algorithm
to find a solution. In each step it aligns the realizations of the LDSs
$M_i$ to that of the current estimated average, then a Euclidean average of the aligned
realizations is found and afterwards the found C matrix is orthonormalized, and the
algorithm iterates these steps till convergence (see [1, 2] for more details). A nice
feature of this algorithm is that (generically) the average LDS $\bar{M}$ by construction will
be of order n and minimum phase (and under certain conditions stable). An interesting
question is whether the average model found this way is asymptotically stable, by
construction. The most likely answer is, in general, negative. However, in a special
case it can be positive. Let $\|A\|_2$ denote the 2-norm (i.e., the largest singular value)
of the matrix A. In the case the standardized realizations $R_i \in \widetilde{OSL}^{a,mp,tC}_{m,n,p}$
($1 \le i \le N$) are such that $\|A_i\|_2 < 1$ ($1 \le i \le N$), then by construction the 2-norm of
the A matrix of the average LDS will also be less than 1. Hence, the average LDS
will be asymptotically stable. Moreover, as mentioned in Sect. 8.4.3, in the case of
$SL^{a,mp,min}_{m,n,p}$ we may employ the subbundle of balanced realizations as the standardized
subbundle. It turns out that in this case preserving stability (by construction) can be
easier, but the averaging algorithm gets more involved (see [2] for some more details).
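The stability claim for the special case of 2-norms below 1 can be made explicit. As a short derivation, assume (this is an assumption made here for illustration, suggested by the description of the Euclidean averaging step) that the A matrix of the average is the arithmetic mean of the aligned matrices $Q_i^{\top} A_i Q_i$ with $Q_i \in O(n)$. Then, by the triangle inequality and the orthogonal invariance of the 2-norm,

\[
\|\bar{A}\|_2 = \left\| \frac{1}{N} \sum_{i=1}^{N} Q_i^{\top} A_i Q_i \right\|_2
\le \frac{1}{N} \sum_{i=1}^{N} \|Q_i^{\top} A_i Q_i\|_2
= \frac{1}{N} \sum_{i=1}^{N} \|A_i\|_2 < 1 .
\]

Since the spectral radius of a matrix is bounded by its 2-norm, all eigenvalues of $\bar{A}$ lie strictly inside the unit circle, which is the asymptotic stability stated above. The same argument does not apply when only the spectral radii of the $A_i$ are below 1, because the spectral radius is not a norm.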
Obviously, the above alignment distance based on (8.14) is only an example. In a
pattern recognition application, a large class of such distances can be constructed and
among them a suitable one can be chosen or they can be combined in a machine learn-
ing framework (such distances may even correspond to different standardizations).

8.5.1 Extensions

Now, we briefly point to some possible directions along which this basic idea can
be extended (see also [2]). First, note that the Frobenius norm in (8.14) can be
replaced by any other unitary invariant matrix norm (e.g., the nuclear norm). A less
trivial extension is to get rid of O(m) in (8.15) by passing to covariance matrices.
For example, in the case of $\widetilde{OSL}^{a,mp,tC}_{m,n,p}$ it is easy to verify that
$SL^{a,mp,tC}_{m,n,p} = \widetilde{OSL}^{a,mp,tC,cv}_{m,n,p}/(O(n) \times I_m)$, where
$\widetilde{OSL}^{a,mp,tC,cv}_{m,n,p} = \{(A, Z, C, S) \mid (A, B, C, D) \in \widetilde{OSL}^{a,mp,tC}_{m,n,p},\ Z = B B^{\top},\ S = D D^{\top}\}$.
On this standardized subspace one only has the
action of O(n) which we denote as $Q \star (A, Z, C, S) = (Q^{\top} A Q, Q^{\top} Z Q, C Q, S)$.
One can use the same ambient distance on this space as in (8.14) and get

\[
d^{2}_{\Gamma_{m,n,p}}(M_1, M_2) = \min_{Q \in O(n)} \tilde{d}^{2}_{\widetilde{O\Gamma}_{m,n,p}}\big(Q \star R_1, R_2\big), \tag{8.17}
\]

for realizations $R_1, R_2 \in \widetilde{OSL}^{a,mp,tC,cv}_{m,n,p}$. One could also replace the $\|\cdot\|_F$ in the
terms associated with B and D in (8.14) with some known distances in the spaces
of positive definite matrices or positive-semi-definite matrices of fixed rank (see
e.g., [14, 63]). Another possible extension is, e.g., to consider other submanifolds
of $\widetilde{OSL}^{a,mp,tC}_{m,n,p}$, e.g., a submanifold where $\|C\|_F = \|B\|_F = 1$. In this case the
corresponding alignment distance is essentially a scale invariant distance, i.e., two
processes which are scaled versions of one another will have zero distance. A more
significant and subtle extension is to extend the underlying space of LDSs of fixed
size and order n to that of fixed size but (minimal) order not larger than n. The details
of this approach will appear elsewhere.

8.6 Conclusion

In this paper our focus was the geometrization of spaces of stochastic processes
generated by LDSs of fixed size and order, for use in pattern recognition of high-
dimensional time-series data (e.g., in the prototype Problem 1). We reviewed some
of the existing approaches. We then studied the newly developed class of group
action induced distances called the alignment distances. The approach is a general
and flexible geometrization framework, based on the quotient structure of the space
of such LDSs, which leads to a large class of extrinsic distances. The theory of
alignment distances and their properties is still in early stages of development and
we are hopeful to be able to tackle some interesting problems in control theory as
well as pattern recognition in time-series data.

Acknowledgments The authors are thankful to the anonymous reviewers for their insightful
comments and suggestions, which helped to improve the quality of this paper. The authors also
thank the organizers of the GSI 2013 conference and the editor of this book Prof. Frank Nielsen.
This work was supported by the Sloan Foundation and by grants ONR N00014-09-10084, NSF
0941362, NSF 0941463, NSF 0931805, and NSF 1335035.

References

1. Afsari, B., Chaudhry, R., Ravichandran, A., Vidal, R.: Group action induced distances for
averaging and clustering linear dynamical systems with applications to the analysis of dynamic
visual scenes. In: IEEE Conference on Computer Vision and Pattern Recognition (2012)
2. Afsari, B., Vidal, R.: The alignment distance on spaces of linear dynamical systems. In: IEEE
Conference on Decision and Control (2013)
3. Afsari, B., Vidal, R.: Group action induced distances on spaces of high-dimensional linear
stochastic processes. In: Geometric Science of Information, LNCS, vol. 8085, pp. 425–432
(2013)
4. Amari, S.I.: Differential geometry of a parametric family of invertible linear systems-
Riemannian metric, dual affine connections, and divergence. Math. Syst. Theory 20, 53–82
(1987)
5. Amari, S.I., Nagaoka, H.: Methods of information geometry. In: Translations of Mathematical
Monographs, vol. 191. American Mathematical Society, Providence (2000)
6. Anderson, B.D., Deistler, M.: Properties of zero-free spectral matrices. IEEE Trans. Autom.
Control 54(10), 2365–5 (2009)
7. Aoki, M.: State Space Modeling of Time Series. Springer, Berlin (1987)
8. Barbaresco, F.: Information geometry of covariance matrix: Cartan-Siegel homogeneous
bounded domains, Mostow/Berger fibration and Frechet median. In: Matrix Information Geom-
etry, pp. 199–255. Springer, Berlin (2013)
9. Basseville, M.: Distance measures for signal processing and pattern recognition. Sig. Process.
18, 349–9 (1989)
10. Basseville, M.: Divergence measures for statistical data processing: an annotated bibliography.
Sig. Process. 93(4), 621–33 (2013)
11. Bauer, D., Deistler, M.: Balanced canonical forms for system identification. IEEE Trans.
Autom. Control 44(6), 1118–1131 (1999)
12. Béjar, B., Zappella, L., Vidal, R.: Surgical gesture classification from video data. In: Medical
Image Computing and Computer Assisted Intervention, pp. 34–41 (2012)
13. Boets, J., Cock, K.D., Moor, B.D.: A mutual information based distance for multivariate
Gaussian processes. In: Modeling, Estimation and Control, Festschrift in Honor of Giorgio
Picci on the Occasion of his Sixty-Fifth Birthday, Lecture Notes in Control and Information
Sciences, vol. 364, pp. 15–33. Springer, Berlin (2007)
14. Bonnabel, S., Collard, A., Sepulchre, R.: Rank-preserving geometric means of positive
semi-definite matrices. Linear Algebra Appl. 438, 3202–16 (2013)
15. Byrnes, C.I., Hurt, N.: On the moduli of linear dynamical systems. In: Advances in Mathemat-
ical Studies in Analysis, vol. 4, pp. 83–122. Academic Press, New York (1979)
16. Chaudhry, R., Vidal, R.: Recognition of visual dynamical processes: Theory, kernels and
experimental evaluation. Technical Report 09–01. Department of Computer Science, Johns
Hopkins University (2009)
17. Chavel, I.: Riemannian Geometry: A Modern Introduction, vol. 98, 2nd edn. Cambridge Uni-
versity Press, Cambridge (2006)
18. Cock, K.D., Moor, B.D.: Subspace angles and distances between ARMA models. Syst. Control
Lett. 46(4), 265–70 (2002)
19. Corduas, M., Piccolo, D.: Time series clustering and classification by the autoregressive metric.
Comput. Stat. Data Anal. 52(4), 1860–72 (2008)
20. Deistler, M., Anderson, B.O., Filler, A., Zinner, C., Chen, W.: Generalized linear dynamic
factor models: an approach via singular autoregressions. Eur. J. Control 3, 211–24 (2010)
21. Devroye, L.: A Probabilistic Theory of Pattern Recognition, vol. 31. Springer, Berlin (1996)
22. Doretto, G., Chiuso, A., Wu, Y., Soatto, S.: Dynamic textures. Int. J. Comput. Vision 51(2),
91–109 (2003)
23. Ferrante, A., Pavon, M., Ramponi, F.: Hellinger versus Kullback-Leibler multivariable spec-
trum approximation. IEEE Trans. Autom. Control 53(4), 954–67 (2008)
24. Forni, M., Hallin, M., Lippi, M., Reichlin, L.: The generalized dynamic-factor model: Identi-
fication and estimation. Rev. Econ. Stat. 82(4), 540–54 (2000)
25. Georgiou, T.T., Karlsson, J., Takyar, M.S.: Metrics for power spectra: an axiomatic approach.
IEEE Trans. Signal Process. 57(3), 859–67 (2009)
26. Gray, R., Buzo, A., Gray Jr, A., Matsuyama, Y.: Distortion measures for speech processing.
IEEE Trans. Acoust. Speech Signal Process. 28(4), 367–76 (1980)
27. Gray, R.M.: Probability, Random Processes, and Ergodic Properties. Springer, Berlin (2009)
28. Gray, R.M., Neuhoff, D.L., Shields, P.C.: A generalization of Ornstein’s d̄ distance with appli-
cations to information theory. The Ann. Probab. 3, 315–328 (1975)
29. Gray Jr, A., Markel, J.: Distance measures for speech processing. IEEE Trans. Acoust. Speech
Signal Process. 24(5), 380–91 (1976)
30. Grenander, U.: Abstract Inference. Wiley, New York (1981)
31. Hannan, E.J.: Multiple Time Series, vol. 38. Wiley, New York (1970)
32. Hannan, E.J., Deistler, M.: The Statistical Theory of Linear Systems. Wiley, New York (1987)
33. Hanzon, B.: Identifiability, Recursive Identification and Spaces of Linear Dynamical Systems,
vol. 63–64. Centrum voor Wiskunde en Informatica (CWI), Amsterdam (1989)
34. Hanzon, B., Marcus, S.I.: Riemannian metrics on spaces of stable linear systems, with appli-
cations to identification. In: IEEE Conference on Decision & Control, pp. 1119–1124 (1982)
35. Hastie, T., Tibshirani, R., Friedman, J.H.: The Elements of Statistical Learning. Springer, New
York (2003)
36. Hazewinkel, M.: Moduli and canonical forms for linear dynamical systems II: the topological
case. Math. Syst. Theory 10, 363–85 (1977)
37. Helmke, U.: Balanced realizations for linear systems: a variational approach. SIAM J. Control
Optim. 31(1), 1–15 (1993)
38. Jiang, X., Ning, L., Georgiou, T.T.: Distances and Riemannian metrics for multivariate spectral
densities. IEEE Trans. Autom. Control 57(7), 1723–35 (2012)
39. Jimenez, N.D., Afsari, B., Vidal, R.: Fast Jacobi-type algorithm for computing distances
between linear dynamical systems. In: European Control Conference (2013)
40. Kailath, T.: Linear Systems. Prentice Hall, NJ (1980)
41. Katayama, T.: Subspace Methods for System Identification. Springer, Berlin (2005)
42. Kazakos, D., Papantoni-Kazakos, P.: Spectral distance measures between Gaussian processes.
IEEE Trans. Autom. Control 25(5), 950–9 (1980)
43. Kendall, D.G., Barden, D., Carne, T.K., Le, H.: Shape and Shape Theory. Wiley Series In
Probability And Statistics. Wiley, New York (1999)
44. Kobayashi, S., Nomizu, K.: Foundations of Differential Geometry Volume I. Wiley Classics
Library Edition. Wiley, New York (1963)
45. Krishnaprasad, P.S.: Geometry of Minimal Systems and the Identification Problem. PhD thesis,
Harvard University (1977)
46. Krishnaprasad, P.S., Martin, C.F.: On families of systems and deformations. Int. J. Control
38(5), 1055–79 (1983)
47. Lee, J.M.: Introduction to Smooth Manifolds. Springer, Graduate Texts in Mathematics (2002)
48. Liao, T.W.: Clustering time series data—a survey. Pattern Recogn. 38, 1857–74 (2005)
49. Makhoul, J.: Linear prediction: a tutorial review. Proc. IEEE 63(4), 561–80 (1975)
50. Martin, A.: A metric for ARMA processes. IEEE Trans. Signal Process. 48(4), 1164–70 (2000)
51. Moor, B.D., Overschee, P.V., Suykens, J.: Subspace algorithms for system identification and
stochastic realization. Technical Report ESAT-SISTA Report 1990–28, Katholieke Universiteit
Leuven (1990)
52. Moore, B.C.: Principal component analysis in linear systems: Controllability, observability,
and model reduction. IEEE Trans. Autom. Control 26, 17–32 (1981)
53. Ning, L., Georgiou, T.T., Tannenbaum, A.: Matrix-valued Monge-Kantorovich optimal mass
transport. arXiv, preprint arXiv:1304.3931 (2013)
54. Nocerino, N., Soong, F.K., Rabiner, L.R., Klatt, D.H.: Comparative study of several distortion
measures for speech recognition. Speech Commun. 4(4), 317–31 (1985)
55. Ober, R.J.: Balanced realizations: canonical form, parametrization, model reduction. Int. J.
Control 46(2), 643–70 (1987)
56. Papoulis, A., Pillai, S.U.: Probability, random variables and stochastic processes with errata
sheet. McGraw-Hill Education, New York (2002)
57. Piccolo, D.: A distance measure for classifying ARIMA models. J. Time Ser. Anal. 11(2),
153–64 (1990)
58. Rabiner, L., Juang, B.-H.: Fundamentals of Speech Recognition. Prentice-Hall International,
NJ (1993)
59. Rao, M.M.: Stochastic Processes: Inference Theory, vol. 508. Springer, New York (2000)
60. Ravichandran, A., Vidal, R.: Video registration using dynamic textures. IEEE Trans. Pattern
Anal. Mach. Intell. 33(1), 158–171 (2011)
61. Ravishanker, N., Melnick, E.L., Tsai, C.-L.: Differential geometry of ARMA models. J. Time
Ser. Anal. 11(3), 259–274 (1990)
62. Rozanov, Y.A.: Stationary Random Processes. Holden-Day, San Francisco (1967)
63. Vandereycken, B., Absil, P.-A., Vandewalle, S.: A Riemannian geometry with complete
geodesics for the set of positive semi-definite matrices of fixed rank. Technical Report
TW572, Katholieke Universiteit Leuven (2010)
64. Vishwanathan, S., Smola, A., Vidal, R.: Binet-Cauchy kernels on dynamical systems and its
application to the analysis of dynamic scenes. Int. J. Comput. Vision 73(1), 95–119 (2007)
65. Youla, D.: On the factorization of rational matrices. IRE Trans. Inf. Theory 7(3), 172–189
(1961)
66. Younes, L.: Shapes and Diffeomorphisms. In: Applied Mathematical Sciences, vol. 171.
Springer, New York (2010)
Chapter 9
Discrete Ladders for Parallel Transport
in Transformation Groups with an Affine
Connection Structure

Marco Lorenzi and Xavier Pennec

9.1 Introduction

The analysis of complex information in medical imaging and in computer vision
often requires representing data in suitable manifolds in which we need to compute
trajectories, distances, means and statistical modes. An important example is found in
computational anatomy, which aims at developing statistical models of the anatomi-
cal variability of organs and tissues. Following Thompson [14], we can assume that
the variability of a given observation of the population is encoded by a spatial defor-
mation of a template shape or image (called an atlas in computational anatomy). The
analysis of deformations thus enables the understanding of phenotypic variation and
morphological traits in populations.
Deformations of images and shapes are usually estimated by image registration.
Among the numerous methods used for registering medical images, the rich math-
ematical setting of diffeomorphic non-linear registration is particularly appealing
since it provides elegant and grounded methods for atlas building [21], group-wise
[9], and longitudinal statistical analysis of deformations [7, 15, 17]. In particular,
temporal evolutions of anatomies can be modeled by transformation trajectories in the
space of diffeomorphisms. However, developing population-based models requires
reliable methods for comparing different trajectories.
Among the different techniques proposed so far [10, 17, 37], parallel transport
represents a promising method which relies on a solid mathematical background. At
the infinitesimal level, a trajectory is a tangent vector to a transformation. Parallel
transport consists in transporting the infinitesimal deformation vector across the

M. Lorenzi (corresponding author) · X. Pennec
Asclepios Research Project, INRIA Sophia Antipolis, 2004 route des Lucioles BP 93,
06902 Sophia Antipolis, France

F. Nielsen (ed.), Geometric Theory of Information,
Signals and Communication Technology, DOI: 10.1007/978-3-319-05317-2_9,
© Springer International Publishing Switzerland 2014
manifold by preserving its properties with respect to the space geometry. It is one
of the fundamental operations of differential geometry, which enables the comparison of
tangent vectors, and thus of the underlying trajectories, across the whole manifold.
For this reason, parallel transport in transformation groups is currently an impor-
tant field of research with applications in medical imaging to the development of
spatio-temporal atlases for brain images [26], the study of hippocampal shapes [35],
or cardiac motion [1]. In computer vision one also finds applications to motion track-
ing and more generally to statistical analysis [12, 42, 45, 47].
Even though the notion of parallel transport is fairly intuitive, its
practical implementation requires the precise knowledge of the space geometry, in
particular of the underlying connection. This is not always easy, especially in infinite
dimensions such as in the setting of diffeomorphic image registration. Moreover,
parallel transport is a continuous operation involving the computation of (covariant)
derivatives which, from the practical point of view, might lead to issues concern-
ing numerical stability and robustness. These issues are related to the unavoidable
approximations arising when continuous energy functionals and operators are dis-
cretized on grids, especially concerning the evaluation of derivatives through finite
difference schemes.
The complexity and limitations deriving from the direct computation of continuous
parallel transport methods can be alleviated by considering discrete approximations.
In the 1970s, [31] proposed a scheme for performing the parallel transport with a
very simple geometrical construction. This scheme was called Schild’s ladder since
it was in the spirit of the work of the theoretical physicist Alfred Schild. The
computational interest of Schild’s ladder resides in its generality, since it enables
the transport of vectors in manifolds by computing geodesics only. This way, the
implementation of covariant derivatives is no longer required, and we can concentrate
the implementation effort on the geodesics only.
We recently showed that numerical schemes derived from the Schild’s ladder can
be effectively applied in the setting of diffeomorphic image registration, by appro-
priately taking advantage of the underlying geometrical setting [28, 29]. Based on
this experience, we believe that discrete transport methods represent promising and
powerful techniques for the analysis of transformations due to their simplicity and
generality. Bearing in mind the applicative context of the development of transport
techniques, this chapter aims to illustrate the principles of discrete schemes for par-
allel transport in smooth groups equipped with affine connection or Riemannian
structures.
The chapter is structured as follows. In Sect. 9.2 we provide fundamental notions of
finite-dimensional Lie Groups and Riemannian geometry concerning affine connec-
tion and covariant derivative. These notions are the basis for the continuous parallel
transport methods defined in Sect. 9.3. In Sect. 9.4 we introduce the Schild’s ladder.
After detailing its construction and mathematical properties, we derive from it the
more efficient Pole ladder construction in Sect. 9.4.2. These theoretical concepts are
then contextualized and discussed in the applicative setting of diffeomorphic image
registration, in which some limitations arise when considering infinite dimensional
Lie groups (Sect. 9.5). Finally, after illustrating numerical implementations of the
aforementioned discrete transport methods, we show in Sect. 9.6 their effectiveness
when applied to the very practical problem of statistical analysis of the group-wise
longitudinal brain changes in Alzheimer’s disease.
This chapter summarizes and contextualizes a series of contributions concerning
the theoretical foundations of diffeomorphic registration parametrized by stationary
velocity fields (SVFs) and discrete transport methods. The application of Schild’s
ladder in the context of image registration was first presented in [29]. In [28], we
highlighted the geometrical basis of the SVFs registration setting by investigating its
theoretical foundations and its connections with Lie group theory and affine geometry.
These insights allowed us to define a novel discrete transport scheme, the pole
ladder, which optimizes Schild’s ladder by taking advantage of the geometrical
properties of the SVF setting [27].

9.2 Basics of Lie Groups

We recall here the notions of Lie group theory and affine geometry that
will be used extensively in the following sections.
A Lie group G is a smooth manifold provided with an identity element id, a smooth
associative composition rule (g, h) ∈ G × G ↦ gh ∈ G and a smooth inversion
rule g ↦ g^{−1}, which are both compatible with the differential manifold structure.
As such, we have a tangent space T_g G at each point g ∈ G. A vector field X is a
smooth function that assigns a tangent vector X|_g ∈ T_g G to each point g of the
manifold. The set of vector fields (the smooth sections of the tangent bundle) is
denoted T(G). Vector fields can be viewed as the directional (or Lie) derivative of
a scalar function α along the vector field at each point:
∂_X α|_g = d/dt α(τ_t)|_{t=0}, where τ_t is the flow of X and τ_0 = g. Composing
directional derivatives ∂_X ∂_Y α leads in general to a second-order derivation. However,
we can remove the second-order terms by subtracting ∂_Y ∂_X α (this can be checked
by writing these expressions in a local coordinate system). We obtain the Lie bracket
that acts as an internal multiplication in the algebra of vector fields:

$$ [X, Y](\alpha) = \partial_X \partial_Y \alpha - \partial_Y \partial_X \alpha. $$
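
To make the cancellation of the second-order terms explicit, here is a short worked computation in local coordinates. It is only an illustration: the coordinates x^i with ∂_i = ∂/∂x^i, the components X = X^i ∂_i and Y = Y^j ∂_j, and the summation over repeated indices are notations assumed for this sketch.

```latex
% Assumed notation: X = X^i \partial_i, Y = Y^j \partial_j, sum over repeated indices.
\begin{align*}
\partial_X \partial_Y \alpha
  &= X^i \partial_i \left( Y^j \partial_j \alpha \right)
   = X^i (\partial_i Y^j)\, \partial_j \alpha + X^i Y^j\, \partial_i \partial_j \alpha, \\
\partial_Y \partial_X \alpha
  &= Y^i (\partial_i X^j)\, \partial_j \alpha + X^i Y^j\, \partial_i \partial_j \alpha, \\
[X, Y](\alpha)
  &= \left( X^i \partial_i Y^j - Y^i \partial_i X^j \right) \partial_j \alpha .
\end{align*}
```

The second-order terms coincide because partial derivatives of a smooth function commute, so the bracket is again a first-order operator, that is, a vector field.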

Given a group element a ∈ G, we call left translation L_a the composition with
the fixed element a on the left: L_a : g ∈ G ↦ ag ∈ G. The differential DL_a of the
left translation maps the tangent space T_g G to the tangent space T_{ag} G. We say that
a vector field X ∈ T(G) is left-invariant if it remains unchanged under the action of
the left translation: DL_a X|_g = X|_{ag}. The sub-algebra of left-invariant vector fields
is closed under the Lie bracket and is called the Lie algebra g of the Lie group. Since
a left-invariant vector field is uniquely determined by its value at identity through
the one-to-one map X̃|_g = DL_g X, the Lie algebra can be identified with the tangent
space at the identity T_id G. One should notice that any smooth vector field can be
written as a linear combination of left-invariant vector fields with smooth functional
coefficients.
Left-invariant vector fields are complete in the sense that their flow τ_t is defined
for all time. Moreover, this flow is such that τ_t(g) = g τ_t(id) by left invariance. The
map X ↦ τ_1(id) of g into G is called the Lie group exponential and denoted by exp. In
particular, the group exponential defines the one-parameter subgroup associated with
the vector X and has the following properties:
• τ_t(id) = exp(tX), for each t ∈ R;
• exp((t + s)X) = exp(tX) exp(sX), for each t, s ∈ R.
In finite dimension, it can be shown that the Lie group exponential is a diffeomorphism
from a neighborhood of 0 in g to a neighborhood of id in G.
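
As an illustration of the one-parameter subgroup property, the following minimal Python sketch checks it numerically on the rotation group SO(2). The choice of group, the truncated Taylor series used for the matrix exponential, and the helper names (mat_mul, mat_exp, scale) are assumptions made for this example only.

```python
import math

def mat_mul(a, b):
    """Product of two square matrices stored as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(x, terms=30):
    """Matrix exponential by truncated Taylor series (adequate for small norms)."""
    n = len(x)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        # term now holds x^k / k!
        term = [[v / k for v in row] for row in mat_mul(term, x)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result

def scale(x, t):
    return [[t * v for v in row] for row in x]

# Generator of planar rotations: an element X of the Lie algebra so(2).
X = [[0.0, -1.0], [1.0, 0.0]]
t, s = 0.7, 1.1

lhs = mat_exp(scale(X, t + s))                              # exp((t + s)X)
rhs = mat_mul(mat_exp(scale(X, t)), mat_exp(scale(X, s)))   # exp(tX) exp(sX)
max_gap = max(abs(lhs[i][j] - rhs[i][j]) for i in range(2) for j in range(2))
print("largest entry-wise difference:", max_gap)
```

For this generator, exp(tX) is the rotation of angle t, so the identity reduces to the composition of two planar rotations.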
For each tangent vector X ∈ g, the one-parameter subgroup exp(tX) is a curve
that starts from identity with this tangent vector. One could ask whether this curve
can be seen as a geodesic, as in Riemannian manifolds. To answer this question,
we first need to define what geodesics are. In a Euclidean space, straight lines are
curves which have the same tangent vector at all times. In a manifold, tangent vectors
at different times belong to different tangent spaces. When one wants to compare
tangent vectors at different points, one needs to define a specific mapping between
their tangent spaces: this is the notion of parallel transport. There is generally no
way to define globally a linear operator Δ_g^h : T_g G → T_h G which is consistent
with composition (i.e. Δ_g^h ∘ Δ_f^g = Δ_f^h). However, specifying the parallel transport
for infinitesimal displacements allows integrating along a path, thus resulting in a
parallel transport that depends on the path. This specification of the parallel transport
for infinitesimal displacements is called the (affine) connection.

9.2.1 Affine Connection Spaces

An affine connection on G is an operator which assigns to each X ∈ T(G) a linear
mapping ∇_X : T(G) → T(G) such that, for all vector fields X, Y ∈ T(G) and
smooth functions f, g ∈ C^∞(G, R),

$$ \nabla_{fX + gY} = f \nabla_X + g \nabla_Y \quad \text{(Linearity)}; \tag{9.1} $$

$$ \nabla_X (fY) = f \nabla_X (Y) + (Xf)\, Y \quad \text{(Leibniz rule)}. \tag{9.2} $$

The affine connection is therefore a derivation on the tangent space which infinitesimally
maps tangent vectors from one tangent plane to another.
The connection gives rise to two very important geometrical objects: the torsion and
curvature tensors. The torsion quantifies the failure to close infinitesimal geodesic
parallelograms:

$$ T(X, Y) = \nabla_X Y - \nabla_Y X - [X, Y], $$

while the curvature measures the local deviation of the space from being flat, and is
defined as

$$ R(X, Y)Z = \nabla_X \nabla_Y Z - \nabla_Y \nabla_X Z - \nabla_{[X,Y]} Z. $$

Once the manifold is provided with a connection, it is possible to generalize the
notion of “straight line”: a vector field X is parallel along a curve δ(t) if ∇_{δ̇(t)} X = 0
for each t. A path δ(t) on G is said to be straight or geodesic if ∇_{δ̇} δ̇ = 0, i.e. if its
tangent vector remains parallel to itself along the path.
In a local coordinate system, the geodesic equation is a second-order differential
equation. Thus, given a point p ∈ G and a vector X ∈ T_p G, there exists a unique
geodesic δ(t, p, X) that passes through p with velocity X at the instant t = 0 [34].
We therefore define the affine exponential as the map exp : G × T(G) → G
given by exp_p(X) = δ(1, p, X).
If, as in the Euclidean case, we want to associate to the straight lines the property
of minimizing the distance between points, we need to provide the group G with a
Riemannian manifold structure, i.e. with a metric operator g on the tangent space. In
this case there is a unique connection, called the Levi Civita connection, which, for
each X, Y, Z ∈ T(G):
• Preserves the metric, i.e. the parallel transport along any curve connecting f to g
is an isometry:

$$ g(X, Y)_g = g(\Delta_g^f X, \Delta_g^f Y)_f. $$

• Is torsion free:

$$ \nabla_X Y - \nabla_Y X = [X, Y], $$

thus the parallel transport is symmetric with respect to the Lie bracket.
By choosing the Levi Civita connection of a given Riemannian metric, the affine
geodesics are the length minimizing paths (i.e. classical Riemannian geodesics).
However, given a general affine connection, there may not exist any Riemannian
metric for which affine geodesics are length minimizing.

9.2.2 Cartan-Schouten Connections

Given an affine connection ∇ and a vector X in T_id G, we can therefore define two
curves on G passing through id and having X as tangent vector, one given by the Lie
group exponential exp and the other given by the affine exponential exp_id. When do
they coincide?
The connection ∇ on G is left-invariant if, for each left translation L_a (a ∈ G)
and any vector fields X and Y, we have ∇_{DL_a X}(DL_a Y) = DL_a ∇_X(Y). Using two
left-invariant vector fields X̃, Ỹ ∈ g generated by the tangent vectors X, Y ∈ T_id G, we see
that ∇_X̃ Ỹ is itself a left-invariant vector field generated by its value at identity. Since
a connection is completely determined by its action on the left-invariant vector fields
(we can recover the connection on arbitrary vector fields using Eqs. (9.1) and (9.2)
from their decomposition on the Lie algebra), we conclude that each left-invariant
connection ∇ is uniquely determined by a product ψ (bilinear operator)
on T_id G through

$$ \psi(X, Y) = \nabla_{\tilde X} \tilde Y \big|_{id}. $$

Notice that such a product can be uniquely decomposed into a commutative part
ψ′(X, Y) = 1/2 (ψ(X, Y) + ψ(Y, X)) and a skew-symmetric part
ψ″(X, Y) = 1/2 (ψ(X, Y) − ψ(Y, X)).


The symmetric part specifies the geodesics (i.e. the parallel transport of a vector
along its own direction) while the skew-symmetric part specifies the torsion which
governs the parallel transport of a vector along a transverse direction (the rotation
around the direction of the curve if we have a metric connection with torsion).
Following [34], a left-invariant connection ∇ on a Lie group G is a Cartan-Schouten
connection if, for any tangent vector X at the identity, the one-parameter
subgroups and the affine geodesics coincide, i.e. exp(tX) = δ(t, id, X). We can see
that a Cartan connection satisfies ψ(X, X) = 0 or, equivalently, is purely skew-symmetric.
The one-dimensional family of connections generated by ψ(X, Y) = φ[X, Y]
obviously satisfies this skew-symmetry condition. Moreover, the connections of this
family are also invariant by right translation [33], and thus also invariant by inversion
since they are already left-invariant. This makes them particularly interesting since
they are fully compatible with all the group operations.
In this family, three connections have special curvature or symmetry properties
and are called the canonical Cartan-Schouten connections [11]. The zero-curvature
connections given by φ = 0, 1 (with torsion T = −[X̃, Ỹ] and T = [X̃, Ỹ] respectively
on left-invariant vector fields) are called the left and right Cartan connections.
The choice of φ = 1/2 averages the left and right Cartan connections. It is
called the symmetric (or mean) Cartan connection. It is torsion-free, but has curvature
R(X̃, Ỹ)Z̃ = −1/4 [[X̃, Ỹ], Z̃].
As a summary, the three canonical Cartan connections of a Lie group are (for two
left-invariant vector fields):

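$$ \nabla_{\tilde X} \tilde Y = 0 \quad \text{Left (Torsion, Flat)}; $$

$$ \nabla_{\tilde X} \tilde Y = \tfrac{1}{2} [\tilde X, \tilde Y] \quad \text{Symmetric (Torsion-Free, Curved)}; $$

$$ \nabla_{\tilde X} \tilde Y = [\tilde X, \tilde Y] \quad \text{Right (Torsion, Flat)}. $$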

Since the three canonical Cartan connections only differ by torsion, they share
the same affine geodesics which are the left and right translations of one parameter
subgroups. In the following, we call them group geodesics. However, the parallel
transport of general vectors along these group geodesics is specific to each connection
as we will see below.
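
The torsion and curvature values quoted for the three canonical connections can be checked numerically. The sketch below is an illustration under stated assumptions: it uses the Lie algebra so(3), identified with R^3 so that the bracket is the cross product, and evaluates T and R on left-invariant vector fields, where ∇_X̃ Ỹ reduces to the product ψ(X, Y) = φ[X, Y] (the parameter φ is called lam in the code). All function names are hypothetical.

```python
def bracket(x, y):
    """Lie bracket of so(3) identified with R^3: the cross product."""
    return (x[1] * y[2] - x[2] * y[1],
            x[2] * y[0] - x[0] * y[2],
            x[0] * y[1] - x[1] * y[0])

def sub(x, y):
    return tuple(a - b for a, b in zip(x, y))

def scale(c, x):
    return tuple(c * a for a in x)

def make_product(lam):
    """Product psi(X, Y) = lam * [X, Y] generating a left-invariant connection."""
    return lambda x, y: scale(lam, bracket(x, y))

def torsion(psi, x, y):
    # T(X, Y) = nabla_X Y - nabla_Y X - [X, Y], evaluated on left-invariant fields
    return sub(sub(psi(x, y), psi(y, x)), bracket(x, y))

def curvature(psi, x, y, z):
    # R(X, Y)Z = nabla_X nabla_Y Z - nabla_Y nabla_X Z - nabla_[X,Y] Z
    return sub(sub(psi(x, psi(y, z)), psi(y, psi(x, z))), psi(bracket(x, y), z))

X, Y, Z = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 0.0, 0.0)
for lam in (0.0, 0.5, 1.0):
    psi = make_product(lam)
    print("lam =", lam, "torsion:", torsion(psi, X, Y),
          "curvature:", curvature(psi, X, Y, Z))
```

On left-invariant fields the definitions give T = (2φ − 1)[X, Y] and, using the Jacobi identity, R(X, Y)Z = (φ² − φ)[[X, Y], Z], which recovers the flat connections with torsion at φ = 0, 1 and the torsion-free connection with curvature −1/4 [[X, Y], Z] at φ = 1/2.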

9.2.3 Riemannian Setting: Levi Civita Connection

Given a metric <X, Y> on the tangent space at identity of a group, one can propagate
this metric to all tangent spaces using left (resp. right) translation to obtain a left-
(resp. right-) invariant Riemannian metric on the group. In the left-invariant case
we have <DL_a X, DL_a Y>_a = <X, Y> and one can show [24] that the Levi Civita
connection is the left-invariant connection generated by the product

$$ \psi(X, Y) = \frac{1}{2} [X, Y] - \frac{1}{2} \left( ad^*(X, Y) + ad^*(Y, X) \right), $$

where the operator ad^* is defined by <ad^*(Y, X), Z> = <[X, Z], Y> for all
X, Y, Z ∈ g. A similar formula can be established for right-invariant metrics using
the algebra of right-invariant vector fields.
We clearly see that this left-invariant Levi Civita connection has a symmetric part
which makes it differ from the symmetric Cartan connection ψ(X, Y) = 1/2 [X, Y]. In
fact, the quantity ad^*(X, X) specifies the rate at which a left-invariant geodesic and
a one-parameter subgroup starting from the identity with the same tangent vector
X deviate from each other. More generally, the condition ad^*(X, X) = 0 for all
X ∈ g turns out to be a necessary and sufficient condition to have a bi-invariant
metric [34]. It is important to notice that geodesics of the left- and right-invariant
metrics differ in general, as there does not exist a bi-invariant metric even for simple
groups like the Euclidean motions [33]. However, right-invariant geodesics can be
easily obtained from the left-invariant ones through inversion: if α(t) is a left-invariant
geodesic joining identity to the transformation α_1, then α^{−1}(t) is a right-invariant
geodesic joining identity to α_1^{−1}.
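
The role of ad^* can be illustrated on two small Lie algebras. The sketch below is a hedged example, not part of the original development: it assumes that the metric at the identity is the standard dot product in the chosen basis, and it compares so(3) (bracket given by the cross product) with the Lie algebra of the affine group of the line, taken here with the bracket [e1, e2] = e2. The function names are hypothetical.

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def basis(n):
    return [tuple(1.0 if i == k else 0.0 for i in range(n)) for k in range(n)]

def ad_star(bracket, y, x):
    """ad*(Y, X), defined by <ad*(Y, X), Z> = <[X, Z], Y> for all Z.

    The basis is assumed orthonormal, so component k is <[X, e_k], Y>.
    """
    return tuple(dot(bracket(x, e), y) for e in basis(len(x)))

def levi_civita_product(bracket, x, y):
    """psi(X, Y) = 1/2 [X, Y] - 1/2 (ad*(X, Y) + ad*(Y, X))."""
    b = bracket(x, y)
    s1 = ad_star(bracket, x, y)   # ad*(X, Y)
    s2 = ad_star(bracket, y, x)   # ad*(Y, X)
    return tuple(0.5 * bk - 0.5 * (u + v) for bk, u, v in zip(b, s1, s2))

def so3_bracket(x, y):
    """so(3) identified with R^3: the bracket is the cross product."""
    return (x[1] * y[2] - x[2] * y[1],
            x[2] * y[0] - x[0] * y[2],
            x[0] * y[1] - x[1] * y[0])

def aff1_bracket(x, y):
    """Lie algebra of the affine group of the line, with [e1, e2] = e2."""
    return (0.0, x[0] * y[1] - x[1] * y[0])

X3 = (0.3, -1.2, 0.5)
so3_defect = ad_star(so3_bracket, X3, X3)      # ad*(X, X) on so(3)
X2 = (1.0, 1.0)
aff1_defect = ad_star(aff1_bracket, X2, X2)    # ad*(X, X) on aff(1)
print("so(3): ", so3_defect)
print("aff(1):", aff1_defect)
print("psi(e2, e2) on aff(1):", levi_civita_product(aff1_bracket, (0.0, 1.0), (0.0, 1.0)))
```

In this example ad^*(X, X) vanishes on so(3), where the dot product is bi-invariant and ψ reduces to 1/2 [X, Y], while it does not vanish for the chosen metric on the affine group, so that the Levi Civita product acquires a symmetric part.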

9.3 Continuous Methods for Parallel Transport

After having introduced the theoretical bases of affine connection spaces, in this
section we detail the theoretical relationship between parallel transport of tangent
vectors and respectively Cartan-Schouten and Riemannian (Levi Civita) connections.

9.3.1 Cartan-Schouten Parallel Transport

For the left Cartan connection, the unique fields that are covariantly constant are the
left-invariant vector fields, and the parallel transport is induced by the differential of
the left translation [34], i.e. Δ^L : T_p G → T_q G is defined as

$$ \Delta^L(X) = DL_{qp^{-1}} X. \tag{9.3} $$


One can see that the parallel transport is actually independent of the path, which is
due to the fact that the curvature is null: we are in a space with absolute parallelism.
Similarly, the right-invariant vector fields are covariantly constant with respect to
the right-invariant connection only. As above, the parallel transport is given by the
differential of the right translation

$$ \Delta^R(X) = DR_{p^{-1}q} X, \tag{9.4} $$

and we have an absolute parallelism as well.
Finally, the parallel transport for the symmetric Cartan connection is given by
the infinitesimal alternation of the left and right transports. However, as there is
curvature, it depends on the path: it can be shown [18] that the parallel transport of
X along the geodesic exp(tY) is:

$$ \Delta^S(X) = DL_{\exp(\frac{1}{2} Y)} \, DR_{\exp(\frac{1}{2} Y)} X. \tag{9.5} $$
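
For a matrix Lie group these three transports become plain matrix products, which the following Python sketch illustrates on SO(3). It is an example under stated assumptions: tangent vectors are represented as matrices, so that DL_a X = aX and DR_a X = Xa, the starting point is p = id, the end point is q = exp(Y), and the helper names are hypothetical.

```python
def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(x, terms=30):
    """Matrix exponential by truncated Taylor series (adequate for small norms)."""
    n = len(x)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[v / k for v in row] for row in mat_mul(term, x)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result

def hat(w):
    """Skew-symmetric matrix of so(3) associated with the vector w."""
    return [[0.0, -w[2], w[1]], [w[2], 0.0, -w[0]], [-w[1], w[0], 0.0]]

def scaled(x, t):
    return [[t * v for v in row] for row in x]

def max_gap(a, b):
    return max(abs(a[i][j] - b[i][j]) for i in range(3) for j in range(3))

Y = hat((0.0, 0.0, 0.9))   # direction of the group geodesic exp(tY)
X = hat((0.4, 0.0, 0.0))   # transverse tangent vector at the identity
q = mat_exp(Y)             # end point of the geodesic, with p = id
half = mat_exp(scaled(Y, 0.5))

left = mat_mul(q, X)                          # Eq. (9.3) with p = id: DL_q X
right = mat_mul(X, q)                         # Eq. (9.4) with p = id: DR_q X
symmetric = mat_mul(mat_mul(half, X), half)   # Eq. (9.5)

# The geodesic velocity Y is transported identically by the three connections.
velocity_gap = max(max_gap(mat_mul(q, Y), mat_mul(Y, q)),
                   max_gap(mat_mul(q, Y), mat_mul(mat_mul(half, Y), half)))
# A transverse vector is not: the three transports differ.
transverse_gap = max_gap(left, right)
print("velocity gap:", velocity_gap, "transverse gap:", transverse_gap)
```

The first check reflects the fact that the three canonical connections share their geodesics, while the second shows that the transport of a general vector along these geodesics depends on the connection.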

9.3.2 Riemannian Parallel Transport

In the Riemannian setting the parallel transport with the Levi Civita connection can
be computed by solving a system of differential equations which locally depend on the
associated metric (the interested reader can refer to [16] for a more comprehensive
description of the parallel transport in Riemannian geometry). Let x^i be a local
coordinate chart with ∂_i = ∂/∂x^i a local basis of the tangent space. The tangent vector
to the curve δ is δ̇ = Σ_i v^i ∂_i. It can be easily shown that a vector Y = Σ_i y^i ∂_i is
parallel transported along δ with respect to the affine connection ∇ iff

$$ \nabla_{\dot\delta} Y = \sum_k \Big( \sum_{ij} \Phi^k_{ij} v^i y^j + \dot\delta(y^k) \Big) \partial_k = 0, \tag{9.6} $$

with the Christoffel symbols of the connection being defined from the covariant
derivative ∇_{∂_i} ∂_j = Σ_k Φ^k_{ij} ∂_k.
Let us consider the local expression for the metric tensor g_{lk} = <∂/∂x^l, ∂/∂x^k>. Thanks
to the compatibility condition, the Christoffel symbols of the Levi Civita connection
can be locally expressed via the metric tensor, leading to

$$ \Phi^k_{ij} = \frac{1}{2} \sum_l \left( \frac{\partial}{\partial x^i} g_{jl} + \frac{\partial}{\partial x^j} g_{li} - \frac{\partial}{\partial x^l} g_{ij} \right) g^{lk}, $$

with (g^{lk}) = (g_{lk})^{−1}.
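
As a concrete instance of this formula, the sketch below evaluates the Christoffel symbols of the round metric on the unit sphere in coordinates (θ, φ), with g = diag(1, sin²θ). The example is an assumption of this illustration: the choice of metric, the central finite differences used for the derivatives of the metric, and the function names are not taken from the chapter.

```python
import math

def sphere_metric(x):
    """Round metric on the unit sphere in coordinates x = (theta, phi)."""
    return [[1.0, 0.0], [0.0, math.sin(x[0]) ** 2]]

def inverse_2x2(g):
    det = g[0][0] * g[1][1] - g[0][1] * g[1][0]
    return [[g[1][1] / det, -g[0][1] / det], [-g[1][0] / det, g[0][0] / det]]

def metric_derivative(metric, x, i, h=1e-5):
    """Central finite difference of the metric tensor along coordinate i."""
    xp, xm = list(x), list(x)
    xp[i] += h
    xm[i] -= h
    gp, gm = metric(xp), metric(xm)
    n = len(x)
    return [[(gp[a][b] - gm[a][b]) / (2.0 * h) for b in range(n)] for a in range(n)]

def christoffel(metric, x):
    """Return gamma[k][i][j], the symbol with upper index k and lower indices i, j."""
    n = len(x)
    g_inv = inverse_2x2(metric(x))
    dg = [metric_derivative(metric, x, i) for i in range(n)]  # dg[i][a][b] = d_i g_ab
    gamma = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                gamma[k][i][j] = 0.5 * sum(
                    (dg[i][j][l] + dg[j][l][i] - dg[l][i][j]) * g_inv[l][k]
                    for l in range(n))
    return gamma

theta = 1.0
gamma = christoffel(sphere_metric, (theta, 0.3))
print("Phi^theta_{phi phi} =", gamma[0][1][1])
print("Phi^phi_{theta phi} =", gamma[1][0][1])
```

For this metric the only non-zero symbols are known in closed form, namely −sin θ cos θ for the first printed value and cot θ for the second, which gives a simple way to validate the implementation.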


Fig. 9.1 (1) The transport of the vector A along the curve C is performed by the Schild’s ladder
by (2) the construction of geodesic parallelograms in a sufficiently small neighborhood. (3) The
construction is iterated for a sufficient number of neighborhoods

Thus, in the Riemannian setting the covariant derivative is uniquely defined by
the metric, and the parallel transport thus depends on the path δ and on the local
expression of the metric tensor g_ij (formula 9.6).

9.4 Discrete Methods for Parallel Transport

In Sect. 9.3 we showed that the parallel transport closely depends on the underlying
connection, and that thus it assumes very specific formulations depending on the
underlying geometry. In this section we introduce discrete methods for the compu-
tation of the parallel transport which do not explicitly depend on the connection and
only make use of geodesics. Such techniques could be applied more generally when
working on arbitrary geodesic spaces.

9.4.1 Schild’s Ladder

Schild’s ladder is a general method for parallel transport, introduced in the theory
of gravitation in [31] after Schild’s similar constructions [39]. The method
infinitesimally transports a vector along a given curve through the construction of geodesic
parallelograms (Fig. 9.1). Schild’s ladder provides a straightforward method to
compute a second order approximation of the parallel transport of a vector along a
curve using geodesics only.
Let M be a manifold and C a curve parametrized by the parameter β with
$\partial C / \partial\beta \,|_{T_0} = u$, and let $A \in T_{P_0} M$ be a tangent vector on the
curve at the point P0 = C(0). Let P1 be a point on the curve relatively close to P0,
i.e. separated by a sufficiently small parameter value β.
252 M. Lorenzi and X. Pennec

Schild’s ladder computes the parallel transport of A along the curve C as
follows:
1. Define a curve on the manifold parametrized by a parameter γ passing through
the point P0 with tangent vector $\partial / \partial\gamma \,|_{P_0} = A$. Choose a point P2 on
the curve separated from P0 by the value of the parameter γ. The values of the
parameters γ and β should be chosen in order to construct this step of the ladder
within a single coordinate neighborhood.
2. Let l be the geodesic connecting P2 = l(0) and P1 = l(φ); we choose the “middle
point” P3 = l(φ/2). Now, let us define the geodesic r connecting the starting point
P0 and P3, parametrized by ω such that P3 = r(ω). Extending the geodesic to the
parameter 2ω we reach the point P4. We can now compute the geodesic curve
connecting P1 and P4. The vector A′ tangent to this curve at the point P1 is the
parallel translation of A along C.
3. If the distance between the points P0 and P1 is large, the above construction can
be iterated for a sufficient number of steps.
The algorithmic interest of Schild’s ladder is that it only relies on the computation
of geodesics. Although the geodesics on the manifold are not sufficient to
recover all the information about the space properties, such as the torsion of the
connection, it has been shown that Schild’s ladder implements the parallel transport
with respect to the symmetric part of the connection of the space [23]. An intuitive
view of that point is that the construction of the above diagram is commutative and
can be symmetrized with respect to the points P1 and P2. If the original connection is
symmetric, then this procedure provides a correct linear approximation of the parallel
transport of vectors.
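The three steps of Schild’s ladder can be sketched on a space where geodesics are available in closed form. The example below is only an illustration under stated assumptions: it works on the unit sphere in R^3, where the exponential and logarithm maps have explicit expressions, and the names `sphere_exp`, `sphere_log`, `schild_step` and `schild_transport` as well as the step count and the scaling factor `eps` are hypothetical choices, not taken from the chapter. The vector is scaled down before the ladder is applied and scaled back afterwards, which is legitimate because parallel transport is linear.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def sphere_exp(p, v):
    """Geodesic of the unit sphere leaving p with tangent vector v, at time 1."""
    t = norm(v)
    if t < 1e-15:
        return list(p)
    return [math.cos(t) * pk + math.sin(t) * vk / t for pk, vk in zip(p, v)]


def sphere_log(p, q):
    """Tangent vector at p of the minimizing geodesic from p to q."""
    c = dot(p, q)
    w = [qk - c * pk for pk, qk in zip(p, q)]  # part of q orthogonal to p
    s = norm(w)
    if s < 1e-15:
        return [0.0] * len(p)
    theta = math.atan2(s, c)  # geodesic distance between p and q
    return [theta * wk / s for wk in w]


def schild_step(p0, p1, a):
    """One rung: transport the small tangent vector a from p0 to p1."""
    p2 = sphere_exp(p0, a)                                     # step 1
    p3 = sphere_exp(p2, [0.5 * x for x in sphere_log(p2, p1)])  # midpoint of P2 P1
    p4 = sphere_exp(p0, [2.0 * x for x in sphere_log(p0, p3)])  # P0 -> P3 extended
    return sphere_log(p1, p4)                                   # A' at P1


def schild_transport(curve, a, n_steps=100, eps=1e-3):
    """Iterate the ladder along curve(t), t in [0, 1], on a scaled copy of a."""
    small = [eps * x for x in a]
    for k in range(n_steps):
        small = schild_step(curve(k / n_steps), curve((k + 1) / n_steps), small)
    return [x / eps for x in small]


def equator(t):
    """Quarter of the equator, from (1, 0, 0) to (0, 1, 0)."""
    return [math.cos(t * math.pi / 2), math.sin(t * math.pi / 2), 0.0]


print(schild_transport(equator, [0.0, 0.0, 1.0]))
```

A vector pointing to the north pole stays parallel along the equator, so the printed vector should be close to (0, 0, 1), up to the approximation error of the ladder.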

9.4.2 Pole Ladder

We proposed in [27] a different construction for the parallel transport of vectors
based on geodesic parallelograms. If the curve C is geodesic, then it can itself be
one of the diagonals, and Schild’s ladder can therefore be adapted by requiring
the computation of only one new diagonal of the parallelogram. We define in this
way a different ladder scheme, that we name “pole ladder” since its geometrical
construction recalls the type of ladders with alternating or symmetric steps with
respect to a central axis.
We now prove that the pole ladder correctly implements the parallel transport.
In the diagram of Fig. 9.2, the parallel transport of the tangent vector $v = \dot C$ along
the geodesic C is specified by the geodesic equation $\dot v^k + \Gamma_{ij}^k v^i v^j = 0$ using the
Christoffel symbols $\Gamma_{ij}^k(x)$. In a sufficiently small neighborhood the relationships
can be linearized to give

$$v^k(t) = v^k(0) - t\, \Gamma_{ij}^k(x(0))\, v^i(0) v^j(0) + O(t^2),$$

and by integrating:

$$x^k(t) = x^k(0) + t\, v^k(0) - \frac{t^2}{2}\, \Gamma_{ij}^k(x(0))\, v^i(0) v^j(0) + O(t^3).$$

Fig. 9.2 The pole ladder parallel transports the vector A along the geodesic C. Contrary
to Schild’s ladder, it only requires the computation of one diagonal geodesic

By renormalizing the length of the vector v so that C(−1) = P0, C(0) = M and
C(1) = Q0 (and denoting $\Gamma_{ij}^k = \Gamma_{ij}^k(M)$), we obtain the relations:

$$P_0^k = M^k - v_M^k - \frac{1}{2} \Gamma_{ij}^k v_M^i v_M^j + O(\|v\|^3),$$
$$Q_0^k = M^k + v_M^k - \frac{1}{2} \Gamma_{ij}^k v_M^i v_M^j + O(\|v\|^3).$$

Similarly, we have along the second geodesic:

$$P_1^k = M^k - u_M^k - \frac{1}{2} \Gamma_{ij}^k u_M^i u_M^j + O(\|u\|^3),$$
$$Q_1^k = M^k + u_M^k - \frac{1}{2} \Gamma_{ij}^k u_M^i u_M^j + O(\|u\|^3).$$

Now, to compute the geodesics joining P0 to P1 and Q0 to Q1, we have to use a
Taylor expansion of the Christoffel symbols $\Gamma_{ij}^k$ around the point M. In the following,
we indicate the coordinate with respect to which the quantity is differentiated by the index
after a comma, $\Gamma_{ij,a}^k = \partial_a \Gamma_{ij}^k$:

$$\Gamma_{ij}^k(P_0) = \Gamma_{ij}^k + \Gamma_{ij,a}^k \Big( -v_M^a - \frac{1}{2} \Gamma_{mn}^a v_M^m v_M^n \Big) + \frac{1}{2} \Gamma_{ij,ab}^k v_M^a v_M^b + O(\|v\|^3).$$

However, the Christoffel symbols are multiplied by a term of order $O(\|A\|^2)$, so that
only the first term will be quadratic and all others will be of order 3 with respect to
A and $v_M$. Thus, the geodesics joining P0 to P1 and Q0 to Q1 have equations:

$$P_1^k = P_0^k + A^k - \frac{1}{2} \Gamma_{ij}^k A^i A^j + O((\|A\| + \|v_M\|)^3),$$
$$Q_1^k = Q_0^k + B^k - \frac{1}{2} \Gamma_{ij}^k B^i B^j + O((\|B\| + \|v_M\|)^3).$$

Equating $P_1^k$ in the previous equations gives

$$u_M^k + \frac{1}{2} \Gamma_{ij}^k u_M^i u_M^j = v_M^k - A^k + \frac{1}{2} \Gamma_{ij}^k (v_M^i v_M^j + A^i A^j) + O((\|A\| + \|v_M\|)^3).$$

Solving for u as a second order polynomial in $v_M$ and A gives

$$u^k = v_M^k - A^k + \frac{1}{2} (\Gamma_{ij}^k + \Gamma_{ji}^k) A^i v_M^j + O((\|A\| + \|v_M\|)^3).$$

Now equating $Q_1^k$ in the previous equations gives

$$B^k - \frac{1}{2} \Gamma_{ij}^k B^i B^j = -A^k + (\Gamma_{ij}^k + \Gamma_{ji}^k) A^i v_M^j - \frac{1}{2} \Gamma_{ij}^k A^i A^j + O((\|A\| + \|v_M\|)^3).$$

Solving for $B^k$ as a second order polynomial in $v_M$ and A gives:

$$B^k = -A^k + (\Gamma_{ij}^k + \Gamma_{ji}^k) A^i v_M^j + O((\|A\| + \|v_M\|)^3). \qquad (9.7)$$

To verify that this is the correct formula for the parallel transport of A, let us
observe that the field A(x) is parallel in the direction of $v^j$ if $\nabla_V A = 0$, i.e. if
$\partial_v A^k + \Gamma_{ij}^k A^i v^j = 0$, which means that
$A^k(x + \xi v) = A^k - \xi \Gamma_{ij}^k A^i v^j + O(\xi^2)$. If the
connection is symmetric, i.e. if $\Gamma_{ij}^k = \Gamma_{ji}^k$, Eq. (9.7) shows that the pole ladder leads
to $B^k \simeq -A^k + 2 \Gamma_{ij}^k A^i v^j$. Thus the pole ladder realizes the parallel transport for
a length ξ = 2 (remember that our initial geodesic was defined from −1 to 1).
We have thus demonstrated that the vector −B of Fig. 9.2 is a second order
approximation of the transport of A. In order to optimize the number of time steps we should
evaluate the error in Eq. (9.7) at higher orders in $\|A\|$ and $\|v_M\|$. The computation is
not straightforward and involves a large number of terms, which prevents us from
synthesizing a useful result. However, we believe that the dependency on $\|A\|$
is more important than the one on $\|v_M\|$, and that we could obtain larger time steps
provided that $\|A\|$ is sufficiently small.
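The pole ladder construction can be sketched in the same spirit. The example below is a minimal illustration under assumptions: it uses the unit sphere in R^3 with closed-form exponential and logarithm maps, and the names `sphere_exp`, `sphere_log`, `pole_step` and `pole_transport` are hypothetical. Following the notation of the proof, M is the midpoint of the geodesic C from P0 to Q0, P1 is reached from P0 along A, Q1 is obtained by extending the geodesic from P1 through M, and the transported vector is −B, where B is the tangent vector at Q0 pointing to Q1.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def sphere_exp(p, v):
    """Geodesic of the unit sphere leaving p with tangent vector v, at time 1."""
    t = norm(v)
    if t < 1e-15:
        return list(p)
    return [math.cos(t) * pk + math.sin(t) * vk / t for pk, vk in zip(p, v)]


def sphere_log(p, q):
    """Tangent vector at p of the minimizing geodesic from p to q."""
    c = dot(p, q)
    w = [qk - c * pk for pk, qk in zip(p, q)]
    s = norm(w)
    if s < 1e-15:
        return [0.0] * len(p)
    theta = math.atan2(s, c)
    return [theta * wk / s for wk in w]


def pole_step(p0, q0, a):
    """Transport a from p0 to q0 along the geodesic joining them."""
    m = sphere_exp(p0, [0.5 * x for x in sphere_log(p0, q0)])  # midpoint M of C
    p1 = sphere_exp(p0, a)                                     # P1 = exp_P0(A)
    q1 = sphere_exp(m, [-x for x in sphere_log(m, p1)])        # P1 -> M extended to Q1
    b = sphere_log(q0, q1)                                     # B at Q0
    return [-x for x in b]                                     # transported vector is -B


def pole_transport(curve, a, n_steps=1):
    """Apply pole_step on n_steps consecutive arcs of the geodesic curve(t)."""
    for k in range(n_steps):
        a = pole_step(curve(k / n_steps), curve((k + 1) / n_steps), a)
    return a


def equator(t):
    """Quarter of the equator, from (1, 0, 0) to (0, 1, 0)."""
    return [math.cos(t * math.pi / 2), math.sin(t * math.pi / 2), 0.0]


print(pole_transport(equator, [0.0, 0.3, 0.4], n_steps=4))
```

Along this quarter of the equator, the tangent direction (0, 1, 0) at the start is carried to (−1, 0, 0) and the vertical direction is unchanged, so the expected result is close to (−0.3, 0, 0.4).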

9.5 Diffeomorphic Medical Image Registration

We now describe a practical context in which the previous theoretical insights find
useful application. For this purpose, we describe here the link between the theory
described in Sect. 9.2 and the context of computational anatomy, in particular through
the diffeomorphic non-linear registration of time series of images.

9.5.1 Longitudinal and Cross-Sectional Registration Settings

Modeling the temporal evolution of the tissues of the body is an important goal of
medical image analysis for understanding the structural changes of organs affected
by a pathology, or for studying the physiological growth during the life span. For
such purposes we need to analyze and compare the observed anatomical differ-
ences between follow-up sequences of anatomical images of different subjects. Non-
rigid registration is one of the main instruments for modeling anatomical differences
from images. The aim of non-linear registration is to encode the observed structural
changes as deformation fields densely represented in the image space, which repre-
sent the warping required to match the observed differences. This way, the anatomical
changes can be modeled and quantified by analyzing the associated deformations.
We can identify two distinct settings for the application of non-linear registration:
longitudinal and cross-sectional. In the former, non-linear registration estimates the
deformation field which explains the longitudinal anatomical (intra-subject) changes
that usually reflect biological phenomena of interest, like atrophy or growth. In the
latter, the deformation field accounts for the anatomical differences between different
subjects (inter-subject), in order to match homologous anatomical regions. These
two settings are profoundly different: the cross-sectional setting does not involve
any physical or mechanical deformations and we might wish to compare different
anatomies with different topologies. Moreover, inter-subject deformations are often
an order of magnitude larger than the usually subtle variations that characterize
the longitudinal setting.
In case of group-wise analysis of longitudinal deformations, longitudinal and
cross-sectional settings must be integrated in a consistent manner. In fact, the com-
parison of longitudinal deformations is usually performed after normalizing them in
a common reference frame through the inter-subject registration, and the choice of
the normalization method might have a deep impact on the subsequent analysis. In
order to accurately identify longitudinal deformations in a common reference frame
space, a rigorous and reliable normalization procedure thus needs to be defined.
Normalization of longitudinal deformations can be done in different ways,
depending on the analyzed feature. For instance, the scalar Jacobian determinant
of longitudinal deformations represents the associated local volume change, and can
be compared by scalar resampling in a common reference frame via inter-subject
registration. This simple transport of scalar quantities is the basis of the classical
deformation/tensor based morphometry techniques [5, 38]. However, transporting
the Jacobian determinant is not sufficient to reconstruct a deformation in the Template
space.
If we consider vector-valued characteristics of deformations instead of scalar
quantities, the transport is no longer uniquely defined. For instance, a simple method
of transport consists in reorienting the longitudinal intra-subject displacement vector
field by the Jacobian matrix of the subject-to-reference deformation. Another
intuitive method was proposed by [37] and uses the transformation conjugation (change
of coordinate system) in order to compose the longitudinal intra-subject deformation
with the subject-to-reference one. As pointed out in [10], this practice could
potentially introduce variations in the transported deformation and relies on the
inverse consistency of the estimated deformations, which can raise numerical
problems for large deformations.
In the geometric setting, when we have a Riemannian or affine manifold structure
for the space of our deformations, one would like to use a normalization which
is consistent with the manifold structure. This requirement naturally suggests parallel
transport as the tool for normalizing measurements at different points. In
order to elaborate on this idea, we first need to describe the geometric structures
on diffeomorphic deformations.
A first formulation of diffeomorphic registration was proposed with the “Large
Deformation Diffeomorphic Metric Mapping (LDDMM)” setting [8, 44]. In this
framework the images are registered by minimizing the length of the trajectory
of transformations in the space of diffeomorphisms, once a suitable
right invariant metric has been specified. The solution is the endpoint of the flow of a
time-varying velocity field, which is a geodesic parametrized through the Riemannian
exponential. The LDDMM deformations are thus Riemannian (metric) geodesics, which
are also geodesics of the Levi-Civita connection.
Since LDDMM is generally computationally intensive, a different diffeomorphic
registration method was later proposed with the stationary velocity field (SVF) setting
[3]. In this case the diffeomorphisms are parametrized by stationary velocity fields,
as opposed to the time-varying velocity fields of the LDDMM framework, through
the Lie group exponential. The restriction to stationary velocity fields simplifies the
registration problem and provides efficient numerical schemes for the computation
of deformations. This time the flow associated with SVFs is a one-parameter subgroup,
which is a geodesic with respect to the Cartan-Schouten connections. One-parameter
subgroups are generally not metric geodesics, since there does not exist any left and
right invariant metric on non-compact and non-commutative groups.
In both the LDDMM and SVF settings, the longitudinal deformation is encoded
by the initial tangent velocity field. The transport of longitudinal deformations can
then be naturally formulated as the parallel transport of tangent vectors along geodesics
according to the underlying connection, i.e. the Levi-Civita connection in LDDMM,
and the canonical symmetric Cartan-Schouten connection in the SVF setting.

9.5.2 A Glimpse of Lie Group Theory in Infinite Dimension

In Sect. 9.2.2, we derived the equivalence of one-parameter subgroups and the affine
geodesics of the canonical Cartan connections in a finite dimensional Lie group. In
order to use such a framework for diffeomorphisms, we have to generalize the theory
to infinite dimensions. However, defining infinite dimensional Lie groups raises
many more difficulties. This is in fact the reason why Lie himself restricted to finite
dimensions. The theory has been developed since the 1970s and is now an active field of
research. We refer the reader to the recent books [22, 48] for more details on this
theory and to [40] for a good overview of the problems and applications.
The basic construction scheme is to consider an infinite dimensional manifold
endowed with smooth group operations. Such a Lie group is locally diffeomorphic
to an infinite-dimensional vector space which can be a Fréchet space (a locally convex
space which is complete with respect to a translation invariant distance), a Banach
space (where the distance comes from a norm) or a Hilbert space (where the norm is
derived from a scalar product). We talk about Fréchet, Banach or Hilbert Lie groups,
respectively. Extending differential calculus from Rn to Banach and Hilbert spaces
is straightforward, but this is not so simple for Fréchet spaces. In particular, the dual
of a Fréchet space need not be Fréchet, which means that some extra care must be
taken when defining differential forms. Moreover, some important theorems such as
the inverse function theorem hold for Banach spaces but not necessarily for Fréchet
spaces.
For instance, the set $\mathrm{Diff}^k(M)$ of $C^k$ diffeomorphisms of a compact manifold
M is a Banach manifold and the set of Sobolev $H^s$ diffeomorphisms $\mathrm{Diff}^s(M)$ is
a Hilbert manifold (if s > dim M/2). However, these are not classical “Lie groups”
since one loses derivatives when differentiating the composition and inversion maps.
To obtain the complete smoothness of the composition and inversion maps, one has
to go to infinity, but the Banach structure is lost in the process [40, p.12] and we are
left with $\mathrm{Diff}^\infty(M)$ being only a Fréchet Lie group. Some additional structure can be
obtained by considering the sequence of $\mathrm{Diff}^k(M)$ spaces as a succession of dense
inclusions as k goes to infinity: this is the Inverse Limit of Banach (ILB)-Lie group
setting. Likewise, the succession of dense inclusions of Sobolev $H^s$ diffeomorphisms
gives rise to the Inverse Limit of Hilbert (ILH)-Lie group setting.
As the diffeomorphism groups considered are Fréchet but not Banach, the usual
setting of infinite dimensional Lie groups is the general framework of Fréchet
manifolds. This implies that many of the important properties which are true in finite
dimension do not hold any more for general infinite dimensional Lie groups [41].
First, there is no implicit or inverse function theorem (except Nash-Moser type
theorems). This implies for instance that the log-map (the inverse of the exponential
map) may not be smooth even if the differential of the exponential map is the identity.
Second, the exponential map is not in general a diffeomorphism from a
neighborhood of zero in the Lie algebra onto a neighborhood of the identity in the group.
This means that it cannot be used as a local chart to work on the manifold. For
instance in $\mathrm{Diff}^s(M)$, in every neighborhood of the identity there may exist
diffeomorphisms which are not the exponential of an $H^s$ vector field. A classical example
of the non-surjectivity of the exponential map is the following function in Diff(S1)
[30]:

$$f_{n,\xi}(\eta) = \eta + \pi/n + \xi \sin^2(n\eta). \qquad (9.8)$$

This function can be chosen as close as we want to the identity by suitably
choosing ξ and n. However, it can be shown that it cannot be reached by any
one-parameter subgroup, and therefore the Lie group exponential is not a local
diffeomorphism of Diff(S1).
This example is quite instructive and shows that this theoretical problem might
actually be a very practical advantage: the norm of the k-th derivative of $f_{n,\xi}$
explodes when k goes to infinity, which shows that we would rather want to
exclude this type of diffeomorphisms from the group under consideration.

9.5.3 Riemannian Structure and Stationary Velocity Fields

In the Large Deformation Diffeomorphic Metric Mapping (LDDMM) framework
[48], a different construction leads to a more restricted subgroup of
diffeomorphisms which is more rational from the computational point of view. One first
chooses a Hilbert norm on the Lie algebra which turns it into an admissible Hilbert
space. Admissible means that it can be embedded into the space of vector fields
which are bounded and vanishing at infinity, as well as all their first order derivatives.
Typically, this is a Sobolev norm of a sufficiently high order. Then, one restricts to the
subgroup of diffeomorphisms generated by the flow of integrable sequences of such
vector fields for a finite time. To provide this group with a Riemannian structure, a
right invariant metric is chosen. A first reason for choosing right translation is that it
is simply a composition which does not involve a differential operator, as the left
translation does. A second reason is that the resulting metric on the group generates an
invariant metric on the object space with a right action. One can show that the group
provided with this right-invariant metric is a complete metric space [44, 48]: the
choice of the norm on the Lie algebra specifies the subgroup of diffeomorphisms
which are reachable, i.e. which are at a finite distance.
In the SVF setting, the fact that the flow is an autonomous ODE allows us to
generalize efficient algorithms such as the scaling and squaring algorithm: given an
approximation exp(νY) = id + νY for small vector fields νY, the exponential of an
SVF Y can be efficiently and simply computed by recursive compositions:

$$\exp(Y) = \exp\Big(\frac{Y}{2}\Big) \circ \exp\Big(\frac{Y}{2}\Big) = \Big( \exp\Big(\frac{Y}{2^n}\Big) \Big)^{2^n}.$$

A second algorithm is at the heart of the efficiency of the optimization algorithms with
SVFs: the Baker-Campbell-Hausdorff (BCH) formula [9] tells us how to approximate
the log of the composition:

$$\mathrm{BCH}(X, \nu Y) = \log(\exp(X) \circ \exp(\nu Y)) = X + \nu Y + \frac{1}{2} [X, \nu Y] + \frac{1}{12} [X, [X, \nu Y]] + \ldots$$
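The effect of the successive terms of the BCH formula can be checked numerically in finite dimension. The sketch below is only an illustration under assumptions: 2x2 matrices play the role of the vector fields X and νY, the matrix commutator plays the role of the Lie bracket, and the matrix exponential is computed by a truncated Taylor series. All function names are hypothetical.

```python
def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def combine(*terms):
    """Linear combination sum(c * m) of (c, m) pairs of equally sized matrices."""
    n = len(terms[0][1])
    return [[sum(c * m[i][j] for c, m in terms) for j in range(n)]
            for i in range(n)]


def bracket(a, b):
    """Matrix commutator [a, b] = ab - ba."""
    return combine((1.0, mat_mul(a, b)), (-1.0, mat_mul(b, a)))


def mat_exp(a, terms=25):
    """Matrix exponential by truncated Taylor series (adequate for small a)."""
    n = len(a)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    power = [row[:] for row in result]
    for k in range(1, terms):
        power = [[x / k for x in row] for row in mat_mul(power, a)]  # a^k / k!
        result = combine((1.0, result), (1.0, power))
    return result


def max_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


X = [[0.0, 0.2], [0.0, 0.0]]
Z = [[0.0, 0.0], [0.02, 0.0]]  # plays the role of nu * Y, with nu small
target = mat_mul(mat_exp(X), mat_exp(Z))

order1 = combine((1.0, X), (1.0, Z))
order2 = combine((1.0, order1), (0.5, bracket(X, Z)))
order3 = combine((1.0, order2), (1.0 / 12.0, bracket(X, bracket(X, Z))))

# Error of exp(truncated BCH) against exp(X) exp(Z) for each truncation.
errors = [max_diff(mat_exp(w), target) for w in (order1, order2, order3)]
print(errors)
```

On this example each additional bracket term reduces the error by roughly one order of magnitude.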

In order to have a well-posed space of deformations, we need to specify on which
space the Lie algebra is modeled, as previously. This is the role of the regularization
term of the SVF registration algorithms [19, 46] or of the spline parametrization of the
SVF in [6, 32]: this restricts the Lie algebra to the sub-algebra of sufficiently regular
velocity fields. The subgroup of diffeomorphisms considered is then generated by
the flow of these stationary velocity fields and their finite composition. So far, the
theoretical framework is very similar to the LDDMM setting and we can see that
the diffeomorphisms generated by the one-parameter subgroups (the exponential of
SVFs) all belong to the group considered in the LDDMM setting, provided that we
model the Lie algebra on the same admissible Hilbert space. As in finite dimension,
the affine geodesics of the Cartan connections (group geodesics) are metric-free (the
Hilbert metric is only used to specify the space on which the Lie algebra is modeled)
and generally differ from the Riemannian geodesics of LDDMM.
It is well known that the subgroup of diffeomorphisms generated by this Lie
algebra is significantly larger than what is covered by the group exponential. Indeed,
although our affine connection space is geodesically complete (all geodesics can be
continued for all time without hitting a boundary), there is no Hopf-Rinow theorem
stating that any two points can be joined by a geodesic (metric completeness).
Thus, in general, not all the elements of the group G may be reached by the
one-parameter subgroups. An example in finite dimension is given by SL(2).
However, this might not necessarily result in a problem in the image registration
context since we are not interested in recovering “all” the possible diffeomorphisms,
but only those which lead to admissible anatomical transformations. For instance,
the diffeomorphism on the circle defined above in Eq. (9.8) cannot be reached by
any one-parameter subgroup of Diff(S1). However, since

$$\lim_{k \to \infty} \|f_{n,\xi}\|_{H^k} \to \infty,$$

this function is not well behaved from the regularity point of view, which is a critical
feature when dealing with image registration.
In practice, we have a spatial discretization of the SVF (and of the deformations)
on a grid, and a temporal discretization of the time-varying velocity fields by a fixed
number of time steps. This intrinsically limits the frequency of the deformation below
a kind of “Nyquist” threshold, which prevents these diffeomorphisms from being reached
anyway, both by the SVF and by the “discrete” LDDMM frameworks. Therefore,
it seems more important to understand the impact of using stationary velocity
fields in registration from the practical point of view than from the theoretical point
of view, because we will necessarily have to deal with the unavoidable numerical
implementation and the related approximation issues.

9.6 Parallel Transport in Diffeomorphic Registration

Continuous and discrete methods for the parallel transport provided in Sects. 9.3
and 9.4 can be applied in the diffeomorphic registration setting, once the
appropriate geometrical context is provided (Sect. 9.5). In this section we discuss and illustrate
practical implementations of the parallel transport in diffeomorphic registration, with
special focus on the application of the ladder schemes presented in Sect. 9.4.

9.6.1 Continuous Versus Discrete Transport Methods

As illustrated in abstract form in [2], parallel transport can be approximated
infinitesimally by Jacobi fields. Following this intuition, a computational method for the
parallel transport along geodesics of diffeomorphisms provided with a right invariant
metric was proposed in the LDDMM context by [49]. This framework enables the
transport of diffeomorphic deformations of point-supported and image data, and it was
applied to study the hippocampal shape changes in Alzheimer’s disease [35, 36].
Although it represents a rigorous implementation of the parallel transport, it comes
at the price of a computationally intensive scheme. More importantly, it is limited
to the transport along geodesics of the considered right invariant metric, and does
not allow specifying different metrics for longitudinal and inter-subject registrations.
While from the theoretical point of view parallel transporting along a generic curve
can be approximated by the parallel transport on piecewise geodesics, the
effectiveness of the above methods was shown only on LDDMM geodesics, and no general
computational schemes were provided.
The parallel transport in the SVF setting was investigated in [28], in which explicit
formulas for the parallel transport with respect to the standard Cartan-Schouten
connections (left, right and symmetric) in the case of finite dimensional Lie groups
were derived. It was then proposed to seamlessly apply these formulas in the
infinite dimensional case of the diffeomorphic registration of images. Although further
investigations would be needed to better understand the impact of generalizing to
infinite dimensions the concepts defined for Lie group theory in finite dimension,
practical examples of parallel transport of longitudinal diffeomorphisms in synthetic
and real images with respect to the Cartan-Schouten connections proved to be an
effective and simple way to transport tangent SVFs. In particular the parallel
transport of the left, right and symmetric Cartan connections defined in Eqs. (9.3), (9.4),
and (9.5) was directly applied to longitudinal deformations and compared against
the discrete transport implemented with Schild’s ladder. These experiments
highlighted the central role of the numerical implementation on the stability and accuracy of
the methods. For instance, the left and symmetric Cartan transports were much less
stable than the right one because they involve the computation of the Jacobian matrix,
computed here with standard finite differences. More robust numerical schemes to
compute differential operators on discrete image grids are definitely required to
compare them on a fair basis.

Fig. 9.3 Geometrical schemes in Schild’s ladder and in the pole ladder. By using the curve C
as diagonal, the pole ladder requires the computation of half as many geodesics (blue) as
Schild’s ladder (red)

9.6.2 Discrete Ladders: Application to Image Sequences

Let Ii (i = 1 . . . n) be a time series of images with the baseline I0 as reference.
Consider a template image T0; the aim of the procedure is to compute the image Ti
in order to define the transport of the sequence I0, . . . , Ii in the reference of T0. In
the sequel, we focus on the transport of a single image I1.

Algorithm 1: Schild’s ladder for the transport of a longitudinal deformation.

Let I0 and I1 be a series of images, and T0 a reference frame.
1. Compute the geodesic l(φ) in the space I connecting I1 and T0
   such that l(0) = I1 and l(1) = T0.
2. Define the half-space image l(1/2) = I_{1/2}.
3. Compute the geodesic r(ω) connecting I0 and I_{1/2}
   such that r(0) = I0 and r(1) = I_{1/2}.
4. Define the transported follow-up image as T1 = r(2).
5. The transported deformation is given by registering the images T0 and T1.

We assume that a well-posed Riemannian metric is given on the space of images.
This could be $L^2$, $H^k$ or the metric induced on the space of images by the action of
diffeomorphisms of a well chosen right-invariant metric (LDDMM).
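For intuition, the steps of Algorithm 1 can be written out in the simplest possible case. The sketch below assumes the flat L2 metric on images, for which geodesics are straight lines between intensity arrays; this is a deliberate simplification made for illustration and not the LDDMM metric. Images are represented as flat lists of intensities, and the names `geodesic` and `schild_ladder_images` are hypothetical.

```python
def geodesic(a, b, t):
    """Point at time t on the L2 geodesic (straight line) with g(0) = a, g(1) = b."""
    return [x + t * (y - x) for x, y in zip(a, b)]


def schild_ladder_images(i0, i1, t0):
    """Steps 1 to 4 of Algorithm 1 with straight-line geodesics."""
    i_half = geodesic(i1, t0, 0.5)   # steps 1-2: midpoint of the geodesic I1 -> T0
    t1 = geodesic(i0, i_half, 2.0)   # steps 3-4: geodesic I0 -> I_half extended to time 2
    return t1


i0 = [0, 1, 2, 3]
i1 = [0, 2, 2, 5]
t0 = [10, 10, 10, 10]
t1 = schild_ladder_images(i0, i1, t0)
# In this flat setting, "registering T0 and T1" (step 5) reduces to a difference.
print(t1, [b - a for a, b in zip(t0, t1)])
```

In a flat space the ladder is exact, so the change T1 − T0 coincides with the original change I1 − I0.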

Schild’s ladder can be naturally translated to the image context (Algorithm 1), by
requiring the computation of two diagonal geodesics.
The pole ladder is similar to Schild’s ladder, with the difference of explicitly
using as a diagonal the geodesic C which connects I0 and T0 (Algorithm 2). This
is an interesting property since, given C, it requires the computation of only one
additional geodesic, thus the transport of time series of several images is based on
the same baseline-to-reference curve C (Fig. 9.3).

Algorithm 2: Pole ladder for the transport of a longitudinal deformation.

Let I0 and I1 be a series of images, and T0 a reference frame.
1. Compute the geodesic C(μ) in the space I connecting I0 and T0
   such that C(0) = I0 and C(1) = T0.
2. Define the half-space image C(1/2) = I_{1/2}.
3. Compute the geodesic g(η) connecting I1 and I_{1/2}
   such that g(0) = I1 and g(1) = I_{1/2}.
4. Define the transported image as T1′ = g(2).
5. Compute the path p(t) such that p(0) = T0 and p(1) = T1′.
   The transported deformation is the inverse of the registration of p(0) = T0 to p(1) = T1′.
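The steps of Algorithm 2 can be sketched in the same simplified setting. As an assumption made only for illustration, the flat L2 metric is used, so that geodesics between images are straight lines and the “inverse of the registration” in step 5 reduces to a sign change of the difference. The names `geodesic` and `pole_ladder_images` are hypothetical.

```python
def geodesic(a, b, t):
    """Point at time t on the L2 geodesic (straight line) with g(0) = a, g(1) = b."""
    return [x + t * (y - x) for x, y in zip(a, b)]


def pole_ladder_images(i0, i1, t0):
    """Steps 1 to 5 of Algorithm 2 with straight-line geodesics."""
    i_half = geodesic(i0, t0, 0.5)        # steps 1-2: midpoint of C from I0 to T0
    t1_prime = geodesic(i1, i_half, 2.0)  # steps 3-4: geodesic I1 -> I_half extended
    # Step 5: the transported change is the inverse of the path T0 -> T1'.
    transported = [-(b - a) for a, b in zip(t0, t1_prime)]
    return t1_prime, transported


i0 = [0, 1, 2, 3]
i1 = [0, 2, 2, 5]
t0 = [10, 10, 10, 10]
t1_prime, transported = pole_ladder_images(i0, i1, t0)
print(t1_prime, transported)
```

Only one geodesic besides C is computed, and the sign inversion in step 5 recovers the original change I1 − I0 in this flat example.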

9.6.2.1 Lifting the Transport to Diffeomorphisms

Despite the straightforward formulation, Algorithms 1 and 2 require multiple
evaluations of image geodesics, and consequently a high cost in terms of computation
time and resources if we compute them with registration. Moreover, since we look
for regular transformations of the space, the registration is usually constrained to be
smooth and the perfect match of corresponding intensities in the registered images is
not possible. For instance, the definition of I_{1/2} using the forward deformation on I1 or
the backward one from T0 would lead to different results. Since we work in computational
anatomy with deformations, it seems more natural to perform the parallel transport
directly in the group of diffeomorphisms.

9.6.3 Effective Ladders Within the SVF Setting

Given a pair of images Ii, i ∈ {0, 1}, the SVF framework parametrizes the
diffeomorphism τ required to match the reference I0 to the moving image I1 by an SVF u. The
velocity field u is an element of the Lie algebra g of the Lie group of diffeomorphisms
G, i.e. an element of the tangent space at the identity $T_{id} G$. The diffeomorphism τ
belongs to the one parameter subgroup τ = exp(t u) generated by the flow of u.
We can therefore define the paths in the space of the diffeomorphisms from the one
parameter subgroup parametrization l(φ) = exp(φ · u).

Fig. 9.4 Ladder with the one parameter subgroups. The transport exp(Δ(u)) is
the deformation exp(v/2) ∘ exp(u) ∘ exp(−v/2)

Figure 9.4 illustrates how we can take advantage of the stationarity properties of
the one-parameter subgroup in order to define the following robust scheme:

1. Let I1 = exp(u) ∗ I0.
2. Compute v = argmin_{v∈G} E(T0 ∘ exp(−v/2), I0 ∘ exp(v/2)), where E is a generic
registration energy functional to be minimized.
The half space image I_{1/2} can be defined in terms of v/2 as exp(−v/2) ∗ T0 or
exp(v/2) ∗ I0. While from the theoretical point of view the two images are identical,
the choice of one of them, or even their mean, introduces a bias in the construction.
The definition of the half step image can be bypassed by relying on the symmetric
construction of the parallelogram.
3. The transformation from I1 to I_{1/2} is ω = exp(v/2) ∘ exp(−u) and the symmetry
leads to exp(Δ(u)) = exp(v/2) ∘ ω^{−1} = exp(v/2) ∘ exp(u) ∘ exp(−v/2).
The transport of the deformation τ = exp(u) can therefore be obtained through
the conjugate action operated by the deformation parametrized by v/2.
Since the direct computation of the conjugation by composition is potentially
biased by the spatial discretization, we propose a numerical scheme to more robustly
evaluate the transport directly in the Lie algebra.

9.6.3.1 BCH Formula for the Conjugate Action

The Baker-Campbell-Hausdorff (BCH) formula was introduced in SVF diffeomorphic
registration in [9] and provides an explicit way to compose diffeomorphisms
parametrized by SVFs by operating in the associated Lie algebra. More specifically,
if v, u are SVFs, then exp(v) ∘ exp(u) = exp(w) with

$$w = \mathrm{BCH}(v, u) = v + u + \frac{1}{2} [v, u] + \frac{1}{12} [v, [v, u]] - \frac{1}{12} [u, [v, u]] + \ldots$$

In particular, for small u, the computation can be truncated to any order to obtain
a valid approximation for the composition of diffeomorphisms. Applying the truncated
BCH to the conjugate action leads to

$$\Delta_{BCH}(u) \simeq u + [v/2, u] + \frac{1}{2} [v/2, [v/2, u]]. \qquad (9.9)$$
2
To establish this formula, let us consider the following second order truncation of the
BCH formula:

$$\mathrm{BCH}(v/2, u) \simeq v/2 + u + \frac{1}{2} [v/2, u] + \frac{1}{12} [v/2, [v/2, u]] - \frac{1}{12} [u, [v/2, u]].$$

The composition

$$\Delta_{BCH}(u) = \mathrm{BCH}(v/2, \mathrm{BCH}(u, -v/2))$$

is

$$\Delta(u)_v = \underbrace{v/2 + \mathrm{BCH}(u, -v/2)}_{A} + \underbrace{\frac{1}{2} [v/2, \mathrm{BCH}(u, -v/2)]}_{B} + \underbrace{\frac{1}{12} [v/2, [v/2, \mathrm{BCH}(u, -v/2)]]}_{C} \underbrace{- \frac{1}{12} [\mathrm{BCH}(u, -v/2), [v/2, \mathrm{BCH}(u, -v/2)]]}_{D}.$$

The second order truncation of the four terms is:

$$A \simeq u + \frac{1}{2} [u, -v/2] + \frac{1}{12} [u, [u, -v/2]] - \frac{1}{12} [-v/2, [u, -v/2]],$$
$$B \simeq \frac{1}{2} [v/2, u] + \frac{1}{4} [v/2, [u, -v/2]],$$
$$C \simeq \frac{1}{12} [v/2, [v/2, u]],$$
$$D \simeq -\frac{1}{12} [u, [v/2, u]] + \frac{1}{12} [v/2, [v/2, u]].$$

From the additive and anticommutative properties of the Lie bracket, adding the four
terms leads to (9.9).
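Formula (9.9) can be checked numerically in finite dimension. The following sketch relies on assumptions made for illustration only: 2x2 matrices stand in for the SVFs u and v, the matrix commutator stands in for the Lie bracket, and the exponential is a truncated Taylor series. The function names, including `conj_bch`, are hypothetical. The result of the truncated formula is compared with the direct conjugation exp(v/2) exp(u) exp(−v/2).

```python
def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def combine(*terms):
    """Linear combination sum(c * m) of (c, m) pairs of equally sized matrices."""
    n = len(terms[0][1])
    return [[sum(c * m[i][j] for c, m in terms) for j in range(n)]
            for i in range(n)]


def bracket(a, b):
    """Matrix commutator [a, b] = ab - ba."""
    return combine((1.0, mat_mul(a, b)), (-1.0, mat_mul(b, a)))


def mat_exp(a, terms=25):
    """Matrix exponential by truncated Taylor series (adequate for small a)."""
    n = len(a)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    power = [row[:] for row in result]
    for k in range(1, terms):
        power = [[x / k for x in row] for row in mat_mul(power, a)]
        result = combine((1.0, result), (1.0, power))
    return result


def max_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def conj_bch(u, v):
    """Truncated conjugate action of Eq. (9.9): u + [v/2, u] + 1/2 [v/2, [v/2, u]]."""
    half = combine((0.5, v))
    b1 = bracket(half, u)
    b2 = bracket(half, b1)
    return combine((1.0, u), (1.0, b1), (0.5, b2))


u = [[0.2, 0.0], [0.0, -0.2]]
v = [[0.0, 0.2], [-0.2, 0.0]]
half_v = combine((0.5, v))
exact = mat_mul(mat_mul(mat_exp(half_v), mat_exp(u)), mat_exp(combine((-0.5, v))))

err_bch = max_diff(mat_exp(conj_bch(u, v)), exact)   # with Eq. (9.9)
err_none = max_diff(mat_exp(u), exact)               # without any transport
print(err_bch, err_none)
```

On this example the truncated formula reproduces the conjugation far more closely than the untransported u; the remaining error comes from the brackets of order three and higher in v/2 that the truncation drops.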

9.6.3.2 Iterative Computation of the Ladder

Having defined the formula for the computation of the ladder, we need a consistent
scheme for its iterative construction along trajectories. We recall that the transport by
geodesic parallelograms holds only if both sides of the parallelogram are sufficiently
small, which in our case means that both the longitudinal and the inter-subject vectors
must be small. This is not the case in practice, since the inter-subject deformation is
usually very large. By definition, the ladder requires scaling down the vectors to a
sufficiently small neighborhood, in order to correctly approximate the transport by
parallelograms.
From the theoretical point of view, the approximation error of the ladder is
roughly proportional to the curvature of the space of deformations. This can be
seen from the higher-order terms that we dropped in the proof of Sect. 9.4.2, which
are all derivatives of the Christoffel symbols. While on a linear space the ladder is
the exact parallel transport, when working on curved spaces the error resulting from
the non-infinitesimal geodesic parallelogram is proportional to the distance between
the points.
From the numerical point of view, we notice that Formula (9.9) requires the
computation of the Lie brackets of the velocity fields. Lie brackets involve the
differentiation of the vector fields, which is usually computed on images by finite
differences; these are known to be very sensitive to noise and to be unstable in the
case of large deformations.
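
To make explicit where finite differences enter, the following minimal sketch computes
the Lie bracket of two vector fields sampled on a regular 2D grid. The sign convention
[v, u] = Dv·u − Du·v, the array layout (component, row, column) and the function names
are assumptions made for this illustration (the opposite sign convention is also in
use); the derivatives come from NumPy's `np.gradient`.

```python
import numpy as np


def jacobian(field, spacing=1.0):
    """Finite-difference Jacobian of a 2D vector field.

    field has shape (2, H, W); field[c] is component c on a regular grid.
    Returns J of shape (2, 2, H, W) with J[c, d] = d field[c] / d x_d.
    """
    return np.stack(
        [np.stack(np.gradient(field[c], spacing), axis=0) for c in range(2)],
        axis=0,
    )


def lie_bracket(v, u, spacing=1.0):
    """Lie bracket [v, u] = Dv . u - Du . v (assumed sign convention)."""
    jv = jacobian(v, spacing)
    ju = jacobian(u, spacing)
    dv_u = np.einsum("cdij,dij->cij", jv, u)  # (Dv) applied to u, voxel by voxel
    du_v = np.einsum("cdij,dij->cij", ju, v)  # (Du) applied to v, voxel by voxel
    return dv_u - du_v
```

For linear fields v(x) = Ax and u(x) = Bx the finite differences are exact and the
bracket reduces to (AB − BA)x, which gives a simple sanity check. Each nested bracket
differentiates the result again, so noise in the fields is amplified at every level.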
For all these reasons we propose the following iterative scheme based on the
properties of SVFs. To provide a sufficiently small vector for the computation of the
conjugate we observe that

$$\exp(v) \circ \exp(u) \circ \exp(-v)
 = \exp\left(\frac{v}{n}\right) \circ \dots \circ \exp\left(\frac{v}{n}\right) \circ \exp(u)
 \circ \exp\left(-\frac{v}{n}\right) \circ \dots \circ \exp\left(-\frac{v}{n}\right).$$

The conjugation can then be recursively computed in the following way:
1. Scaling step. Find n such that v/n is small.
2. Ladder step. Compute w = u + [v/n, u] + (1/2) [v/n, [v/n, u]].
3. Let u = w.
4. Iterate steps 2 and 3 n times.
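
The four steps above can be sketched on a toy example. The code below is an
illustration under stated assumptions, not the implementation used in the chapter:
square matrices stand in for SVFs, the matrix commutator stands in for the Lie bracket
of vector fields, and the helper names (`bracket`, `expm`, `ladder_step`,
`conjugate_bch`) are hypothetical. In this matrix setting the exact result is known,
since exp(v) exp(u) exp(−v) = exp(exp(v) u exp(−v)), which makes the effect of the
scaling step easy to observe.

```python
import numpy as np


def bracket(a, b):
    """Matrix commutator, used here as a stand-in for the Lie bracket of SVFs."""
    return a @ b - b @ a


def expm(a, terms=30):
    """Matrix exponential by truncated Taylor series (fine for small matrices)."""
    result = np.eye(a.shape[0])
    term = np.eye(a.shape[0])
    for k in range(1, terms):
        term = term @ a / k
        result = result + term
    return result


def ladder_step(u, v_small):
    """Step 2: w = u + [v/n, u] + 1/2 [v/n, [v/n, u]]."""
    first = bracket(v_small, u)
    return u + first + 0.5 * bracket(v_small, first)


def conjugate_bch(u, v, n):
    """Steps 2 to 4: apply the truncated ladder step n times with v/n."""
    w = u.copy()
    for _ in range(n):
        w = ladder_step(w, v / n)  # step 3: the result becomes the new u
    return w
```

With v and u taken as two generators of so(3), comparing `conjugate_bch(u, v, n)` with
`expm(v) @ u @ expm(-v)` shows the error shrinking as n grows, which is the behaviour
the scaling step is meant to exploit.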
The BCH formula allows the transport to be performed directly in the Lie algebra and
avoids multiple exponentiations and interpolations, thus reducing the bias introduced
by the numerical approximations. Moreover, this method preserves the original “ladder”
formulation, operated along the inter-subject geodesic exp(tv). In fact, it iterates
the construction of the ladder along the path exp(tv) over small steps of size exp(v/n).
The stability of the proposed method critically depends on the initial scaling
step n, which determines the step size of the numerical scheme. Ideally the step
size should depend on the curvature, and should therefore be small enough to
minimize the error in the case of a highly curved space. For this purpose, given the
image domain Γ, we define a global scaling factor n that guarantees that the
given SVF stays sufficiently close to 0, i.e. that satisfies the global condition
max_{x∈Γ} ‖v(x)‖/n < ν, with ν = 0.5 × voxel_size. This condition ensures reasonably
small SVFs, and thus enables the iterative construction of the parallelogram in small
neighborhoods.
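
As a small illustration of the scaling condition, the sketch below returns the smallest
integer n for which the maximal vector norm of v/n is strictly below the threshold ν.
The function name, the array layout (one vector per voxel along the last axis) and the
choice of the smallest admissible n are assumptions of this example; the text only
requires that the condition be satisfied.

```python
import numpy as np


def scaling_factor(v, voxel_size=1.0, fraction=0.5):
    """Smallest integer n such that max_x ||v(x)|| / n < fraction * voxel_size.

    v has shape (..., d): one d-dimensional vector per voxel.
    """
    nu = fraction * voxel_size
    max_norm = np.linalg.norm(v, axis=-1).max()
    # The inequality is strict, so take the first integer strictly above max_norm / nu.
    return int(np.floor(max_norm / nu)) + 1
```

The returned n can then be used as the number of ladder steps in the iterative scheme.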

9.6.4 Pole Ladder for Estimating Longitudinal Changes in Alzheimer’s Disease

We provide here an application of the pole ladder to the estimation of a group-wise
model of the longitudinal changes in a group of patients affected by Alzheimer’s
disease (AD). AD is a neurodegenerative pathology of the brain, characterized by
the co-occurrence of different phenomena, starting from the deposition of amyloid
plaques and neurofibrillary tangles, to the development of functional loss and
finally to cell death [20]. In particular, brain atrophy detectable from magnetic
resonance imaging (MRI) is currently considered a potential outcome measure for
monitoring disease progression. Structural atrophy was shown to strongly
correlate with cognitive performance and neuropsychological scores, and characterizes
the progression from pre-clinical to pathological stages [20]. For this reason, the
development of reliable atlases of the pathological longitudinal evolution of the brain
is of paramount importance for improving the understanding of the pathology.
A preliminary approach to the group-wise analysis of longitudinal morphological
changes in AD consists in performing the longitudinal analysis after the
subject-to-template normalization [13, 43]. A key issue here is the different nature
of the changes occurring at the intra-subject level, which reflect the biological
phenomena of interest, and the changes across different subjects, which are usually
large and not related to any biological process. In fact, the inter-subject variability
is an order of magnitude higher than the more subtle longitudinal subject-specific
variations. To provide a more sensitive quantification of the longitudinal dynamics,
the intra-subject changes should be modeled independently from the subject-to-template
normalization, and only afterward transported into the common reference for
statistical analysis. Thus, novel techniques such as the parallel transport of
longitudinal deformations might lead to better accuracy and precision for the modeling
and quantification of longitudinal pathological brain changes.

9.6.4.1 Data Analysis and Results

Images corresponding to the baseline I_0 and the one-year follow-up I_1 scans were
selected for 135 subjects affected by Alzheimer’s disease. For each subject i, the
pair of scans was rigidly aligned. The baseline was linearly registered to a reference
template and the parameters of the transformation were applied to I_1^i. Finally, for
each subject, the longitudinal changes were measured by non-linear registration using
the LCC-Demons algorithm [25].
The resulting deformation fields τi = exp(vi) were transported with the pole
ladder (BCH scheme) into the template reference along the subject-to-template
deformation. The group-wise longitudinal progression was modeled as the mean of the
transported SVFs vi. The areas of significant longitudinal changes were investigated
by a one-sample t-test on the group of log-Jacobian scalar maps corresponding
to the transported deformations, in order to detect the areas of measured
expansion/contraction significantly different from zero.

Fig. 9.5 One-year structural changes for 135 Alzheimer’s patients. a Mean of the longitudinal
SVFs transported in the template space with the pole ladder. We notice the lateral expansion of the
ventricles and the contraction in the temporal areas. b T-statistic for the corresponding log-Jacobian
values significantly different from 0 (p < 0.001 FDR corrected). c T-statistic for longitudinal
log-Jacobian scalar maps resampled from the subject to the template space. Blue color: significant
expansion. Red color: significant contraction. The figure is reproduced from [27]

For the sake of comparison, the one-sample t-statistic was also computed on the
subject-specific longitudinal log-Jacobian scalar maps warped into the template space
along the subject-to-template deformation. This is the classical transport used in
tensor-based morphometry studies [4].
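
The voxel-wise statistic used in both analyses can be sketched in a few lines. This is
a generic one-sample t-statistic against zero, written under the assumption that the
log-Jacobian maps of all subjects are stacked along the first axis of an array; the
function name is hypothetical, and the multiple-comparison correction (FDR) applied in
the study is not covered here.

```python
import numpy as np


def one_sample_t(maps):
    """Voxel-wise one-sample t-statistic of the mean against zero.

    maps has shape (n_subjects, ...): one scalar map (e.g. log-Jacobian) per subject.
    Voxels with zero variance across subjects yield inf or nan.
    """
    n = maps.shape[0]
    mean = maps.mean(axis=0)
    std = maps.std(axis=0, ddof=1)  # unbiased sample standard deviation
    return mean / (std / np.sqrt(n))
```

Positive values then correspond to apparent expansion and negative values to apparent
contraction, with n − 1 degrees of freedom at each voxel.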
Figure 9.5 shows a detail of the mean SVF from the transported one-year
longitudinal trajectories. The field flowing outward from the ventricles indicates a
pronounced enlargement. Moreover, we notice an expansion in the temporal horns
of the ventricles as well as a consistent contracting flow in the temporal areas. The
same effect can be statistically quantified by evaluating the areas where the
log-Jacobian maps are statistically different from zero. The areas of significant
expansion are located around the ventricles and spread in the CSF areas, while a
significant contraction is appreciable in the temporal lobes, hippocampi,
parahippocampal gyrus and in the posterior cingulate. The statistical result is in
agreement with the one provided by the simple scalar interpolation of the longitudinal
subject-specific log-Jacobian maps. In fact, we do not experience any substantial loss
of localization power by transporting SVFs instead of scalar log-Jacobian maps.
However, by parallel transporting we also preserve the multidimensional information of
the SVFs that, as experienced in [26], potentially leads to more powerful
voxel-by-voxel comparisons than the ones obtained with univariate tests on scalars.

Fig. 9.6 Apparent relative volume changes encoded by the average longitudinal trajectory
computed with the pole ladder (Fig. 9.5). The trajectory describes a pattern of apparent volume
gain in the CSF areas, and of apparent volume loss in temporal areas and around the ventricles

Finally, Fig. 9.6 shows that the apparent volume changes associated with the average
trajectory computed with the pole ladder describe biologically plausible dynamics
of longitudinal atrophy. We notice that the estimated one-year longitudinal trajectory
is associated with local volume changes ranging from +12 % for the expansion of the
ventricles, to 5 % for the volume loss of the hippocampi.
We recall that the apparent expansion of the CSF areas is detectable thanks to
the diffeomorphic registration constraint. In fact, since the deformation is spatially
smooth, the apparent volume loss (e.g. brain atrophy) detectable in the image is
modeled as voxel shrinkage, which is associated with the enlargement of the surrounding
areas (e.g. ventricles).

9.7 Conclusions

This chapter illustrates the principles of parallel transport in transformation
groups, with a particular focus on discrete transport methods. The use of discrete
transport is motivated from both the theoretical and the practical points of view. In
fact, discrete methods such as the pole ladder are based only on the computation of
geodesics, and thus do not require explicit knowledge of the connection of the space.
This is a rather interesting characteristic, which makes it possible to employ the
ladder without designing any additional tool beyond geodesics.
From the practical point of view, discrete methods can alleviate the numerical
problems arising from the discretization of continuous functionals on finite grids,
and thus provide feasible and numerically stable alternatives to continuous transport
approaches. The application shown in Sect. 9.6.4 is a promising example of the
potential of such approaches when applied to challenging problems such as the
estimation of longitudinal atlases in diffeomorphic registration.
As shown in Sect. 9.4, the construction of the ladder holds in sufficiently small
neighborhoods. From the practical point of view this is related to the choice of an
appropriate step size for the iterative scheme proposed in Sect. 9.6, and future studies
are required in order to investigate the impact of the step size from the numerical
point of view.
Finally, future studies aimed at directly comparing discrete and continuous
approaches might shed more light on the theoretical and numerical properties of
different methods of transport.

References

1. Ardekani, S., Weiss, R.G., Lardo, A.C., George, R.T., Lima, J.A.C., Wu, K.C., Miller, M.I.,
Winslow, R.L., Younes, L.: Cardiac motion analysis in ischemic and non-ischemic cardiomy-
opathy using parallel transport. In: Proceedings of the Sixth IEEE International Conference on
Symposium on Biomedical Imaging: From Nano to Macro, ISBI’09, pp. 899–902. IEEE Press,
Piscataway (2009)
2. Arnold, V.I.: Mathematical Methods of Classical Mechanics, vol. 60. Springer, New York
(1989)
3. Arsigny, V., Commowick, O., Pennec, X., Ayache, N.: A log-euclidean framework for statistics
on diffeomorphisms. In: Proceedings of Medical Image Computing and Computer-Assisted
Intervention - MICCAI, vol. 9, pp. 924–931. Springer, Heidelberg (2006)
4. Ashburner, J., Ridgway, G.R.: Symmetric diffeomorphic modeling of longitudinal structural
MRI. Front. Neurosci. 6, (2012)
5. Ashburner, J., Friston, K.J.: Voxel-based morphometry—the methods. NeuroImage 11, 805–21
(2000)
6. Ashburner, J.: A fast diffeomorphic image registration algorithm. NeuroImage 38(1), 95–113
(2007)
7. Avants, B., Anderson, C., Grossman, M., Gee, J.: Spatiotemporal normalization for longitudinal
analysis of gray matter atrophy in frontotemporal dementia. In Ayache, N., Ourselin, S., Maeder,
A. (eds.) Medical Image Computing and Computer-Assisted Intervention, MICCAI, pp. 303–
310. Springer, Heidelberg (2007)
8. Beg, M.F., Miller, M.I., Trouvé, A., Younes, L.: Computing large deformation metric mappings
via geodesic flows of diffeomorphisms. Int. J. Comput. Vis. 61(2), 139–157 (2005)
9. Bossa, M., Hernandez, M., Olmos, S.: Contributions to 3d diffeomorphic atlas estimation:
application to brain images. In: Proceedings of Medical Image Computing and Computer-
Assisted Intervention- MICCAI, vol. 10, pp. 667–74 (2007)
10. Bossa, M.N., Zacur, E., Olmos, S.: On changing coordinate systems for longitudinal tensor-
based morphometry. In: Proceedings of Spatio Temporal Image Analysis Workshop (STIA),
(2010)
11. Cartan, E., Schouten, J.A.: On the geometry of the group-manifold of simple and semi-simple
groups. Proc. Akad. Wetensch. (Amsterdam) 29, 803–815 (1926)
12. Charpiat, G.: Learning shape metrics based on deformations and transport. In: Second Work-
shop on Non-Rigid Shape Analysis and Deformable Image Alignment, Kyoto, Japan (2009)
13. Chetelat, G., Landeau, B., Eustache, F., Mezenge, F., Viader, F., de la Sayette, V., Desgranges,
B., Baron, J.-C.: Using voxel-based morphometry to map the structural changes associated
with rapid conversion to mci. NeuroImage 27, 934–46 (2005)
14. Thompson, D.W.: On growth and form by D’Arcy Wentworth Thompson. University Press,
Cambridge (1945)
15. Davis, B.C., Fletcher, P.T., Bullit, E., Joshi, S.: Population shape regression from random design
data. In: ICCV vol.4, pp. 375–405 (2007)
16. do Carmo, M.P.: Riemannian Geometry. Mathematics. Birkhäuser, Boston, Basel, Berlin (1992)
17. Durrleman, S., Pennec, X., Trouvé, A., Gerig, G., Ayache, N.: Spatiotemporal atlas estimation
for developmental delay detection in longitudinal datasets. In: Medical Image Computing and
Computer-Assisted Intervention—MICCAI, vol. 12, pp. 297–304 (2009)
18. Helgason, S.: Differential Geometry, Lie groups, and Symmetric Spaces. Academic Press, New
York (1978)
19. Hernandez, M., Bossa, M., Olmos, S.: Registration of anatomical images using paths of diffeo-
morphisms parameterized with stationary vector field flows. Int. J. Comput. Vis. 85, 291–306
(2009)
20. Jack, C.R., Knopman, D.S., Jagust, W.J., Shaw, L.M., Aisen, P.S., Weiner, M.W., Petersen, R.C.,
Trojanowski, J.Q.: Hypothetical model of dynamic biomarkers of the alzheimer’s pathological
cascade. Lancet Neurol. 9, 119–28 (2010)
21. Joshi, S., Miller, M.I.: Landmark matching via large deformation diffeomorphisms. IEEE Trans.
Image Process. 9(8), 1357–70 (2000)
22. Khesin, B.A., Wendt, R.: The Geometry of Infinite Dimensional Lie groups, volume 51 of
Ergebnisse der Mathematik und ihrer Grenzgebiete. 3. Folge / A Series of Modern Surveys in
Mathematics. Springer (2009)
23. Kheyfets, A., Miller, W., Newton, G.: Schild’s ladder parallel transport for an arbitrary con-
nection. Int. J. Theoret. Phys. 39(12), 41–56 (2000)
24. Kolev, B.: Groupes de Lie et mécanique. http://www.cmi.univ-mrs.fr/kolev/. Notes of a Master
course in 2006–2007 at Université de Provence (2007)
25. Lorenzi, M., Ayache, N., Frisoni, G.B., Pennec, X.: Lcc-demons: a robust and accurate sym-
metric diffeomorphic registration algorithm. NeuroImage 1(81), 470–83 (2013)
26. Lorenzi, M., Ayache, N., Frisoni, G.B., Pennec, X.: Mapping the effects of Aε1−42 levels on the
longitudinal changes in healthy aging: hierarchical modeling based on stationary velocity fields.
In: Medical Image Computing and Computer-Assisted Intervention—MICCAI, pp. 663–670,
(2011)
27. Lorenzi, M., Pennec, X.: Efficient parallel transport of deformations in time series of images:
from Schild’s to pole ladder. J. Math. Imaging Vis. (2013) (Published online)
28. Lorenzi, M., Pennec, X.: Geodesics, parallel transport and one-parameter subgroups for dif-
feomorphic image registration. Int. J. Comput. Vis.—IJCV 105(2), 111–127 (2012)
29. Lorenzi, M., Ayache, N., Pennec, X.: Schild’s ladder for the parallel transport of deformations
in time series of images. Inf. Process. Med. Imaging—IPMI 22, 463–74 (2011)
30. Milnor, J.: Remarks on infinite-dimensional Lie groups. In: Relativity, Groups and Topology,
pp. 1009–1057. Elsevier Science Publishers, Les Houches (1984)
31. Misner, C.W., Thorne, K.S., Wheeler, J.A.: Gravitation. W.H. Freeman and Company, San
Francisco, California (1973)
32. Modat, M., Ridgway, G.R., Daga, P., Cardoso, M.J., Hawkes, D.J., Ashburner, J., Ourselin, S.:
Log-Euclidean free-form deformation. In: Proceedings of SPIE Medical Imaging 2011. SPIE,
(2011)
33. Pennec, X., Arsigny, V.: Exponential barycenters of the canonical cartan connection and invari-
ant means on Lie groups. In: Barbaresco, F., Mishra, A., Nielsen, F. (eds.) Matrix Information
Geometry. Springer, Heidelberg (2012)
34. Postnikov, M.M.: Geometry VI: Riemannian Geometry. Encyclopedia of mathematical science.
Springer, Berlin (2001)
35. Qiu, A., Younes, L., Miller, M., Csernansky, J.G.: Parallel transport in diffeomorphisms dis-
tinguish the time-dependent pattern of hippocampal surface deformation due to healthy aging
and dementia of the Alzheimer’s type. NeuroImage, 40(1):68–76 (2008)
36. Qiu, A., Albert, M., Younes, L., Miller, M.: Time sequence diffeomorphic metric mapping and
parallel transport track time-dependent shape changes. NeuroImage 45(1), S51–60 (2009)
37. Rao, A., Chandrashekara, R., Sanchez-Ortiz, G., Mohiaddin, R., Aljabar, P., Hajnal, J., Puri, B.,
Rueckert, D.: Spatial transformation of motion and deformation fields using nonrigid registration.
IEEE Trans. Med. Imaging 23(9), 1065–76 (2004)
38. Riddle, W.R., Li, R., Fitzpatrick, J.M., DonLevy, S.C., Dawant, B.M., Price, R.R.: Character-
izing changes in mr images with color-coded jacobians. Magn. Reson. Imaging 22(6), 769–77
(2004)
39. Schild, A.: Tearing geometry to pieces: More on conformal geometry. unpublished lecture at
Jan 19 1970 Princeton University relativity seminar (1970)
40. Schmid, R.: Infinite dimensional lie groups with applications to mathematical physics. J. Geom.
Symmetry Phys. 1, 1–67 (2004)
41. Schmid, R.: Infinite-dimensional lie groups and algebras in mathematical physics. Adv. Math.
Phys. 2010, 1–36 (2010)
42. Subbarao, R.: Robust Statistics Over Riemannian Manifolds for Computer Vision. Graduate
School New Brunswick, Rutgers The State University of New Jersey, New Brunswick, (2008)
43. Thompson, P., Hayashi, K.M., Zubicaray, G., Janke, A.L., Rose, S.E., Semple, J., Herman,
D., Hong, M.S., Dittmer, S.S., Doddrell, D.M., Toga, A.W.: Dynamics of gray matter loss in
Alzheimer’s disease. J. Neurosci. 23(3), 994–1005 (2003)
44. Trouvé, A.: Diffeomorphisms groups and pattern matching in image analysis. Int. J. Comput.
Vis. 28(3), 213–21 (1998)
45. Twining, C., Marsland, S., Taylor, C.: Metrics, connections, and correspondence: the setting
for groupwise shape analysis. In: Proceedings of the 8th International Conference on Energy
Minimization Methods in Computer Vision and Pattern Recognition, EMMCVPR’11, pp. 399–
412. Springer, Berlin, Heidelberg (2011)
46. Vercauteren, T., Pennec, X., Perchant, A., Ayache, N.: Symmetric log-domain diffeomorphic
registration: a demons-based approach. In: Medical Image Computing and Computer-Assisted
Intervention—MICCAI. Lecture Notes in Computer Science, vol. 5241, pp. 754–761. Springer,
Heidelberg (2008)
47. Wei, D., Lin, D., Fisher, J.: Learning deformations with parallel transport. In: ECCV, pp. 287–
300 (2012)
48. Younes, L.: Shapes and diffeomorphisms. Number 171 in Applied Mathematical Sciences.
Springer, Berlin (2010)
49. Younes L.: Jacobi fields in groups of diffeomorphisms and applications. Q. Appl. Math. pp.
113–134 (2007)
Chapter 10
Diffeomorphic Iterative Centroid Methods for Template Estimation on Large Datasets

Claire Cury, Joan Alexis Glaunès and Olivier Colliot

Abstract A common approach for the analysis of anatomical variability relies on the
estimation of a template representative of the population. The Large Deformation
Diffeomorphic Metric Mapping (LDDMM) is an attractive framework for that purpose.
However, template estimation using LDDMM is computationally expensive, which is a
limitation for the study of large datasets. This chapter presents an iterative method
which quickly provides a centroid of the population in the shape space. This centroid
can be used as a rough template estimate or as the initialization of a template
estimation method. The approach is evaluated on datasets of real and synthetic
hippocampi segmented from brain MRI. The results show that the centroid is correctly
centered within the population and is stable for different orderings of subjects. When
used as an initialization, the approach substantially reduces the computation time
of template estimation.

C. Cury (B) · O. Colliot


UPMC Univ Paris 06, UM 75, Sorbonne Universités, ICM, F-75013 Paris, France
e-mail: [email protected]
C. Cury · O. Colliot
UMR 7225, CNRS, ICM, F-75013, Paris, France
C. Cury · O. Colliot
U1127, Inserm, ICM, F-75013, Paris, France
C. Cury · O. Colliot
ICM, Institut du Cerveau et de la Moëlle épinière, 47 bd de l’hôpital, Paris, France
C. Cury · O. Colliot
Aramis project-team, Inria Paris-Rocquencourt, Paris, France
J. A. Glaunès
MAP5, Université Paris Descartes, Sorbonne Paris Cité, France

F. Nielsen (ed.), Geometric Theory of Information,
Signals and Communication Technology, DOI: 10.1007/978-3-319-05317-2_10,
© Springer International Publishing Switzerland 2014
274 C. Cury et al.

10.1 Introduction

Large imaging datasets are being increasingly used in neuroscience, thanks to
the wider availability of neuroimaging facilities, the development of computing
infrastructures and the emergence of large-scale multi-center studies. Such
large-scale datasets offer increased statistical power, which is crucial for addressing
questions such as the relationship between anatomy and genetics or the discovery of new
biomarkers using machine learning techniques, for instance.
Computational anatomy aims at developing tools for the quantitative analysis of the
variability of anatomical structures, its variation in healthy and pathological cases
and the relations between functions and structures [1]. A common approach in
computational anatomy is template-based analysis, where the idea is to compare
anatomical objects by analyzing their variations relative to a common template. These
variations are analyzed using the ambient space deformations that match each individual
structure to the template.
A common requirement is that transformations must be diffeomorphic in order to
preserve the topology and to consistently transform coordinates. The Large Deformation
Diffeomorphic Metric Mapping (LDDMM) framework [2, 3] has been widely
used for the study of the geometric variation of human anatomy, of intra-population
variability and of inter-population differences. It focuses the study on the spatial
transformations which can match subjects’ anatomies one to another, or one to a
template structure which needs to be estimated. These transformations not only provide
a diffeomorphic correspondence between shapes, but also define a metric distance in
shape space.
Several methods have been proposed to estimate templates in the LDDMM framework
[4–8]. Vaillant et al. proposed in 2004 [4] a method based on geodesic shooting
which iteratively updates a shape by shooting towards the mean of the directions of
the deformations from this shape to all shapes of the population. The method proposed
by Glaunès and Joshi [5] starts from the whole population and estimates a template
by co-registering all subjects. The method uses a backward scheme: deformations
are defined from the subjects to the template. The method optimizes at the same
time the deformations between the subjects and the template, and the template itself.
The template is composed, in the space of currents (more details in Sect. 10.2.2),
of all surfaces of the population. A different approach was proposed by Durrleman
et al. [6, 7]. The method initializes the template with a standard shape, which in
practice is often an ellipsoid. The method uses a forward scheme: deformations are
defined from the template to the subjects. Again, it optimizes at the same time the
deformations and the template. The template is composed of one surface which has
the same configuration as the initial ellipsoid. The method presented by Ma et al. [8]
introduces a hyper template, which is an extra fixed shape (which can be a subject of
the population). The method aims at optimizing at the same time the deformations from
the hyper template to the template and the deformations from the template to the
subjects of the population. The template is optimized via the deformation of the hyper
template, not directly.
10 Diffeomorphic Iterative Centroid Methods 275

A common point of all these methods is that they need a surface matching
algorithm, which is very expensive in terms of computation time in the LDDMM
framework. When no specific optimization is used, computing a single matching
between two surfaces, each composed of 3000 vertices, takes approximately
30–40 min. Computing a template from one hundred such surfaces
until convergence can then take from a few days to several weeks. This is a limitation
for the study of large databases. Different strategies can be used to reduce computation
time. GPU implementation can substantially speed up the computation of convolutions
that are heavily used in LDDMM deformations. Matching pursuit on currents can also be
used to reduce the computation time [9]. Sparse representations of deformations
make it possible to reduce the number of optimized parameters of the deformations [7].
Here, we propose a new approach to reduce the computation time, called the
diffeomorphic iterative centroid using currents. The method provides in N − 1 steps
(with N the number of shapes of the population) a centroid already correctly centered
within the population of shapes. It increases the convergence speed of the template
estimation by providing an initialization that is closer to the target.
Our method has some close connections with more general iterative methods to
compute means on Riemannian manifolds. For example Arnaudon et al. [10] defined
a stochastic iterative method which converges to the Fréchet mean of the set of points.
Ando et al. [11] gave a recursive definition of the mean of positive definite matrices
which verifies important properties of geometric means. However these methods
require a large number of iterations (much larger than the number of points of the
dataset), while in our case, due to the high computational cost of matchings, we aim
at limiting as much as possible the number of iterations.
The chapter is organized as follows. First, we present the mathematical framework
of LDDMM and currents (Sect. 10.2). Section 10.3.1 then introduces the template
estimation and the iterative centroid method. In Sect. 10.4.1, we evaluate the approach
on datasets of real and synthetic hippocampi extracted from brain magnetic resonance
images (MRI).

10.2 Notation and Mathematical Setup

For the Diffeomorphic Iterative Centroid method, we use the LDDMM frame-
work (Sect. 10.2.1) to quantify the difference between shapes. To model surfaces
of the population, we use the framework of currents (Sect. 10.2.2) which does not
assume point-to-point correspondences.

10.2.1 Large Diffeomorphic Deformations

The Large Deformation Diffeomorphic Metric Mapping framework allows quantifying
the difference between shapes and provides a shape space representation: shapes
of the population are seen as points in an infinite dimensional smooth manifold,
providing a continuum between shapes of the population. In this framework a
diffeomorphism deforms the whole space, not only a shape.
Diffeomorphisms as flows of vector fields In the LDDMM framework, deformation
maps ϕ : R^3 → R^3 are generated by integration of time-dependent vector fields
v(x, t) in a Hilbert space V, with x ∈ R^3 and t ∈ [0, 1]. If v(x, t) is regular
enough, i.e. if we consider the vector fields (v(·, t))_{t∈[0,1]} in L^2([0, 1], V), where
V is a Reproducing Kernel Hilbert Space (R.K.H.S.) embedded in the space of
C^1(R^3, R^3) vector fields vanishing at infinity, then the transport equation:

$$\begin{cases}
\dfrac{d\phi_v}{dt}(x, t) = v(\phi_v(x, t), t) & \forall t \in [0, 1] \\[2mm]
\phi_v(x, 0) = x & \forall x \in \mathbb{R}^3
\end{cases} \tag{10.1}$$

has a unique solution, and one sets ϕ_v = φ_v(·, 1) the diffeomorphism induced by
v(x, t). The induced set of diffeomorphisms A_V is a subgroup of the group of C^1
diffeomorphisms. To enforce velocity fields to stay in this space, one must control
the energy

$$E(v) := \int_0^1 \|v(\cdot, t)\|_V^2 \, dt. \tag{10.2}$$
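
The transport equation (10.1) can be illustrated numerically on a set of points. The
sketch below integrates the flow with a classical fourth-order Runge-Kutta scheme; the
integrator, the function name `integrate_flow` and the callable interface `v(x, t)` are
assumptions made for this example and are not prescribed by the LDDMM framework.

```python
import numpy as np


def integrate_flow(v, x0, n_steps=100):
    """Integrate d(phi)/dt = v(phi, t), phi(0) = x0, over t in [0, 1] (RK4).

    v  : callable (points, t) -> velocities, both of shape (m, d)
    x0 : array of shape (m, d), the initial positions
    Returns phi(x0, 1), i.e. the images of the points under the deformation.
    """
    x = np.asarray(x0, dtype=float).copy()
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        k1 = v(x, t)
        k2 = v(x + 0.5 * dt * k1, t + 0.5 * dt)
        k3 = v(x + 0.5 * dt * k2, t + 0.5 * dt)
        k4 = v(x + dt * k3, t + dt)
        x = x + dt / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
    return x
```

A convenient toy check is the stationary rotational field v(x, t) = Ax with A
skew-symmetric, whose flow at time 1 is a rotation. Such a field does not vanish at
infinity, so it only serves to test the integrator, not as an admissible element of V.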

Metric structure on the diffeomorphisms group The induced subgroup of
diffeomorphisms A_V is equipped with a right-invariant metric defined by the rules:
∀ϕ, ψ ∈ A_V,

$$\begin{cases}
D(\varphi, \psi) = D(Id, \varphi \circ \psi^{-1}) \\[1mm]
D(Id, \varphi) = \inf\left\{ \int_0^1 \|v(\cdot, t)\|_V \, dt \; ; \; v \in L^2([0, 1], V),\ \varphi_v = \varphi \right\}
\end{cases} \tag{10.3}$$

D(ϕ, ψ) represents the shortest length of paths connecting ϕ to ψ in the
diffeomorphisms group. Moreover, as in the classical Riemannian theory, minimizing the
length of paths is equivalent to minimizing their energy, and one also has:

$$D(Id, \varphi) = \sqrt{\inf\left\{ E(v) \; ; \; v \in L^2([0, 1], V),\ \varphi_v = \varphi \right\}} \tag{10.4}$$
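
The link between length and energy rests on a standard argument, recalled here as a
reminder rather than as part of the original derivation. By the Cauchy-Schwarz
inequality applied on [0, 1]:

```latex
\begin{align*}
\left( \int_0^1 \|v(\cdot,t)\|_V \, dt \right)^2
  \;\le\; \left( \int_0^1 1^2 \, dt \right) \left( \int_0^1 \|v(\cdot,t)\|_V^2 \, dt \right)
  \;=\; E(v),
\end{align*}
```

with equality if and only if the speed ‖v(·, t)‖_V is constant in t. Since a path can be
reparametrized in time to have constant speed without changing its end point or its
length, the infimum of the energy equals the square of the infimum of the length.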

Discrete matching functionals Considering two surfaces S and T , the optimal match-
ing between them is defined in an ideal setting, as the map ϕv minimizing E(v) under
the constraint ϕv (S) = T . In practice such an exact matching is often not feasible and
one writes inexact unconstrained matching functionals which minimize both E(v)
and a matching criterion which evaluates the spatial proximity between ϕv (S) and
T , as we will see in the next section.
In a discrete setting, when the matching criterion depends on ϕ_v only via the
images ϕ_v(x_i) of a finite number of points x_i (such as the vertices of the mesh S),
one can show that the vector fields v(x, t) which induce the optimal deformation map
can be written via a convolution formula over the surface involving the reproducing
kernel K_V of the R.K.H.S. V. This is due to the reproducing property of V; indeed
V is the closed span of vector fields of the form K_V(x, ·)α, and therefore v(x, t)
writes

$$v(x, t) = \sum_{i=1}^{n} K_V(x, x_i(t)) \, \alpha_i(t), \tag{10.5}$$

where x_i(t) = φ_v(x_i, t) are the trajectories of the points x_i, and α_i(t) ∈ R^3 are
time-dependent vectors called momentum vectors, which completely parametrize the
deformation. Trajectories x_i(t) depend only on these vectors as solutions of the
following system of ordinary differential equations:

$$\frac{dx_j(t)}{dt} = \sum_{i=1}^{n} K_V(x_j(t), x_i(t)) \, \alpha_i(t), \tag{10.6}$$

for 1 ≤ j ≤ n. This is obtained by plugging formula 10.5 for the optimal velocity
fields into the flow equation 10.1 taken at x = x_j. Moreover, the energy E(v) takes
an explicit form when expressed in terms of trajectories and momentum vectors:

$$E(v) = \int_0^1 \sum_{i,j=1}^{n} \alpha_i(t)^T K_V(x_i(t), x_j(t)) \, \alpha_j(t) \, dt. \tag{10.7}$$

These equations reformulate the problem in a finite dimensional Riemannian
setting. Indeed, E(v) appears as the energy of the path t ↦ (x_i(t))_{1≤i≤n} in the space
of landmarks L_n = {x = (x_i)_{1≤i≤n}, x_i ≠ x_j ∀ i ≠ j} equipped with the local metric
g(x) = K_V(x)^{-1}, where K_V(x) is the 3n × 3n matrix with block entries K_V(x_i, x_j),
1 ≤ i, j ≤ n.
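
The discrete formulas can be made concrete with a small sketch. The text does not fix
the kernel K_V; the code below assumes a scalar Gaussian kernel
K_V(x, y) = exp(−|x − y|²/σ²) Id, a common choice, and the function names are
hypothetical. It evaluates the velocity of Eq. 10.5 at arbitrary points and the
integrand of the energy in Eq. 10.7 at a fixed time.

```python
import numpy as np


def gaussian_kernel(x, y, sigma=1.0):
    """Scalar kernel matrix k[a, b] = exp(-|x_a - y_b|^2 / sigma^2) (assumed kernel)."""
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / sigma ** 2)


def velocity(points, landmarks, momenta, sigma=1.0):
    """Eq. 10.5: v(x) = sum_i K_V(x, x_i) alpha_i, evaluated at each row of points."""
    return gaussian_kernel(points, landmarks, sigma) @ momenta


def energy_density(landmarks, momenta, sigma=1.0):
    """Integrand of Eq. 10.7: sum_{i,j} alpha_i^T K_V(x_i, x_j) alpha_j."""
    k = gaussian_kernel(landmarks, landmarks, sigma)
    return float(np.einsum("id,ij,jd->", momenta, k, momenta))
```

Calling `velocity(landmarks, landmarks, momenta)` gives the right-hand side of Eq. 10.6,
i.e. the velocities of the landmarks themselves.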

Geodesic equations and local encoding As introduced previously, the minimization
of the energy E(v) in matching problems can be interpreted as the estimation of a
length-minimizing path in the group of diffeomorphisms A_V, and additionally as
a length-minimizing path in the space of landmarks when considering discrete
problems. Such length-minimizing paths obey geodesic equations [4] (distances
are defined as in 10.3), which read as follows in the case of landmarks (using matrix
notation):

$$\begin{cases}
\dfrac{dx(t)}{dt} = K_V(x(t)) \, \alpha(t) \\[2mm]
\dfrac{d\alpha(t)}{dt} = -\dfrac{1}{2} \nabla_{x(t)} \left[ \alpha(t)^T K_V(x(t)) \, \alpha(t) \right]
\end{cases} \tag{10.8}$$

Note that the first equation is nothing more than Eq. 10.6, which allows one to
compute the trajectories $x_i(t)$ from any time-dependent momentum vectors $\alpha_i(t)$, while the
second equation gives the evolution of the momentum vectors themselves. This new
set of ODEs can be solved from any initial conditions $x_i(0)$, $\alpha_i(0)$, which means
that the initial momenta $\alpha_i(0)$ fully determine the subsequent time evolution of
the system (since the $x_i(0)$ are fixed points). As a consequence, these initial momentum
vectors encode all the information of the optimal diffeomorphism. This is a very
important point for applications, specifically for group studies, since it allows one to
analyse the set of deformation maps from a given template to the observed shapes by
performing statistics on the initial momentum vectors located on the template shape.
We can also use geodesic shooting from initial conditions $(x_i(0), \alpha_i(0))$ in order to
generate any arbitrary deformation of a shape in the shape space. We will use this
tool for the construction of our synthetic dataset Data1 (see Sect. 10.4.1).
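Geodesic shooting, i.e. integrating Eq. 10.8 from initial positions and momenta, can be sketched as follows. This is a hedged illustration rather than the scheme used for the experiments: it assumes a scalar Gaussian kernel $K_V(x, y) = \exp(-\|x - y\|^2/\sigma^2)\, I_3$ and plain explicit Euler steps, and the name `shoot` is hypothetical. For this kernel the second line of Eq. 10.8 becomes $d\alpha_k/dt = (2/\sigma^2) \sum_j (\alpha_k \cdot \alpha_j)\, k(x_k, x_j)\, (x_k - x_j)$.

```python
import numpy as np

def shoot(x0, a0, sigma, n_steps=100):
    """Integrate Eq. 10.8 from t = 0 to t = 1 with explicit Euler steps.

    x0, a0: (n, 3) arrays of initial landmark positions and momentum vectors.
    Assumes the scalar Gaussian kernel k(x, y) = exp(-|x - y|^2 / sigma^2).
    """
    x, a = x0.copy(), a0.copy()
    dt = 1.0 / n_steps
    for _ in range(n_steps):
        diff = x[:, None, :] - x[None, :, :]           # diff[k, j] = x_k - x_j
        k = np.exp(-(diff**2).sum(axis=-1) / sigma**2)  # k[k, j] = k(x_k, x_j)
        dx = k @ a                                      # dx/dt = K_V(x) alpha
        w = k * (a @ a.T)                               # (alpha_k . alpha_j) k(x_k, x_j)
        da = (2.0 / sigma**2) * (w[:, :, None] * diff).sum(axis=1)
        x, a = x + dt * dx, a + dt * da
    return x, a
```

With a single landmark the momentum stays constant and the point simply moves by $\alpha(0)$; with several landmarks the total momentum $\sum_k \alpha_k$ is preserved by each step, which gives a cheap sanity check.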

10.2.2 Currents

The idea of the mathematical object named “currents” is related to the theory of
distributions as presented by Schwartz in 1952 [12], in which distributions are char-
acterized by their action on any smooth functions with compact support. In 1955, De
Rham [13] generalized distributions to differential forms to represent submanifolds,
and called this representation currents. This mathematical object serves to model
geometrical objects using a non parametric representation.
The use of currents in computational anatomy was introduced by J. Glaunès and
M. Vaillant in 2005 [14, 15] and subsequently developed by Durrleman [16] in order
to provide a dissimilarity measure between meshes which does not assume point-
to-point correspondence between anatomical structures. The approach proposed by
Vaillant and Glaunès is to represent meshes as objects in a linear space and supply
it with a computable norm. Using currents to represent surfaces has some benefits.
First it avoids the point correspondence issue: one does not need to define pairs
of corresponding points between two surfaces to evaluate their spatial proximity.
Moreover, metrics on currents are robust to different samplings and topologies and
take into account not only the global shapes but also their local orientations. Another
important benefit is that the space of currents is a vector space, which allows one to
consider linear combinations such as means of shapes in the space of currents. This
property will be used in the centroid and template methods that we introduce in the
following.
We limit the framework to surfaces embedded in $\mathbb{R}^3$. Let S be an oriented compact
surface, possibly with boundary. Any smooth and compactly supported differential
2-form ω of $\mathbb{R}^3$ (i.e. a mapping $x \mapsto \omega(x)$ such that for any $x \in \mathbb{R}^3$, ω(x) is a
2-form, an alternating bilinear mapping from $\mathbb{R}^3 \times \mathbb{R}^3$ to $\mathbb{R}$) can be integrated over S:

$$\int_S \omega = \int_S \omega(x)(u_1(x), u_2(x))\, d\sigma(x), \tag{10.9}$$

where $(u_1(x), u_2(x))$ is an orthonormal basis of the tangent plane at point x, and dσ
the Lebesgue measure on the surface S. Hence one can define a linear form [S] over
the space of 2-forms via the rule $[S](\omega) := \int_S \omega$. If one defines a Hilbert metric on
the space of 2-forms such that the corresponding space is continuously embedded
in the space of continuous bounded 2-forms, this mapping will be continuous [14],
which will make [S] an element of the space of 2-currents, the dual space to the space
of 2-forms.
Note that since we are working with 2-forms on $\mathbb{R}^3$, we can use a vectorial
representation via the cross product: for every 2-form ω and $x \in \mathbb{R}^3$ there exists a
vector $\bar{\omega}(x) \in \mathbb{R}^3$ such that for every $\alpha, \beta \in \mathbb{R}^3$,

$$\omega(x)(\alpha, \beta) = \langle \bar{\omega}(x), \alpha \times \beta \rangle = \det(\alpha, \beta, \bar{\omega}(x)). \tag{10.10}$$

Therefore we can work with vector fields $\bar{\omega}$ instead of 2-forms ω. In the following,
with a slight abuse of notation, we will use ω(x) to represent both the alternating
bilinear form and its vectorial representative. Hence the current of a surface S can
be rewritten from Eq. 10.9 as follows:

$$[S](\omega) = \int_S \langle \omega(x), n(x) \rangle\, d\sigma(x), \tag{10.11}$$

with n(x) the unit normal vector to the surface: $n(x) := u_1(x) \times u_2(x)$.
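The identity behind Eq. 10.10, namely that pairing a vector with a cross product equals a 3 × 3 determinant, is easy to check numerically. The snippet below is a small illustrative check with arbitrary random vectors; the variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
# w plays the role of the vectorial representative; alpha and beta are two arbitrary vectors.
w, alpha, beta = rng.normal(size=(3, 3))

lhs = np.dot(w, np.cross(alpha, beta))                  # <w, alpha x beta>
rhs = np.linalg.det(np.column_stack([alpha, beta, w]))  # det(alpha, beta, w)
```

When α and β form an orthonormal basis of the tangent plane, their cross product is the unit normal n(x), which is how Eq. 10.11 follows from Eq. 10.9.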
We define a Hilbert metric $\langle \cdot, \cdot \rangle_W$ on the space of vector fields of $\mathbb{R}^3$, and require
the space W to be continuously embedded in $C_0^1(\mathbb{R}^3, \mathbb{R}^3)$. The space of currents we
consider is the space of continuous linear forms on W, i.e. the dual space $W^*$, and
the required embedding property ensures that for a large class of oriented surfaces
S in $\mathbb{R}^3$, comprising smooth surfaces and also triangulated meshes, the associated
linear mapping [S] is indeed a current, i.e. it belongs to $W^*$.
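The central object from the computational point of view is the reproducing kernel
of the space W, which we introduce here. For any point $x \in \mathbb{R}^3$ and vector $\alpha \in \mathbb{R}^3$ one
can consider the Dirac functional $\delta_x^\alpha : \omega \mapsto \langle \omega(x), \alpha \rangle$, which is an element of $W^*$.
The Riesz representation theorem then states that there exists a unique $u \in W$ such
that for all $\omega \in W$, $\langle u, \omega \rangle_W = \delta_x^\alpha(\omega) = \langle \omega(x), \alpha \rangle$. u is thus a vector field which
depends on x and linearly on α, and we write it $u = K_W(\cdot, x)\alpha$. Thus we have the
rule

$$\langle K_W(\cdot, x)\alpha, \omega \rangle_W = \langle \omega(x), \alpha \rangle. \tag{10.12}$$

Moreover, applying this formula to $\omega = K_W(\cdot, y)\beta$ for any other point $y \in \mathbb{R}^3$
and vector $\beta \in \mathbb{R}^3$, we get

$$\langle K_W(\cdot, x)\alpha, K_W(\cdot, y)\beta \rangle_W = \langle K_W(x, y)\beta, \alpha \rangle = \alpha^T K_W(x, y)\beta = \langle \delta_x^\alpha, \delta_y^\beta \rangle_{W^*}. \tag{10.13}$$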
$K_W(x, y)$ is a 3 × 3 matrix, and the mapping $K_W : \mathbb{R}^3 \times \mathbb{R}^3 \to \mathbb{R}^{3 \times 3}$ is called
the reproducing kernel of the space W. Now, note that we can rewrite Eq. 10.11 as

$$[S](\omega) = \int_S \delta_x^{n(x)}(\omega)\, d\sigma(x). \tag{10.14}$$

Thus using Eq. 10.13, one can prove that for two surfaces S and T,

$$\langle [S], [T] \rangle_{W^*} = \int_S \int_T \langle n_S(x), K_W(x, y)\, n_T(y) \rangle\, d\sigma(x)\, d\sigma(y). \tag{10.15}$$

This formula defines the metric we use for evaluating spatial proximity between
shapes. It is clear that the type of kernel one uses fully determines the metric and
therefore will have a direct impact on the behaviour of the algorithms. We use scalar
invariant kernels of the form $K_W(x, y) = h(\|x - y\|^2 / \sigma_W^2)\, I_3$, where h is a real
function such as $h(r) = e^{-r}$ (Gaussian kernel) or $h(r) = 1/(1 + r)$ (Cauchy kernel),
and $\sigma_W$ a scale factor. In practice this scale parameter has a strong influence on the
results; we will go back to this point later.
We can now define the optimal match between two currents [S] and [T], which
is the diffeomorphism minimizing the functional

$$J_{S,T}(v) = \gamma E(v) + \| [\varphi_v(S)] - [T] \|^2_{W^*}. \tag{10.16}$$

This functional is non-convex, and in practice we use a gradient descent algorithm
to perform the optimization, which is not guaranteed to reach a global minimum.
We observed empirically that local minima can be avoided by using a multi-scale
approach in which several optimization steps are performed with decreasing values
of the width $\sigma_W$ of the kernel $K_W$ (each step provides an initial guess for the next
one).
In practice, surfaces are given as triangulated meshes, which we discretize in
the space of currents $W^*$ by combinations of Dirac functionals: $[S] \approx \sum_{f \in S} \delta_{c_f}^{n_f}$,
where the sum is taken over all triangles $f = (f_1, f_2, f_3)$ of the mesh S, and
$c_f = \frac{1}{3}(f_1 + f_2 + f_3)$, $n_f = \frac{1}{2}(f_2 - f_1) \times (f_3 - f_1)$ denote respectively
the center and normal vector of the triangle. Given a deformation map ϕ and a
triangulated surface S, we also approximate its image ϕ(S) by the triangulated mesh
obtained by letting ϕ act only on the vertices of S. This leads us to the following
discrete formulation of the matching problem:

$$\begin{aligned} J^d_{S,T}(\alpha) = {} & \gamma \int_0^1 \sum_{i,j=1}^{n} \alpha_i(t)^T K_V(x_i(t), x_j(t))\, \alpha_j(t)\, dt \\ & + \sum_{f, f' \in S} n_{\varphi(f)}^T K_W(c_{\varphi(f)}, c_{\varphi(f')})\, n_{\varphi(f')} + \sum_{g, g' \in T} n_g^T K_W(c_g, c_{g'})\, n_{g'} \\ & - 2 \sum_{f \in S,\, g \in T} n_{\varphi(f)}^T K_W(c_{\varphi(f)}, c_g)\, n_g, \end{aligned} \tag{10.17}$$

where ϕ denotes the diffeomorphism associated to the momentum vectors $\alpha_i(t)$ and
trajectories $x_i(t)$, $x_i = x_i(0)$ being the vertices of the mesh S, and where we write,
for any face f, $\varphi(f) = (\varphi(f_1), \varphi(f_2), \varphi(f_3))$. We note 3 important parameters: γ,
which controls the regularity of the map, $\sigma_V$, which controls the scale in the space of
deformations, and $\sigma_W$, which controls the scale in the space of currents.
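The data attachment part of the discrete functional, i.e. the squared distance between two triangulated meshes in the space of currents, can be sketched directly from the face centers and normals. This is a minimal illustration assuming the Gaussian kernel $h(r) = e^{-r}$; the function names are hypothetical, and the dense double sum is quadratic in the number of faces, so it is only suitable for small meshes.

```python
import numpy as np

def centers_normals(vertices, faces):
    """Face centers c_f and area-weighted normals n_f of a triangulated mesh."""
    f1, f2, f3 = (vertices[faces[:, k]] for k in range(3))
    centers = (f1 + f2 + f3) / 3.0
    normals = 0.5 * np.cross(f2 - f1, f3 - f1)  # length equals the triangle area
    return centers, normals

def current_dot(c1, n1, c2, n2, sigma_w):
    """Discrete version of Eq. 10.15 with the (assumed) Gaussian kernel."""
    d2 = ((c1[:, None, :] - c2[None, :, :]) ** 2).sum(axis=-1)
    return float(np.sum(np.exp(-d2 / sigma_w**2) * (n1 @ n2.T)))

def current_dist2(mesh_s, mesh_t, sigma_w):
    """Squared W* distance between two meshes given as (vertices, faces) pairs."""
    cs, ns = centers_normals(*mesh_s)
    ct, nt = centers_normals(*mesh_t)
    return (current_dot(cs, ns, cs, ns, sigma_w)
            + current_dot(ct, nt, ct, nt, sigma_w)
            - 2.0 * current_dot(cs, ns, ct, nt, sigma_w))
```

For example, a single triangle compared with itself gives a distance of zero, while the same triangle with reversed orientation gives four times its squared norm, which reflects the sensitivity of currents to local orientation.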

10.3 A Template Estimation for Large Database via LDDMM

10.3.1 Why Building a Template?

A central notion in computational anatomy is the generation of registration maps,
mapping a large set of anatomical data to a common coordinate system to study
intra-population variability and inter-population differences. In this chapter, we use
the method introduced by Glaunès et al. [5], which estimates a template given a
collection of unlabelled point sets or surfaces in the framework of scalar measures
and currents. In our case we use the framework of currents. This method is posed as
a minimum mean squared error estimation problem and uses the metric on the space
of diffeomorphisms. Let $S_i$ be N surfaces in $\mathbb{R}^3$ (i.e. the whole surface population).
Let $[S_i]$ be the corresponding current of $S_i$, or its approximation by a finite sum of
vectorial Diracs. The problem is formulated as follows:

$$(\hat{v}_i, \hat{T}) = \arg\min_{v_i, T} \sum_{i=1}^{N} \left\{ \| T - [\varphi_{v_i}(S_i)] \|^2_{W^*} + \gamma E(v_i) \right\}, \tag{10.18}$$

where the minimization is performed over the spaces $L^2([0, 1], V)$ for the velocity
fields $v_i$ and over the space of currents $W^*$ for T. The method uses an alternating
optimization, i.e. surfaces are successively matched to the template, then the template
is updated, and this sequence is iterated until convergence. One can observe that when
the $\varphi_{v_i}$ are fixed, the functional is minimized when T is the average of the $[\varphi_{v_i}(S_i)]$ in
the space $W^*$:

$$T = \frac{1}{N} \sum_{i=1}^{N} [\varphi_{v_i}(S_i)], \tag{10.19}$$

which makes the optimization with respect to T straightforward. This optimal current
is not a surface itself; in practice it is constituted by the union of all surfaces $\varphi_{v_i}(S_i)$,
and the $\frac{1}{N}$ factor acts as if all normal vectors to these surfaces were weighted by $\frac{1}{N}$.
At the end of the optimization process however, all surfaces being co-registered, the
$\hat{\varphi}_{v_i}(S_i)$ are close to each other, which makes the optimal template $\hat{T}$ close to being
a true surface.
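The description of the mean current as a union of surfaces whose normals are weighted by 1/N translates directly into array operations. In the sketch below, a discretized current is stored as a hypothetical pair `(centers, normals)` of arrays, and a Gaussian kernel is assumed for the inner product; both choices are illustrative only.

```python
import numpy as np

def dot_currents(a, b, sigma_w):
    """Inner product in W* of two discretized currents (assumed Gaussian kernel)."""
    (ca, na), (cb, nb) = a, b
    d2 = ((ca[:, None, :] - cb[None, :, :]) ** 2).sum(axis=-1)
    return float(np.sum(np.exp(-d2 / sigma_w**2) * (na @ nb.T)))

def mean_current(currents):
    """Eq. 10.19: union of all Diracs, with every normal vector scaled by 1/N."""
    n = len(currents)
    centers = np.concatenate([c for c, _ in currents])
    normals = np.concatenate([v / n for _, v in currents])
    return centers, normals
```

Because the inner product is bilinear, pairing the mean of two currents with a third one gives the average of the two individual pairings; in particular the mean of two copies of the same current has the same norm as the original.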
In practice, we stop the template estimation method after P loops, and with the
datasets we use, P = 7 seems to be sufficient to obtain an adequate template.
As detailed in Sect. 10.2, obtaining a template allows one to perform statistical analysis
of the deformation maps via the initial momentum representation to characterize the
population. One can run analysis on momentum vectors such as Principal Component
Analysis (PCA), or estimate an approximation of pairwise diffeomorphic distances
between subjects using the estimated template [17] in order to use manifold learning
methods like Isomap [18].
In the present case, the optimal template for the population is not a true surface
but is defined, in the space of currents, by the mean $\hat{T} = \frac{1}{N} \sum_{j=1}^{N} [\hat{\varphi}_{v_j}(S_j)]$. However
this makes no difference from the point of view of statistical analysis, because this
template can be used in the LDDMM framework exactly as if it were a true surface.
One may speed up the estimation process and avoid local minima issues by defining
a good initialization of the optimization process. Standard initialization consists
in setting $T = \frac{1}{N} \sum_{i=1}^{N} [S_i]$, which means that the initial template is defined as the
combination of all unregistered shapes in the population. Alternatively, if one is given
a good initial guess T, the convergence speed of the method can be improved. This
is the primary motivation for the introduction of the iterative centroid method which
we present in the next section.

10.3.2 The Iterative Centroid Method

As presented in the introduction, computing a template in the LDDMM framework
can be highly time consuming, taking a few days or some weeks for large real-world
databases. To increase the speed of the method, one of the key points may be to
start with a good initial template, already correctly centred among shapes in the
population. Of course the computation time of such an initialization method must be
substantially lower than that of the template estimation itself. The Iterative Centroid method
presented here performs such an initialization with N − 1 pairwise matchings only.
The LDDMM framework, in an ideal setting (exact matching between shapes),
sets the template estimation problem as the computation of a centroid on a Riemannian
manifold, which is of finite dimension in the discrete case (we limit our
analysis to this finite dimensional setting in what follows). The Fréchet mean is the
standard way of defining such a centroid and provides the basic inspiration of all
LDDMM template estimation methods. Since our Iterative Centroid method is also
inspired by considerations about computation of centroids in Euclidean space and
their analogues on Riemannian manifolds, we will briefly discuss these ideas in the
following.

Centroid computation on Euclidean and Riemannian spaces. If $x_i$, $1 \le i \le N$, are
points in $\mathbb{R}^d$, then their centroid is defined as

$$b_N = \frac{1}{N} \sum_{i=1}^{N} x_i. \tag{10.20}$$

It also satisfies the following:

$$b_N = \arg\min_{y \in \mathbb{R}^d} \sum_{1 \le i \le N} \| y - x_i \|^2. \tag{10.21}$$

Now, when considering points $x_i$ living on a Riemannian manifold M (we assume
M is path-connected and geodesically complete), the definition of $b_N$ cannot be used
because M is not a vector space. However the variational characterization of $b_N$ has
an analogue, which leads to the definition of the Fréchet mean, also called 2-mean,
which is uniquely defined under some constraints (see [10]) on the relative locations
of the points $x_i$ in the manifold:

$$b_N = \arg\min_{y \in M} \sum_{1 \le i \le N} d_M(y, x_i)^2. \tag{10.22}$$
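As a concrete illustration of Eq. 10.22, the Fréchet mean can be computed on a simple manifold where geodesics are explicit. The sketch below uses the unit sphere in $\mathbb{R}^3$ and the classical fixed-point (gradient descent) iteration $y \leftarrow \exp_y\big(\frac{1}{N}\sum_i \log_y x_i\big)$; the sphere is chosen purely for illustration and has nothing to do with the shape spaces of this chapter, and the function names are ours.

```python
import numpy as np

def log_map(y, x):
    """Tangent vector at y pointing towards x, with length d_M(y, x) (unit sphere)."""
    c = np.clip(np.dot(y, x), -1.0, 1.0)
    u = x - c * y                      # component of x orthogonal to y
    nu = np.linalg.norm(u)
    return np.zeros(3) if nu < 1e-12 else np.arccos(c) * u / nu

def exp_map(y, v):
    """Point reached from y by following the geodesic with initial velocity v."""
    nv = np.linalg.norm(v)
    return y if nv < 1e-12 else np.cos(nv) * y + np.sin(nv) * v / nv

def frechet_mean(points, n_iter=100):
    """Fixed-point iteration for Eq. 10.22, started from the first point."""
    y = points[0]
    for _ in range(n_iter):
        y = exp_map(y, np.mean([log_map(y, x) for x in points], axis=0))
    return y
```

For three points placed symmetrically around the north pole at the same small colatitude, the iteration converges to the pole, as expected from the symmetry of the configuration.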

Many mathematical studies (for example Kendall [19], Karcher [20], Le [21],
Afsari [22, 23]) have focused on proving the existence and uniqueness of the mean,
as well as proposing algorithms to compute it. The more general notion of p-mean
of a probability measure μ on a Riemannian manifold M is defined by:

$$b = \arg\min_{x \in M} F_p(x), \qquad F_p(x) = \int_M d_M(x, y)^p\, \mu(dy). \tag{10.23}$$

In 2012, Arnaudon et al. [10] published, for $p \ge 1$, a stochastic algorithm which
converges almost surely to the p-mean of the probability measure μ. This algorithm
does not require computing the gradient of the functional $F_p$ to be minimized. The
authors construct a time-inhomogeneous Markov chain by choosing at each step a
random point P with distribution μ and moving the current point X to a new position
along the geodesic connecting X to P. As will become clear in the following, our
method shares similarities with this method for the case p = 2, in that it also uses
an iterative process which at each step moves the current position to a new position
along a geodesic. However our method is not stochastic and does not compute the
2-mean of the points. Moreover, our approach stops after N − 1 iterations, whereas
the stochastic method is not guaranteed to have considered all subjects of
the population after N iterations.

Other definitions of centroids in the Riemannian setting can be proposed. The following
ideas are more directly connected to our method. Going back to the Euclidean
case, one can observe that $b_N$ satisfies the following iterative relation:

$$\begin{cases} b_1 = x_1 \\ b_{k+1} = \frac{k}{k+1}\, b_k + \frac{1}{k+1}\, x_{k+1}, \quad 1 \le k \le N - 1, \end{cases} \tag{10.24}$$

which has the side benefit that at each step $b_k$ is the centroid of the $x_i$, $1 \le i \le k$. This
iterative process has an analogue in the Riemannian case, because one can interpret
the convex combination $\frac{k}{k+1} b_k + \frac{1}{k+1} x_{k+1}$ as the point located along the geodesic
linking $b_k$ to $x_{k+1}$, at a distance equal to $\frac{1}{k+1}$ of the total length of the geodesic,
which we can write $\mathrm{geod}(b_k, x_{k+1}, \frac{1}{k+1})$. This leads to the following definition in
the Riemannian case:

$$\begin{cases} \tilde{b}_1 = x_1 \\ \tilde{b}_{k+1} = \mathrm{geod}(\tilde{b}_k, x_{k+1}, \frac{1}{k+1}), \quad 1 \le k \le N - 1. \end{cases} \tag{10.25}$$

Of course this new definition of centroid does not coincide with the Fréchet mean
when the metric is not Euclidean, and furthermore it has the drawback of depending on
the ordering of the points $x_i$. Moreover one may consider other iterative procedures such
as computing midpoints between arbitrary pairs of points $x_i$, and then midpoints of
the midpoints, etc. In other words, all procedures that are based on decomposing the
Euclidean equality $b_N = \frac{1}{N} \sum_{i=1}^{N} x_i$ as a sequence of pairwise convex combinations
lead to possible alternative definitions of centroid in a Riemannian setting. Based on
these remarks, Emery and Mokobodzki [24] proposed to define the centroid not as a
unique point but as the set $B_N$ of points $x \in M$ satisfying

$$f(x) \le \frac{1}{N} \sum_{i=1}^{N} f(x_i), \tag{10.26}$$

for any convex function f on M (a convex function f on M being defined by the
property that its restriction to all geodesics is convex). This set $B_N$ takes into account
all centroids obtained by bringing together points $x_i$ by all possible means, i.e. recursively
by pairs, or by iteratively adding a new point, as explained above (see Fig. 10.2).
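The recursions 10.24 and 10.25 differ only in the geodesic they use, which the following sketch makes explicit. It is a toy illustration on points rather than shapes: `geod_euclid` and `geod_sphere` are hypothetical helpers implementing geod(a, b, t) in $\mathbb{R}^d$ and on the unit sphere respectively, the sphere standing in for a generic curved manifold.

```python
import numpy as np

def geod_euclid(a, b, t):
    """Point at fraction t along the straight segment from a to b."""
    return (1.0 - t) * a + t * b

def geod_sphere(a, b, t):
    """Point at fraction t along the minimizing great-circle arc from a to b."""
    theta = np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))
    if theta < 1e-12:
        return a
    return (np.sin((1.0 - t) * theta) * a + np.sin(t * theta) * b) / np.sin(theta)

def iterative_centroid(points, geod):
    """Eq. 10.25: b_1 = x_1, then b_{k+1} = geod(b_k, x_{k+1}, 1/(k+1))."""
    b = points[0]
    for k in range(1, len(points)):
        b = geod(b, points[k], 1.0 / (k + 1))
    return b

pts = np.eye(3)  # three points that lie both in R^3 and on the unit sphere
b_euclid = iterative_centroid(pts, geod_euclid)
b_forward = iterative_centroid(pts, geod_sphere)
b_reverse = iterative_centroid(pts[::-1], geod_sphere)
```

In the Euclidean case the result is the arithmetic mean whatever the ordering, whereas on the sphere `b_forward` and `b_reverse` differ, which illustrates the dependence on the ordering of the points mentioned above.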
Outline of the method. The Iterative Centroid method consists roughly in applying
the following procedure: given a collection of N shapes $S_i$, we successively update
the centroid by matching it to the next shape and moving along the geodesic flow.
Figure 10.1 illustrates the general idea. We propose two alternative ways for the
update step (Algorithms 1 and 2 below).
Direct iterative centroid: IC1. The first version of the method computes a centroid
between two objects $O_1$ and $O_2$ by transporting the first object $O_1$ along the geodesic
going from this object to $O_2$. The transport is stopped depending on the weights of
the objects. If the weight of $O_1$ is $w_1$, and the weight of $O_2$ is $w_2$ with $w_1 + w_2 = 1$,
we stop the deformation of $O_1$ at time $t = w_2$. Since the method is iterative, the first
two objects are two subjects of the population; for the next step, the first
object is the previous centroid and the second object is a new subject of the population.
The algorithm proceeds as presented in Algorithm 1.

Fig. 10.1 Illustration of the method. Left image red stars are subjects of the population, the yellow
star is the final Centroid, and orange stars are iterations of the centroid. Right image Final centroid
with the hippocampus population from Data1 (red). See Sect. 10.4.1 for more details about datasets

Fig. 10.2 Diagrams of the iterative processes which lead to the centroid computation. The tops
of the diagrams represent the final centroid. The diagram on the left corresponds to the iterative
centroid algorithms (IC1 and IC2). The diagram on the right corresponds to the pairwise algorithm
(PW)

Data: N surfaces $S_i$
Result: 1 surface $B_N$ representing the centroid of the population
$B_1 = S_1$;
for i from 1 to N − 1 do
    $B_i$ is matched to $S_{i+1}$ using Eq. (10.16), which results in a deformation map $\phi_{v_i}(x, t)$;
    Set $B_{i+1} = \phi_{v_i}(B_i, \frac{1}{i+1})$, which means we transport $B_i$ along the geodesic and stop at
    time $t = \frac{1}{i+1}$;
end
Algorithm 1: Iterative Centroid 1 (IC1)

Iterative centroid with averaging in the space of currents: IC2. Because matchings
are inaccurate, the centroid computed with the method presented above accumulates
small errors which can have an impact on the final centroid. Furthermore, the centroid
computed with Algorithm 1 is in fact a deformation of the first shape $S_1$, which makes
the procedure even more dependent on the ordering of subjects than it would be in
an ideal exact matching setting. In this second algorithm, we modify the updating
step by computing a mean in the space of currents between the deformation of the
current centroid and the backward flow of the current shape being matched. Hence the
computed centroid is not a true surface but a current, i.e. a combination of surfaces, as
in the template estimation method. The weights chosen in the averaging reflect the
relative importance of the new shape, so that at the end of the procedure, all shapes
forming the centroid have equal weight $\frac{1}{N}$. The algorithm proceeds as presented in
Algorithm 2.

Data: N surfaces $S_i$
Result: 1 current $B_N$ representing the centroid of the population
$B_1 = [S_1]$;
for i from 1 to N − 1 do
    $B_i$ is matched to $S_{i+1}$ using Eq. (10.16), which results in a deformation map $\phi_{v_i}(x, t)$;
    Set $B_{i+1} = \frac{i}{i+1}\, \phi_{v_i}(B_i, \frac{1}{i+1}) + \frac{1}{i+1}\, [\phi_{u_i}(S_{i+1}, \frac{i}{i+1})]$, which means we transport $B_i$
    along the geodesic and stop at time $t = \frac{1}{i+1}$;
    where $u_i(x, t) = -v_i(x, 1 - t)$, i.e. $\phi_{u_i}$ is the reverse flow map.
end
Algorithm 2: Iterative Centroid 2 (IC2)

Note that we have used the notation $\phi_{v_i}(B_i, \frac{1}{i+1})$ to denote the transport
(push-forward) of the current $B_i$ by the diffeomorphism. Here $B_i$ is a linear combination
of currents associated to surfaces, and the transported current is the linear combination
(keeping the weights unchanged) of the currents associated to the transported
surfaces.
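The claim that the IC2 update leaves every shape with weight 1/N at the end can be checked by tracking only the weights, ignoring the deformations. The short sketch below does this with exact fractions; the function name `ic2_weights` is illustrative.

```python
from fractions import Fraction

def ic2_weights(n):
    """Weights of the shapes in the centroid after the n - 1 updates of IC2."""
    weights = [Fraction(1)]                      # B_1 = [S_1]
    for i in range(1, n):
        # Existing components are scaled by i/(i+1); the new shape enters with weight 1/(i+1).
        weights = [w * Fraction(i, i + 1) for w in weights] + [Fraction(1, i + 1)]
    return weights
```

After step i every component carries weight 1/(i+1), since a previous weight of 1/i multiplied by i/(i+1) gives 1/(i+1), so `ic2_weights(N)` is a list of N copies of 1/N.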
An alternative method: Pairwise Centroid (PW). Another possibility is to group
objects by pairs, compute centroids (middle points) for each pair, and then recursively
apply the same procedure to the set of centroids, until only one centroid remains
(see Fig. 10.2). This pairwise method also depends on the ordering of subjects, and
also provides a centroid which satisfies the definition of Emery and Mokobodzki
(disregarding the inaccuracy of matchings).
When the population is composed of more than 3 subjects, we split the population
in two parts and recursively apply the same splitting until having two or three objects
in each group. We then apply Algorithm 1 to obtain the corresponding centroid before
going back up along the dyadic tree, paying attention to the weight of each
object. This recursive algorithm is described in Algorithm 3.

Data: N surfaces $S_i$
Result: 1 surface B representing the centroid of the population
if $N \ge 2$ then
    $B_{left}$ = Pairwise Centroid($S_1, \ldots, S_{[N/2]}$);
    $B_{right}$ = Pairwise Centroid($S_{[N/2]+1}, \ldots, S_N$);
    $B_{left}$ is matched to $B_{right}$, which results in a deformation map $\phi_v(x, t)$;
    Set $B = \phi_v(B_{left}, \frac{[N/2]+1}{N})$, which means we transport $B_{left}$ along the geodesic and stop
    at time $t = \frac{[N/2]+1}{N}$;
end
else
    $B = S_1$
end
Algorithm 3: Pairwise Centroid
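The recursive structure of the pairwise centroid can be sketched on points, with a generic `geod(a, b, t)` function standing in for the matching and transport steps. This is an illustration under stated assumptions, not the implementation used for the experiments: the stopping time is taken here as the fraction of subjects in the right-hand group, (N − [N/2])/N, which coincides with the value ([N/2] + 1)/N of Algorithm 3 when N is odd and is our own choice otherwise, made so that the Euclidean case reproduces the arithmetic mean exactly.

```python
import numpy as np

def geod_euclid(a, b, t):
    """Point at fraction t along the straight segment from a to b."""
    return (1.0 - t) * a + t * b

def pairwise_centroid(points, geod):
    """Recursive pairwise centroid on an (N, d) array of points."""
    n = len(points)
    if n == 1:
        return points[0]
    half = n // 2
    left = pairwise_centroid(points[:half], geod)
    right = pairwise_centroid(points[half:], geod)
    # Move from the left centroid towards the right one, proportionally to
    # the number of subjects in the right group (assumed weighting).
    return geod(left, right, (n - half) / n)
```

With `geod_euclid`, an induction on N shows that the result is the arithmetic mean of the points; with a Riemannian geodesic the result depends on how the population is split.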

10.3.3 Implementation

The methods presented above require some parameters. Indeed, in each algorithm
we have to compute the matching from one surface to another. For each matching
we minimize the corresponding functional (see Eq. 10.17 at the end of Sect. 10.2.2),
which estimates the new momentum vectors α, which are then used to update the
positions of the points $x_i$ of the surface. A gradient descent with adaptive step size is used
for the minimization of the functional J. Evaluating the functional and its gradient
requires numerical integration of high-dimensional ordinary differential equations,
which is done using the Euler trapezoidal rule.
The main parameters for computing J are maxiter which is the maximum number
of iterations for the adaptive step size gradient descent algorithm, γ for the regularity
of the matching, and σW and σV the sizes of the kernels which control the metric of
the spaces W and V .
We selected parameters in order to obtain relatively good matchings in a short time.
We chose γ close enough to zero to force the matching to bring the first object
onto the second one. Nevertheless, we must be prudent: choosing too small a γ could
be hazardous because the regularity of the deformation might not be preserved. For
each pairwise matching, we use the multi-scale approach described in Sect. 10.2.2,
performing four consecutive optimization processes in which the $\sigma_W$ parameter, which
is the size of the kernel of the R.K.H.S. W, decreases by a constant factor, in order to
increase the precision of the matching. At the beginning, we fix this $\sigma_W$ parameter
at a sufficiently large value in order to capture the possibly large variations or
differences between shapes. For this reason, we use a small maxiter parameter for the
first two minimizations of the functional. For the results presented
below, we used very small values for the parameter maxiter = [50, 50, 100, 300], to
speed up the method. Results can be less accurate than in our previous
study [25], which used different values for maxiter: [40, 40, 100, 1000], which take
twice as much time to compute. For the kernel size $\sigma_V$ of the deformation space, we
fix this parameter at the beginning and have to adapt it to the size of the data.

Fig. 10.3 On the left, an iterative centroid of the dataset data2 (see Sect. 10.4.1 for more details
about datasets) computed using the IC1 algorithm, and on the right the IC2 algorithm

The first method starts from N surfaces, and gives a centroid composed of only
one surface, which is a deformation of the surface used at the initialization step. An
example is shown in Fig. 10.3. This method is rather fast, because at each step we
have to match only one mesh composed of $n_1$ vertices to another, where $n_1$ is the
number of vertices of the first mesh of the iterative procedure.
The second method starts from N surfaces and gives a centroid composed of
deformations of all surfaces of the population. At each step it forms a combination
in the space of currents between the current centroid and a backward flow of the new
surface being matched. In practice this implies that the centroid grows in complexity;
at step i its number of vertices is $\sum_{j=1}^{i} n_j$. Hence this algorithm is slower than
the first one, but the mesh structure of the final centroid does not depend on the mesh
of only one subject of the population, and the combination compensates for the bias
introduced by the inaccuracy of matchings.
The results of the Iterative Centroid algorithms depend on the ordering of subjects.
We will study this dependence in the experimental part, and also study the effect of
stopping the I. C. before it completes all iterations.

10.4 Experiments and Results

10.4.1 Data

To evaluate our approach, we used data from 95 young (14–16 years old) subjects
from the European database IMAGEN. The anatomical structure that we considered
was the hippocampus, which is a small bilateral structure of the temporal lobe of the
brain involved in memory processes. The hippocampus is one of the first structures
to be damaged in Alzheimer’s disease; it is also implicated in temporal lobe epilepsy,
and is altered in stress and depression. Ninety-five left hippocampi were segmented
from T1-weighted Magnetic Resonance Images (MRI) of this database (see Fig. 10.4)
with the software SACHA [26], before computing meshes from the binary masks
using BrainVISA software.1

Fig. 10.4 Left panel coronal view of the MRI with the meshes of hippocampi segmented by the
SACHA software [26], the right hippocampus is in green and the left one in pink. Right panel 3D
view of the hippocampi

Fig. 10.5 Top to bottom meshes from Data1 (n = 500), Data2 (n = 95) and RealData (n = 95)

We denote as RealData the dataset composed of all 95 hippocampi meshes. We
rigidly aligned all hippocampi to one subject of the population. For this rigid
registration, we used a similarity term based on measures (as in [27]) rather than currents.
1 http://www.brainvisa.info

We also built two synthetic populations of hippocampi meshes, denoted as Data1
and Data2. Data1 is composed of a large number of subjects, in order to test our
algorithms on a large dataset. In order to study separately the effect of the population
size, meshes of this population are simple. Data2 is a synthetic population close
to the real one, with the difference that all subjects have the same mesh structure.
This allows us to test our algorithms in a population with a single mesh structure, thus
disregarding the effects of different mesh structures. These two datasets are defined
as follows (examples of subjects from these datasets are shown in Fig. 10.5):
• Data1 We chose one subject S0 that we decimated (down to 135 vertices) and
deformed using geodesic shooting in 500 random directions with a sufficiently
large kernel and a reasonable momentum vector norm in order to preserve the
overall hippocampal shape, resulting in 500 deformed objects. Each deformed
object was then further transformed by a translation and a rotation of small magni-
tude. This resulted in the 500 different shapes of Data1. All shapes in Data1 have
the same mesh structure. Data1 thus provides a large dataset with simple meshes
and mainly global deformations.
• Data2 We chose the same initial subject S0 that we decimated to 1001 vertices.
We matched this mesh to each subject of the dataset RealData (n = 95), using
diffeomorphic deformation, resulting in 95 meshes with 1001 vertices. Data2 has
more local variability than Data1, and is closer to the anatomical truth.

10.4.2 Effect of Subject Ordering

Each of the 3 proposed algorithms theoretically depends on the ordering of subjects.
Here, we aim to assess the influence of the ordering of subjects on the final centroid
for each algorithm.
For that purpose, we compared several centroids computed with different orderings.
For each dataset and each algorithm, we computed 10 different centroids. We
computed the mean m1 and the maximum of the distances between all pairs of centroids. The
three datasets have different variabilities. In order to relate the previous mean distance
to the variability, we also computed the mean distance m2 between each centroid
and all subjects of a given dataset. We finally computed the ratio m1/m2 between these two
mean distances. Distances between surfaces were computed in the space
of currents, i.e. to compare two surfaces S and T, we computed the squared norm
$\| [S] - [T] \|^2_{W^*}$. Results are presented in Table 10.1. Additionally, we computed the
mean of distances between centroids computed using the different methods. Results
are presented in Table 10.2.
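The quantities m1, the maximal pairwise distance, m2 and the ratio m1/m2 can be computed with a few lines once a distance function is available. In the sketch below, `dist` is a placeholder for the squared norm in the space of currents; the function name `ordering_stats` and the dictionary keys are illustrative.

```python
import itertools
import numpy as np

def ordering_stats(centroids, subjects, dist):
    """Summary of the dispersion of centroids relative to the population.

    centroids: centroids obtained with different orderings
    subjects:  all subjects of the dataset
    dist:      callable returning the distance between two objects (placeholder)
    """
    pairwise = [dist(a, b) for a, b in itertools.combinations(centroids, 2)]
    to_subjects = [dist(c, s) for c in centroids for s in subjects]
    m1, m2 = float(np.mean(pairwise)), float(np.mean(to_subjects))
    return {"m1": m1, "max": float(np.max(pairwise)), "m2": m2, "ratio": m1 / m2}
```

A small ratio means that the centroids obtained with different orderings are close to each other compared with the spread of the population around them.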
For each dataset and for each type of centroid, the mean of distances between
all 10 centroids is small compared to the mean of distances between the centroid
and the subjects. However, the three algorithms IC1, IC2 and PW were not equally
influenced by the ordering. IC2 seems to be the most stable: the different centroids
are very close one to each other, this being true for all datasets. This was expected
since we reduce the matching error by combining in the space of currents the actual
centroid with the deformation of the new subject along the reverse flow. For IC1, the
distance was larger for Data2 and RealData, which have anatomically more realistic
deformations, than for Data1, which has rather simplistic shapes. This suggests that,
for real datasets, IC1 is more dependent on the ordering than IC2. This is due to the
10 Diffeomorphic Iterative Centroid Methods 291

Table 10.1 Distances between centroids computed with different subject orderings, for each
dataset and each of the 3 algorithms
                     From different orderings         To the dataset
                     mean (m1)   max      std         mean (m2)   m1/m2
Data1      IC1       0.8682      1.3241   0.0526      91.25       0.0095
           IC2       0.5989      0.9696   0.0527      82.66       0.0072
           PW        3.5861      7.1663   0.1480      82.89       0.0433
Data2      IC1       2.4951      3.9516   0.2205      16.29       0.1531
           IC2       0.2875      0.4529   0.0164      15.95       0.0181
           PW        3.8447      5.3172   0.1919      17.61       0.2184
RealData   IC1       4.7120      6.1181   0.0944      18.54       0.2540
           IC2       0.5583      0.7867   0.0159      17.11       0.0326
           PW        5.3443      6.1334   0.1253      19.73       0.2708
The first three columns present the mean, the maximum and the standard deviation of distances
between all pairs of centroids computed with different orderings. The fourth column displays the
mean of distances between each centroid and all subjects of the dataset. Distances are computed
in the space of currents

Table 10.2 In columns, average distances between centroids computed using the different
algorithms
            IC1 versus IC2   IC1 versus PW   IC2 versus PW
Data1       1.57             5.72            6.31
Data2       1.89             3.60            3.42
RealData    3.51             5.31            4.96

fact that IC1 provides a less precise estimate of the centroid between two shapes
since it does not incorporate the reverse flow. For all datasets, distances for PW were
larger than those for IC1 and IC2, suggesting that the PW algorithm is the most
dependent on the subject ordering. Centroids computed with PW are
also farther from those computed using IC1 or IC2. Furthermore, we speculate that
the increased sensitivity of PW relative to IC1 may be due to the fact that, in IC1, n − 1
levels of averaging are performed (and only $\log_2 n$ for PW), leading to a reduction of
matching errors.
Finally, in order to provide a visualization of the differences, we present match-
ings between 3 centroids computed with the IC1 algorithm, in the case of RealData.
Figure 10.6 shows that shape differences are local and residual. Visually, the 3 centroids
are almost identical, and the amplitudes of the momentum vectors, which bring one
centroid to another, are small and local.

10.4.3 Position of the Centroids Within the Population

We also assessed whether the centroids are close to the center of the population. To
that purpose, we calculated the ratio

Fig. 10.6 a First row 3 initial subjects used for 3 different centroid computations with IC1 (mean
distance between such centroids, in the space of currents, is 4.71) on RealData. Second row the 3
centroids computed using the 3 subjects from the first row as initialization. b Maps of the amplitude
of the momentum vectors that map each centroid to another. Top and bottom views of the maps are
displayed. One can note that the differences are small and local

$$R = \frac{\left\| \frac{1}{N} \sum_{i=1}^{N} v_0(S_i) \right\|_V}{\frac{1}{N} \sum_{i=1}^{N} \left\| v_0(S_i) \right\|_V}, \qquad (10.27)$$

with v0 (Si ) the vector field corresponding to the initial momentum vector of the
deformation from the template or the centroid to the subject i. This ratio gives some
indication about the centering of the centroid, because in a pure Riemannian setting
(i.e. disregarding the inaccuracies of matchings), a zero ratio would mean that we are
at a critical point of the Fréchet functional, and under some reasonable assumptions
on the curvature of the shape space in the neighbourhood of the dataset (which we
cannot check however), it would mean that we are at the Fréchet mean. To compute
R, we need to match the centroid to all subjects of the population. We computed this
ratio on the best (i.e. the centroid which is the closest to all other centroids) centroid
for each algorithm and for each dataset.
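
The ratio of Eq. 10.27 can be illustrated on finite-dimensional vectors. The sketch below is only an illustration under strong assumptions: each initial vector field v0(Si) is represented as a flat list of coordinates, and the norm on V (a kernel-based norm in the chapter) is replaced by the Euclidean norm. The function name `centering_ratio` is hypothetical.

```python
import math

def centering_ratio(vectors):
    """Norm of the mean vector divided by the mean of the norms (Eq. 10.27).

    Assumption: each v0(S_i) is a flat list of floats and the V-norm is
    approximated by the Euclidean norm.
    """
    n = len(vectors)
    dim = len(vectors[0])
    mean_vec = [sum(v[k] for v in vectors) / n for k in range(dim)]
    norm_of_mean = math.sqrt(sum(c * c for c in mean_vec))
    mean_of_norms = sum(math.sqrt(sum(c * c for c in v)) for v in vectors) / n
    return norm_of_mean / mean_of_norms

# Vectors that cancel out: the reference point is "centered", so R is zero.
balanced = [[1.0, 0.0], [-1.0, 0.0], [0.0, 2.0], [0.0, -2.0]]
# Vectors that all point the same way: the reference point is off-center, R is near 1.
biased = [[1.0, 0.0], [1.0, 0.1], [1.0, -0.1]]

r_balanced = centering_ratio(balanced)
r_biased = centering_ratio(biased)
```

By the triangle inequality the ratio lies between 0 and 1, which is why values such as those of Table 10.3 can be read as a relative measure of centering.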
Results are presented in Table 10.3. We can observe that the centroids obtained
with the three different algorithms are reasonably centered for all datasets. Centroids
for Data1 are particularly well centered, which was expected given the nature of this
population. Centroids for Data2 and RealData are slightly less well centered but they

Table 10.3 Ratio values for assessing the position of the representative centroid within the
population, computed using Eq. 10.27 (for each algorithm and for each dataset)
R           IC1      IC2      PW
Data1       0.046    0.038    0.085
Data2       0.106    0.102    0.107
RealData    0.106    0.107    0.108

are still close to the Fréchet mean. It is likely that using more accurate matchings (and
thus increasing the computation time of the algorithms) we could reduce this ratio
for RealData and Data2. Besides, one can note that ratios for Data2 and RealData
are very similar; this indicates that the centering of the centroid is not altered by the
variability of mesh structures in the population.

10.4.4 Effects of Initialization on Estimated Template


The initial idea was to have a method which provides a good initialization for template
estimation methods for large databases. We just saw that IC1 and IC2 centroids are
both reasonably centered and depend only weakly on the subject ordering. Although
IC2 has the smallest sensitivity to the subject ordering, it is slower
and provides a centroid composed of N meshes. Because we want to decrease the
computation time for the template estimation of a large database, it is natural to
choose as initialization a centroid that is composed of only one mesh (which saves time in
kernel convolutions) and can be computed quickly. We advocate choosing IC1 over PW because we can
stop the IC1 algorithm at any step to get a centroid of the sub-population used so far.
Furthermore, PW seems to be more sensitive to the subject ordering.
Now, we study the impact of the use of a centroid, computed with the IC1 algo-
rithm, as initialization for the template estimation method presented in Sect. 10.3.1.
To that purpose, we compared the template obtained using a standard initialization,
denoted as T(StdInit), to the template initialized with the IC1 centroid, denoted as
T(IC1). We chose to stop the template estimation method after 7 iterations of the
optimization process. This number of iterations was chosen arbitrarily; it is large enough
to have a good convergence for T(IC1) and an acceptable convergence for
T(StdInit). We did not use a stopping criterion based on the W∗ metric because it
is highly dependent on the data and is difficult to establish when using a multiscale
approach. In addition to comparing T(IC1) to T(StdInit), we also compared the
templates corresponding to two different IC1 initializations based on two different
orderings. We compared the different templates in the space of currents. Results are
presented in Table 10.4. We also computed the same ratios R as in Eq. 10.27.
Results are presented in Table 10.5.
One can note that the differences between T(IC1) for different orderings are small
for Data1 and Data2 and larger for RealData, suggesting that these are due to the
mesh used for the initialization step. We can also observe that templates initialized

Table 10.4 Distances between templates initialized via different IC1 centroids (T(IC1)) for each
dataset, and distances between the template initialized via the standard initialization
(T(StdInit)) and templates initialized via IC1
            T(IC1) versus T(IC1)   T(IC1) versus T(StdInit)
Data1       0.9833                 40.9333
Data2       0.6800                 20.4666
RealData    4.0433                 26.8667

Table 10.5 Ratios R for templates (T(IC1)) and for the template with its usual initialization
T(StdInit), for each dataset
R           T(IC1)    T(StdInit)
Data1       0.0057    0.0062
Data2       0.0073    0.0077
RealData    0.0073    0.0074

Fig. 10.7 Estimated template from RealData. On the left, initialized via the standard initialization,
which is the whole population. On the right, estimated template initialized via an IC1 centroid

via IC1 are far, in terms of distances in the space of currents, from the template
initialized by the standard initialization. These results could be alarming, but the
ratios (see Table 10.5) indicate that the templates are all very close to the Fréchet
mean, and that the differences are not due to a bad template estimation. Moreover,
both templates are visually similar, as seen in Fig. 10.7.

10.4.5 Effect of the Number of Iterations for Iterative Centroids

Since it is possible to stop the Iterative Centroid methods IC1 and IC2 at any step,
we wanted to assess the influence of computing only a fraction of the N iterations
on the estimated template. Indeed, one may wonder whether computing an iterative centroid at, e.g.,
40 % (thus saving 60 % of computation time for the IC method) could be enough

Fig. 10.8 First row: graphs of average W∗-distances between the IC1 at x% and the final one. The
second row presents the same results with IC2

Table 10.6 Results of initialization of template estimation method by an IC1 at 40 %
                                          Data1    Data2    RealData
T(IC1 at 40 %) versus T(StdInit)          41.41    24.41    24.82
T(IC1 at 40 %) versus T(IC1 at 100 %)     9.36     9.56     6.18
R value for T(IC1 at 40 %)                0.040    0.106    0.105

to initialize a template estimation. Moreover, for large datasets, the last subject will
have a very small influence: for a database composed of 1000 subjects, the weight
of the last subject is 1/1000. We performed this experiment in the case of IC1. In
the following, we call “IC1 at x%” an IC1 computed using x × N /100 subjects of
the population.
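
The 1/N weight of the last subject and the notion of "IC1 at x%" can be illustrated with a Euclidean analogue. The sketch below is not the LDDMM algorithm: it assumes points in a vector space, where moving the current centroid towards the i-th point with weight 1/i reproduces the arithmetic mean. The function name `iterative_centroid` and its `fraction` parameter are hypothetical.

```python
def iterative_centroid(points, fraction=1.0):
    """Running mean over the first `fraction` of the points.

    Euclidean stand-in for an iterative centroid: the i-th point (1-based)
    is blended in with weight 1/i, so the last of N points has weight 1/N.
    """
    n_used = max(1, int(round(fraction * len(points))))
    centroid = list(points[0])
    for i in range(1, n_used):
        weight = 1.0 / (i + 1)
        centroid = [c + weight * (p - c) for c, p in zip(centroid, points[i])]
    return centroid

points = [[float(i), float(i % 3)] for i in range(10)]
full = iterative_centroid(points)                  # uses all 10 points
partial = iterative_centroid(points, fraction=0.4)  # "at 40 %": first 4 points
```

In this flat setting the full run equals the arithmetic mean whatever the ordering; in the diffeomorphic setting the result depends on the ordering and on matching accuracy, which is what the experiments of this section quantify.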
We computed the distance in the space of currents between “IC1 at x%” and IC1.
Results are presented in Fig. 10.8. These distances are averaged over the 10 centroids
computed for each dataset. We can note that after processing 40 % of the population,
the IC1 covers more than 75 % of the distance to the final centroid for all datasets.
We also compared T(IC1 at 40 %) to T(IC1) and to T(StdInit), using distances in
the space of currents, as well as the R ratio defined in Eq. 10.27. Results are shown in
Table 10.6. They show that using 40 % of the subjects substantially lowers the quality
of the resulting template. Indeed, the estimated template seems trapped in the local
minimum found by the IC1 at 40 %. We certainly have to take into account the size of
the dataset. Nevertheless, we believe that if the dataset is very large and sufficiently
homogeneous, we could stop the Iterative Centroid method before the end.

10.4.6 Computation Time


To speed up the matchings, we use a GPU implementation for the computation of
kernel convolutions, which constitutes the most time-consuming part of LDDMM

Table 10.7 Computation time (in h) for Iterative Centroids and for template estimation initialised
by IC1 (T(IC1)), the standard initialization (T(StdInit)) and by IC1 at 40 % (T(IC1 at 40 %))
Computation time (h)   Data1                  Data2                  RealData
IC1                    1.7                    0.7                    1.2
IC2                    5.2                    2.4                    7.5
PW                     1.4                    0.7                    1.2
T(IC1)                 21.1 (= 1.7 + 19.4)    13.3 (= 0.7 + 12.6)    27.9 (= 1.2 + 26.7)
T(StdInit)             96.1                   20.6                   99
T(IC1 at 40 %)         24.4 (= 0.7 + 23.7)    10.4 (= 0.3 + 10.1)    40.7 (= 0.5 + 40.2)
For T(IC1), we give the complete time for the whole process, i.e. the time for the IC1 computation
plus the time for the T(IC1) computation itself

methods. Computations were performed on an Nvidia Tesla C1060 card. Computation
times are displayed in Table 10.7.
We can note that the computation time of IC1 is comparable to that of PW and that
these algorithms are faster than the IC2 algorithm, as expected. The computation
time for any IC method (even for IC2) is much lower (by a factor of roughly 10 to 80)
than the computation time of the template estimation method. Moreover, initializing
the template estimation with IC1 can save up to 70 % of computation time over the
standard initialization. On the other hand, using T(IC1 at 40 %) does not reduce
computation time compared to using T(IC1).
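
The relative savings can be recomputed directly from the totals of Table 10.7. The short sketch below only performs this arithmetic; the dictionary layout and the function name `relative_saving` are illustrative choices, and the numbers are the table values as transcribed here.

```python
# Total computation times in hours, copied from Table 10.7.
hours = {
    "Data1": {"T(IC1)": 21.1, "T(StdInit)": 96.1},
    "Data2": {"T(IC1)": 13.3, "T(StdInit)": 20.6},
    "RealData": {"T(IC1)": 27.9, "T(StdInit)": 99.0},
}

def relative_saving(fast, slow):
    """Fraction of the slower run time that is saved by the faster run."""
    return 1.0 - fast / slow

savings = {
    name: relative_saving(t["T(IC1)"], t["T(StdInit)"])
    for name, t in hours.items()
}

for name, saving in savings.items():
    print(f"{name}: {100 * saving:.0f} % of the computation time saved")
```

With these values the saving is largest for Data1 and RealData and smallest for Data2, where the standard initialization was already comparatively fast.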
It could be interesting to evaluate the parameters which would lead to a more
precise centroid estimate in a time that would still be inferior to that needed for the
template estimation. We should also mention that one could speed up computations
by adding a Matching Pursuit on currents as described in [9].

10.5 Conclusion

We have proposed a new approach for the initialization of template estimation meth-
ods. The aim was to reduce computation time by providing a rough initial estimation,
making more feasible the application of template estimation on large databases.
To that purpose, we proposed to iteratively compute a centroid which is correctly
centered within the population. We proposed three different algorithms to compute
this centroid: the first two algorithms are iterative (IC1 and IC2) and the third one is
recursive (PW). We have evaluated the different approaches on one real and two syn-
thetic datasets of brain anatomical structures. Overall, the centroids computed with
all three approaches are close to the Fréchet mean of the population, thus providing
a reasonable centroid or initialization for template estimation method. Furthermore,
for all methods, centroids computed using different orderings are similar. It can be
noted that IC2 seems to be more robust to the ordering than IC1, which in turn seems
more robust than PW. Nevertheless, in general, all methods appear relatively robust
with respect to the ordering.

The advantage of iterative methods, like IC1 and IC2, is that we can stop the
deformation at any step, resulting in a centroid built with part of the population.
Thus, for large databases (composed for instance of 1000 subjects), it may not be
necessary to include all subjects in the computation since the weight of these subjects
will be very small. The iterative nature of IC1 and IC2 provides another interesting
advantage which is the possible online refinement of the centroid estimation as new
subjects are entered in the dataset. This leads to an increased possibility of interaction
with the image analysis process. On the other hand, the recursive PW method has
the advantage that it can be parallelized (still using GPU implementation), although
we did not implement this specific feature in the present work.
Using the centroid as initialization of the template estimation can substantially
speed up the convergence. For instance, using IC1 (which is the fastest one) as initialization
saved up to 70 % of computation time. Moreover, this method could certainly
be used to initialize other template estimation methods, such as the method proposed
by Durrleman et al. [6].
As we observed, the centroids, obtained with rough parameters, are close to the
Fréchet mean of the population. We therefore believe that, by computing the IC with more
precise parameters (but still reasonable in terms of computation time), we could obtain
centroids closer to the center. Such an accurate centroid could be seen as a cheap
alternative to true template estimation methods, particularly if computing a precise mean
of the population of shapes is not required. Indeed, in the LDDMM framework,
template-based shape analysis gives only a first-order, linearized approximation of
the geometry in shape space. In a future work, we will study the impact of using IC
as a cheap template on results of population analysis based for instance on kernel
principal component analysis. Finally, the present work deals with surfaces for which
the metric based on currents seems to be well-adapted. Nevertheless, the proposed
algorithms for centroid computation are general and could be applied to images,
provided that an adapted metric is used.

Acknowledgments The authors are grateful to Vincent Frouin, Jean-Baptiste Poline, Roberto Toro
and Edouard Duschenay for providing a sample of subjects from the IMAGEN dataset and to Marie
Chupin for the use of the SACHA software. The authors thank Professor Thomas Hotz for his
suggestion on the pairwise centroid, during the discussion of the GSI’13 conference. The research
leading to these results has received funding from ANR (project HM-TC, grant number ANR-09-
EMER-006, and project KaraMetria, grant number ANR-09-BLAN-0332) and from the program
Investissements d’avenir ANR-10-IAIHU-06.

References

1. Grenander, U., Miller, M.I.: Computational anatomy: an emerging discipline. Q. Appl. Math.
56(4), 617–694 (1998)
2. Christensen, G.E., Rabbitt, R.D., Miller, M.I.: Deformable templates using large deformation
kinematics. IEEE Trans. Image Process. 5(10), 1435–1447 (1996)

3. Beg, M.F., Miller, M.I., Trouvé, A., Younes, L.: Computing large deformation metric mappings
via geodesic flows of diffeomorphisms. Int. J. Comput. Vision 61(2), 139–157 (2005)
4. Vaillant, M., Miller, M.I., Younes, L., Trouvé, A.: Statistics on diffeomorphisms via tangent
space representations. Neuroimage 23, S161–S169 (2004)
5. Glaunès, J., Joshi, S.: Template estimation from unlabeled point set data and surfaces for com-
putational anatomy. In: Pennec, X., Joshi, S., (eds.) Proceedings of the International Workshop
on the Mathematical Foundations of Computational Anatomy (MFCA-2006), pp. 29–39, 1 Oct
2006
6. Durrleman, S., Pennec, X., Trouvé, A., Ayache, N., et al.: A forward model to build unbiased
atlases from curves and surfaces. In: 2nd Medical Image Computing and Computer Assisted
Intervention. Workshop on Mathematical Foundations of Computational Anatomy, pp. 68–79
(2008)
7. Durrleman, S., Prastawa, M., Korenberg, J.R., Joshi, S., Trouvé, A., Gerig, G.: Topology pre-
serving atlas construction from shape data without correspondence using sparse parameters.
In: Ayache, N., Delingette, H., Golland, P., Mori, K. (eds.) Medical Image Computing and
Computer-Assisted Intervention—MICCAI 2012. Lecture Notes in Computer Science, vol.
7512, pp. 223–230. Springer, Berlin (2012)
8. Ma, J., Miller, M.I., Trouvé, A., Younes, L.: Bayesian template estimation in computational
anatomy. Neuroimage 42(1), 252–261 (2008)
9. Durrleman, S., Pennec, X., Trouvé, A., Ayache, N.: Statistical models of sets of curves and
surfaces based on currents. Med. Image Anal. 13(5), 793–808 (2009)
10. Arnaudon, M., Dombry, C., Phan, A., Yang, L.: Stochastic algorithms for computing means of
probability measures. Stoch. Process. Appl. 122(4), 1437–1455 (2012)
11. Ando, T., Li, C.K., Mathias, R.: Geometric means. Linear Algebra Appl. 385, 305–334 (2004)
12. Schwartz, L.: Théorie des distributions. Bull. Amer. Math. Soc. 58, 78–85 (1952)
13. de Rham, G.: Variétés différentiables. Formes, courants, formes harmoniques. Actualités Sci.
Ind., no. 1222, Publ. Inst. Math. Univ. Nancago III. Hermann, Paris (1955)
14. Vaillant, M., Glaunes, J.: Surface matching via currents. In: Information Processing in Medical
Imaging, pp. 381–392. Springer, Berlin (2005)
15. Glaunes, J.: Transport par difféomorphismes de points, de mesures et de courants pour la
comparaison de formes et l’anatomie numérique. PhD thesis, Université Paris 13 (2005)
16. Durrleman, S.: Statistical models of currents for measuring the variability of anatomical curves,
surfaces and their evolution. PhD thesis, University of Nice-Sophia Antipolis (2010)
17. Yang, X.F., Goh, A., Qiu, A.: Approximations of the diffeomorphic metric and their applications
in shape learning. In: Information Processing in Medical Imaging (IPMI), pp. 257–270 (2011)
18. Tenenbaum, J., Silva, V., Langford, J.: A global geometric framework for nonlinear dimen-
sionality reduction. Science 290(5500), 2319–2323 (2000)
19. Kendall, W.S.: Probability, convexity, and harmonic maps with small image i: uniqueness and
fine existence. Proc. Lond. Math. Soc. 3(2), 371–406 (1990)
20. Karcher, H.: Riemannian center of mass and mollifier smoothing. Commun. Pure Appl. Math.
30(5), 509–541 (1977)
21. Le, H.: Estimation of riemannian barycentres. LMS J. Comput. Math. 7, 193–200 (2004)
22. Afsari, B.: Riemannian Lp center of mass: existence, uniqueness, and convexity. Proc. Am.
Math. Soc. 139(2), 655–673 (2011)
23. Afsari, B., Tron, R., Vidal, R.: On the convergence of gradient descent for finding the riemannian
center of mass. SIAM J. Control Optim. 51(3), 2230–2260 (2013)
24. Emery, M., Mokobodzki, G.: Sur le barycentre d’une probabilité dans une variété. In: Séminaire
de probabilités, vol. 25, pp. 220–233. Springer, Berlin (1991)
25. Cury, C., Glaunès, J.A., Colliot, O.: Template estimation for large database: a diffeomorphic
iterative centroid method using currents. In: Nielsen, F., Barbaresco, F. (eds.) GSI. Lecture
Notes in Computer Science, vol. 8085, pp. 103–111. Springer, Berlin (2013)

26. Chupin, M., Hammers, A., Liu, R.S.N., Colliot, O., Burdett, J., Bardinet, E., Duncan, J.S.,
Garnero, L., Lemieux, L.: Automatic segmentation of the hippocampus and the amygdala
driven by hybrid constraints: method and validation. Neuroimage 46(3), 749–761 (2009)
27. Glaunes, J., Trouvé, A., Younes, L.: Diffeomorphic matching of distributions: a new approach
for unlabelled point-sets and sub-manifolds matching. In: Proceedings of the 2004 IEEE Com-
puter Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 712–718
(2004)
Chapter 11
Hartigan’s Method for k-MLE: Mixture
Modeling with Wishart Distributions
and Its Application to Motion Retrieval

Christophe Saint-Jean and Frank Nielsen

Abstract We describe a novel algorithm called k-Maximum Likelihood Estimator
(k-MLE) for learning finite statistical mixtures of exponential families relying on
Hartigan’s k-means swap clustering method. To illustrate this versatile Hartigan
k-MLE technique, we consider the exponential family of Wishart distributions and
show how to learn their mixtures. First, given a set of symmetric positive definite
observation matrices, we provide an iterative algorithm to estimate the parameters
of the underlying Wishart distribution which is guaranteed to converge to the MLE.
Second, two initialization methods for k-MLE are proposed and compared. Finally,
we propose to use the Cauchy-Schwarz statistical divergence as a dissimilarity measure
between two Wishart mixture models and sketch a general methodology for
building a motion retrieval system.

Keywords Mixture modeling · Wishart · k-MLE · Bregman divergences · Motion retrieval

11.1 Introduction and Prior Work

Mixture models are a powerful and flexible tool to model an unknown probability
density function f(x) as a weighted sum of parametric density functions $p_j(x; \theta_j)$:

C. Saint-Jean (B)
Mathématiques, Image, Applications (MIA), Université de La Rochelle,
17000 La Rochelle, France
e-mail: [email protected]
F. Nielsen
Sony Computer Science Laboratories, Inc., 3-14-13 Higashi Gotanda,141-0022 Shinagawa-Ku,
Tokyo, Japan
F. Nielsen
Laboratoire d’Informatique (LIX), Ecole Polytechnique, Palaiseau Cedex, France

F. Nielsen (ed.), Geometric Theory of Information, 301


Signals and Communication Technology, DOI: 10.1007/978-3-319-05317-2_11,
© Springer International Publishing Switzerland 2014
302 C. Saint-Jean and F. Nielsen


$$f(x) = \sum_{j=1}^{K} w_j\, p_j(x; \theta_j), \quad \text{with } w_j > 0 \text{ and } \sum_{j=1}^{K} w_j = 1. \qquad (11.1)$$
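
Equation 11.1 can be evaluated directly once the component densities are fixed. The following minimal sketch assumes univariate Gaussian components purely for illustration (the chapter itself targets Wishart components); the names `gaussian_pdf` and `mixture_pdf` and the chosen parameters are hypothetical.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian with the given mean and standard deviation."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def mixture_pdf(x, weights, params):
    """Weighted sum of component densities, as in Eq. 11.1."""
    if any(w <= 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-12:
        raise ValueError("weights must be positive and sum to one")
    return sum(w * gaussian_pdf(x, m, s) for w, (m, s) in zip(weights, params))

weights = [0.25, 0.75]
params = [(-2.0, 1.0), (3.0, 0.5)]  # (mean, standard deviation) per component
density_at_zero = mixture_pdf(0.0, weights, params)
```

Because the weights are positive and sum to one, the mixture is itself a probability density, which can be checked numerically by integrating `mixture_pdf` over a wide interval.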

By far, the most common case is that of mixtures of Gaussians, for which the Expectation-
Maximization (EM) method has been used for decades to estimate the parameters $\{(w_j, \theta_j)\}_j$
from the maximum likelihood principle. Many extensions aim at overcoming its
slowness and lack of robustness [1]. Since the seminal work of Banerjee et al. [2], several
methods have been generalized to exponential families in connection with
Bregman divergences. In particular, Bregman soft clustering provides a unifying
and elegant framework for the EM algorithm with mixtures of exponential families.
In a recent work [3], the k-Maximum Likelihood Estimator (k-MLE) was
proposed as a fast alternative to EM for learning any exponential family mixture:
k-MLE relies on the bijection of exponential families with Bregman divergences to
transform the mixture learning problem into a geometric clustering problem. We
therefore refer the reader to the review paper [4] for an introduction to clustering.
This paper proposes several variations around the initial k-MLE algorithm with
a specific focus on mixtures of Wishart distributions [5]. Such a mixture can model complex
distributions over the set $S_{++}^d$ of d × d symmetric positive definite matrices. Data of
this kind arise naturally in some applications such as diffusion tensor imaging and radar
imaging, but also artificially as a signature for a multivariate dataset (a region of interest
in a multispectral image or a temporal sequence of measures for several sensors).
In the literature, the Wishart distribution is rarely used for modeling data but more
often in Bayesian approaches as a (conjugate) prior for the inverse covariance matrix
of a Gaussian vector. This explains why few works concern the estimation of the
parameters of a Wishart distribution from a set of matrices. To the best of our knowledge,
the only closely related work is that of Tsai [6] concerning MLE and restricted MLE
with ordering constraints. From the application viewpoint, one may cite polarimetric
SAR imaging [7] and bio-medical imaging [8]. Another example is a recent paper on
people tracking [9], which applies a Dirichlet process mixture model (infinite mixture
model) to the clustering of covariance matrices.
The paper is organized as follows: Sect. 11.2 recalls the definition of an exponential
family (EF), the principle of maximum likelihood estimation in EFs and
how it is connected with Bregman divergences. From these definitions, the complete
description of the k-MLE technique is derived by following the formalism of the
Expectation-Maximization algorithm in Sect. 11.3. In the same section, the Hartigan
approach for k-MLE is proposed and discussed, as well as how to initialize it properly.
Section 11.4 concerns the learning of a mixture of Wishart distributions with k-MLE. For this
purpose, an iterative procedure that converges to the MLE, when it exists, is proposed. In Sect. 11.5, we
describe an application scenario to motion retrieval before concluding in Sect. 11.6.

11.2 Preliminary Definitions and Notations

An exponential family is a set of probability distributions admitting the following
canonical decomposition:
11 Hartigan’s Method for k-MLE 303

$$p_F(x; \theta) = \exp\{\langle t(x), \theta\rangle + k(x) - F(\theta)\}$$

with t(x) the sufficient statistic, θ the natural parameter, k the carrier measure and
F the log-normalizer [10]. Most commonly used distributions, such as the Bernoulli,
Gaussian, Multinomial, Dirichlet, Poisson, Beta, Gamma and von Mises distributions, are
indeed exponential families (see the above reference for a complete list). Later on in the chapter,
a canonical decomposition of the Wishart distribution as an exponential family will
be detailed.
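
As a concrete instance of this decomposition, the Poisson distribution with rate λ can be written with t(x) = x, θ = log λ, F(θ) = exp(θ) and k(x) = −log(x!). The sketch below checks this standard identity numerically; the function names are illustrative and not taken from the chapter.

```python
import math

def poisson_pmf(x, lam):
    """Usual Poisson probability mass function."""
    return math.exp(-lam) * lam ** x / math.factorial(x)

def poisson_canonical(x, theta):
    """exp{<t(x), theta> + k(x) - F(theta)} for the Poisson family.

    t(x) = x, k(x) = -log(x!), F(theta) = exp(theta), theta = log(lambda).
    """
    t = x
    k = -math.lgamma(x + 1)  # lgamma(x + 1) equals log(x!)
    log_normalizer = math.exp(theta)
    return math.exp(t * theta + k - log_normalizer)

lam = 3.5
theta = math.log(lam)
pairs = [(poisson_pmf(x, lam), poisson_canonical(x, theta)) for x in range(10)]
```

Both columns of `pairs` agree up to floating-point error, which illustrates that the canonical form is only a reparameterization of the familiar density.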

11.2.1 Maximum Likelihood Estimator

The framework of exponential families gives a direct solution for finding the maximum
likelihood estimator $\hat{\theta}$ from a set of i.i.d. observations $\chi = \{x_1, \ldots, x_N\}$.
Denoting L the likelihood function

$$L(\theta; \chi) = \prod_{i=1}^{N} p_F(x_i; \theta) = \prod_{i=1}^{N} \exp\{\langle t(x_i), \theta\rangle + k(x_i) - F(\theta)\} \qquad (11.2)$$

and $\bar{l}$ the average log-likelihood function

$$\bar{l}(\theta; \chi) = \frac{1}{N} \sum_{i=1}^{N} \left(\langle t(x_i), \theta\rangle + k(x_i) - F(\theta)\right). \qquad (11.3)$$

It follows that the MLE $\hat{\theta} = \arg\max_{\theta \in \Theta} \bar{l}(\theta; \chi)$ satisfies

$$\nabla F(\hat{\theta}) = \frac{1}{N} \sum_{i=1}^{N} t(x_i). \qquad (11.4)$$

Recall that the reciprocal (inverse) function $(\nabla F)^{-1}$ of $\nabla F$ is also $\nabla F^*$ for $F^*$ the convex
conjugate of F [11]. It is a mapping from the expectation parameter space H to the
natural parameter space Θ. Thus, the MLE is obtained by applying $(\nabla F)^{-1}$ to the
average of sufficient statistics:

$$\hat{\theta} = (\nabla F)^{-1}\left(\frac{1}{N} \sum_{i=1}^{N} t(x_i)\right). \qquad (11.5)$$

Whereas determining $(\nabla F)^{-1}$ may be trivial for some univariate distributions like
Bernoulli, Poisson and Gaussian, the multivariate case is much more challenging and leads one to
consider approximate methods to solve this variational problem [12].

11.2.2 MLE and Bregman Divergence

In this part, the link between MLE and Kullback-Leibler (KL) divergence is recalled.
Banerjee et al. [2] interpret the log-density of a (regular) exponential family as a
(regular) Bregman divergence:

$$\log p_F(x; \theta) = -B_{F^*}(t(x) : \eta) + F^*(t(x)) + k(x), \qquad (11.6)$$

where $F^*$ is the convex conjugate (Legendre transform) of F. Skipping a formal
definition, a Bregman divergence for a strictly convex and differentiable function
$\varphi : \Omega \to \mathbb{R}$ is

$$B_\varphi(\omega_1 : \omega_2) = \varphi(\omega_1) - \varphi(\omega_2) - \langle \omega_1 - \omega_2, \nabla\varphi(\omega_2)\rangle. \qquad (11.7)$$

From a geometric viewpoint, $B_\varphi(\omega_1 : \omega_2)$ is the difference between the value of ϕ at
$\omega_1$ and its first-order Taylor expansion around $\omega_2$ evaluated at $\omega_1$. Since ϕ is strictly convex,
$B_\varphi$ is non-negative, zero if and only if (iff) $\omega_1 = \omega_2$, and not symmetric in general. The
expression of $F^*$ (and thus of $B_{F^*}$) follows from $(\nabla F)^{-1}$:

$$F^*(\eta) = \langle (\nabla F)^{-1}(\eta), \eta\rangle - F((\nabla F)^{-1}(\eta)). \qquad (11.8)$$

In Eq. 11.6, the term $B_{F^*}(t(x) : \eta)$ says how much the sufficient statistic t(x) on observation
x is dissimilar to $\eta \in H$.
The Kullback-Leibler divergence on two members of the same exponential family
is equivalent to the Bregman divergence of the associated log-normalizer on swapped
natural parameters [10]:

$$\mathrm{KL}(p_F(\cdot; \theta_1) \,\|\, p_F(\cdot; \theta_2)) = B_F(\theta_2 : \theta_1) = B_{F^*}(\eta_1 : \eta_2). \qquad (11.9)$$
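
Equations 11.7 and 11.9 can be checked numerically on a one-parameter family. The sketch below assumes the Poisson family, for which F(θ) = exp(θ), η = λ and F*(η) = η log η − η, and compares both Bregman divergences with the closed-form KL divergence between two Poisson distributions. The generic `bregman` helper is hypothetical and restricted to scalar arguments.

```python
import math

def bregman(phi, grad_phi, w1, w2):
    """Scalar Bregman divergence B_phi(w1 : w2) of Eq. 11.7."""
    return phi(w1) - phi(w2) - (w1 - w2) * grad_phi(w2)

# Poisson family: log-normalizer F and its convex conjugate F*.
def F(theta):
    return math.exp(theta)

def grad_F(theta):
    return math.exp(theta)

def F_star(eta):
    return eta * math.log(eta) - eta

def grad_F_star(eta):
    return math.log(eta)

def kl_poisson(lam1, lam2):
    """Closed-form KL divergence between Poisson(lam1) and Poisson(lam2)."""
    return lam1 * math.log(lam1 / lam2) - lam1 + lam2

lam1, lam2 = 2.0, 5.0
theta1, theta2 = math.log(lam1), math.log(lam2)

kl = kl_poisson(lam1, lam2)
b_natural = bregman(F, grad_F, theta2, theta1)           # B_F(theta2 : theta1)
b_expectation = bregman(F_star, grad_F_star, lam1, lam2)  # B_F*(eta1 : eta2)
```

The three quantities coincide up to rounding. Note the swapped order of the natural parameters in `b_natural`, as stated in Eq. 11.9.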

Let us remark that $B_F$ is always known in closed form using the canonical decomposition
of $p_F$, whereas $B_{F^*}$ requires the knowledge of $F^*$. Finding the maximizer
of the log-likelihood on Θ amounts to finding the minimizer $\hat{\eta}$ of

$$\sum_{i=1}^{N} B_{F^*}(t(x_i) : \eta) = \sum_{i=1}^{N} \mathrm{KL}\left(p_{F^*}(\cdot; t(x_i)) \,\|\, p_{F^*}(\cdot; \eta)\right)$$

on H, since the last two terms in Eq. (11.6) are constant with respect to η.

11.3 Learning Mixtures with k-MLE

This section presents how to fit a mixture of exponential families with k-MLE.
This algorithm requires an MLE (see previous section) for each component
distribution $p_{F_j}$ of the considered mixture. As it shares many properties with the EM
algorithm for mixtures, the latter is recalled first. The Lloyd and Hartigan heuristics
for k-MLE are then completely described. Also, two methods for the initialization of
k-MLE are proposed, depending on whether or not the component distributions are known.

11.3.1 EM Algorithm

Mixture modeling is a convenient framework to address the problem of clustering,
defined as the partitioning of a set of i.i.d. observations $\chi = \{x_i\}_{i=1,\ldots,N}$ into
“meaningful” groups with respect to some similarity. Consider a finite mixture model
of exponential families (see Eq. 11.1)

$$f(x) = \sum_{j=1}^{K} w_j\, p_{F_j}(x; \theta_j), \qquad (11.10)$$

where K is the number of components and $w_j$ are the mixture weights, which sum
up to unity. Finding the mixture parameters $\{(w_j, \theta_j)\}_j$ can again be addressed by
maximizing the log-likelihood of the mixture distribution

$$L(\{(w_j, \theta_j)\}_j; \chi) = \sum_{i=1}^{N} \log \sum_{j=1}^{K} w_j\, p_{F_j}(x_i; \theta_j). \qquad (11.11)$$

For K > 1, the sum of terms appearing inside the logarithm makes the optimization much
more difficult than that of Sect. 11.2.1 (K = 1). A classical solution, also well
suited to clustering, is to augment the model with hidden indicator vectors
$z_i$, where $z_{ij} = 1$ iff observation $x_i$ is generated by the jth component and 0
otherwise. The previous equation is now replaced by the complete log-likelihood of the
mixture distribution

$$L_c(\{(w_j, \theta_j)\}_j; \{(x_i, z_i)\}_i) = \sum_{i=1}^{N} \sum_{j=1}^{K} z_{ij} \log\left(w_j\, p_{F_j}(x_i; \theta_j)\right). \qquad (11.12)$$

This is typically the framework of the Expectation-Maximization (EM) algorithm
[13], which optimizes this function by repeating two steps:
1. Compute $Q(\{(w_j, \theta_j)\}_j, \{(w_j^{(t)}, \theta_j^{(t)})\}_j)$, the conditional expectation of $L_c$ w.r.t.
the observed data χ given an estimate $\{(w_j^{(t)}, \theta_j^{(t)})\}_j$ for the mixture parameters. This
step amounts to computing $\hat{z}_i^{(t)} = E_{\{(w_j^{(t)}, \theta_j^{(t)})\}_j}[z_i \mid x_i]$, the vector of responsibilities
for each component to have generated $x_i$:

$$\hat{z}_{ij}^{(t)} = \frac{w_j^{(t)}\, p_{F_j}(x_i; \theta_j^{(t)})}{\sum_{j'} w_{j'}^{(t)}\, p_{F_{j'}}(x_i; \theta_{j'}^{(t)})}. \qquad (11.13)$$

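2. Update the mixture parameters by maximizing Q (i.e. Eq. (11.12) where the hidden values
$z_{ij}$ are replaced by $\hat{z}_{ij}^{(t)}$):

$$\hat{w}_j^{(t+1)} = \frac{\sum_{i=1}^{N} \hat{z}_{ij}^{(t)}}{N}, \qquad \hat{\theta}_j^{(t+1)} = \arg\max_{\theta_j \in \Theta_j} \sum_{i=1}^{N} \hat{z}_{ij}^{(t)} \log\left(p_{F_j}(x_i; \theta_j)\right). \qquad (11.14)$$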

While ŵ(t+1)
j is always known in closed-form whatever F j are, θ̂(t+1)
j are obtained
by component-wise specific optimization involving all observations.
Many properties of this algorithm are known (e.g. maximization of Q implies
maximization of L, slow convergence to a local maximum, etc.). In a clustering
perspective, components are identified with clusters and the values $\hat z_{ij}$ are interpreted as soft
memberships of $x_i$ to cluster $C_j$. In order to get a strict partition after convergence,
each $x_i$ is assigned to the cluster $C_j$ iff $\hat z_{ij}$ is maximum over $\hat z_{i1}, \hat z_{i2}, \ldots, \hat z_{iK}$.
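The two EM steps above can be sketched as follows. This is only a minimal illustration under stated assumptions: the components are 1-D Gaussians with fixed unit variance (so the weighted MLE of Eq. (11.14) reduces to a weighted mean), the loop runs for a fixed number of iterations, and all function names are hypothetical.

```python
import math

def gauss_logpdf(x, mu):
    """Log-density of a unit-variance 1-D Gaussian (illustrative component family)."""
    return -0.5 * (x - mu) ** 2 - 0.5 * math.log(2.0 * math.pi)

def e_step(xs, weights, mus):
    """Responsibilities z_hat[i][j] of Eq. (11.13), via a log-sum-exp for stability."""
    resp = []
    for x in xs:
        logs = [math.log(w) + gauss_logpdf(x, mu) for w, mu in zip(weights, mus)]
        top = max(logs)
        unnorm = [math.exp(v - top) for v in logs]
        total = sum(unnorm)
        resp.append([u / total for u in unnorm])
    return resp

def m_step(xs, resp):
    """Eq. (11.14): weights in closed form; here the weighted MLE is a weighted mean."""
    n, k = len(xs), len(resp[0])
    mass = [sum(row[j] for row in resp) for j in range(k)]
    weights = [m / n for m in mass]
    mus = [sum(row[j] * x for row, x in zip(resp, xs)) / mass[j] for j in range(k)]
    return weights, mus

def log_likelihood(xs, weights, mus):
    """Mixture log likelihood of Eq. (11.11)."""
    return sum(
        math.log(sum(w * math.exp(gauss_logpdf(x, mu)) for w, mu in zip(weights, mus)))
        for x in xs
    )

def em(xs, weights, mus, n_iter=50):
    """Run a fixed number of EM iterations (a real stopping rule is left out)."""
    for _ in range(n_iter):
        weights, mus = m_step(xs, e_step(xs, weights, mus))
    return weights, mus
```

A strict partition is then obtained by taking, for each row of `e_step(xs, weights, mus)`, the index of its largest entry.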

11.3.2 k-MLE with Lloyd Method

A main reason for the slowness of EM is that all observations are taken into account
for the update of the parameters of each component, since $\hat z_{ij}^{(t)} \in [0, 1]$. A natural idea is
then to generate smaller sub-samples of $\chi$ from the $\hat z_{ij}^{(t)}$ in a deterministic manner.¹ The
simplest way to do this is to get a strict partition of $\chi$ with the MAP assignment:

$$\tilde z_{ij}^{(t)}=\begin{cases}1 & \text{if } \hat z_{ij}^{(t)}=\max_k \hat z_{ik}^{(t)},\\ 0 & \text{otherwise.}\end{cases}$$

When multiple maxima exist, the component with the smallest index is chosen. If this
classification step is inserted between the E-step and the M-step, the Classification EM (CEM)
algorithm [14] is retrieved. Moreover, for isotropic Gaussian components with fixed
unit variance, CEM is shown to be equivalent to the Lloyd K-means algorithm [4].
More recently, CEM was reformulated in a closely related way under the name k-MLE [3]
in the context of exponential families and Bregman divergences. In the remainder
of the paper, we will refer only to the latter. Replacing $z_{ij}^{(t)}$ by $\tilde z_{ij}^{(t)}$ in Eq. (11.12), the
criterion to be maximized in the M-step can be reformulated as

1 Otherwise, convergence to a pointwise estimate of the parameters would be replaced by
convergence in distribution of a Markov chain.

$$\tilde L_c(\{(w_j,\theta_j)\}_j;\{(x_i,\tilde z_i^{(t)})\}_i)=\sum_{j=1}^{K}\sum_{i=1}^{N}\tilde z_{ij}^{(t)}\log\big(w_j\,p_{F_j}(x_i;\theta_j)\big). \tag{11.15}$$

Following CEM terminology, this quantity is called the “classification maximum
likelihood”. Letting $C_j^{(t)} = \{x_i \in \chi \mid \tilde z_{ij}^{(t)} = 1\}$, this equation can be conveniently
rewritten as

$$\tilde L_c(\{(w_j,\theta_j)\}_j;\{C_j^{(t)}\}_j)=\sum_{x\in C_1^{(t)}}\log\big(w_1\,p_{F_1}(x;\theta_1)\big)+\cdots+\sum_{x\in C_K^{(t)}}\log\big(w_K\,p_{F_K}(x;\theta_K)\big). \tag{11.16}$$

Each term leads to a separate optimization to get the parameters of the corresponding
component:

$$\hat w_j^{(t+1)}=\frac{|C_j^{(t)}|}{N},\qquad \hat\theta_j^{(t+1)}=\arg\max_{\theta_j\in\Theta_j}\sum_{x\in C_j^{(t)}}\log p_{F_j}(x;\theta_j). \tag{11.17}$$

The last equation is nothing but the MLE equation for the j-th component restricted to
a subset of $\chi$. Algorithm 1 summarizes k-MLE with the Lloyd method given an initial
description of the mixture.

Algorithm 1: k-MLE (Lloyd method)
Input: A sample $\chi = \{x_1, x_2, ..., x_N\}$, initial mixture parameters $\{\hat w_j^{(0)}, \hat\theta_j^{(0)}\}_j$, $\{F_j\}_j$
       log-normalizers of exponential families
Output: Ending values for $\{\hat w_j^{(t)}, \hat\theta_j^{(t)}\}_j$ are estimates of mixture parameters,
        $C_j^{(t)}$ a partition of $\chi$
1  t = 0;
2  repeat
3    repeat
       // Partition χ in K disjoint subsets with MAP assignment
4      foreach $x_i \in \chi$ do $\bar z_i^{(t)} = \arg\max_j \log \hat w_j^{(t)} p_{F_j}(x_i; \hat\theta_j^{(t)})$;
5      $C_j^{(t)} = \{x_i \in \chi \mid \bar z_i^{(t)} = j\}$;
       // Update parameters $\{\theta_j\}_j$ with MLE ($\{w_j\}_j$ unchanged)
6      foreach $j \in 1, ..., K$ do $\hat\theta_j^{(t+1)} = \arg\max_{\theta_j\in\Theta_j} \sum_{x\in C_j^{(t)}} \log p_{F_j}(x; \theta_j)$;
7      t = t + 1;
8    until Convergence of the classification maximum likelihood (Eq. 11.16);
     // Update mixture weights $\{w_j\}_j$
9    foreach $j \in 1, ..., K$ do $\hat w_j^{(t+1)} = |C_j^{(t)}|/N$;
10 until Further convergence of the classification maximum likelihood (Eq. 11.16);

Contrary to the CEM algorithm, the k-MLE algorithm updates the mixture weights after the
convergence of $\tilde L_c$ (line 8) and not simultaneously with the component parameters.
Despite this difference, both algorithms can be proved to converge to a local
maximum of $\tilde L_c$ with the same kind of arguments (see [3, 14]). In practice, the local maxima
(and also the mixture parameters) are not necessarily equal for the two algorithms.
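As an illustration of the Lloyd-style scheme of Algorithm 1, here is a minimal sketch. It assumes 1-D unit-variance Gaussian components (for which the MLE on a cluster is its mean), keeps the parameters of an empty cluster unchanged, and uses a simple tolerance on the classification maximum likelihood as stopping rule; these choices and all names are illustrative assumptions rather than the authors' implementation.

```python
import math

def log_pdf(x, mu):
    """Log-density of a unit-variance 1-D Gaussian (illustrative family)."""
    return -0.5 * (x - mu) ** 2 - 0.5 * math.log(2.0 * math.pi)

def map_assign(xs, weights, mus):
    """MAP label of each observation; ties go to the smallest index."""
    labels = []
    for x in xs:
        scores = [math.log(w) + log_pdf(x, mu) if w > 0 else -math.inf
                  for w, mu in zip(weights, mus)]
        labels.append(scores.index(max(scores)))
    return labels

def classification_ll(xs, labels, weights, mus):
    """Classification maximum likelihood of Eq. (11.16)."""
    return sum(math.log(weights[j]) + log_pdf(x, mus[j]) for x, j in zip(xs, labels))

def k_mle_lloyd(xs, weights, mus, tol=1e-12, max_iter=100):
    weights, mus = list(weights), list(mus)
    k = len(mus)
    prev_outer = -math.inf
    for _ in range(max_iter):
        prev_inner = -math.inf
        for _ in range(max_iter):
            # Assignment step, then MLE per cluster with the weights kept fixed.
            labels = map_assign(xs, weights, mus)
            for j in range(k):
                members = [x for x, lab in zip(xs, labels) if lab == j]
                if members:  # an empty cluster keeps its previous parameter
                    mus[j] = sum(members) / len(members)
            cur = classification_ll(xs, labels, weights, mus)
            if cur - prev_inner <= tol:
                break
            prev_inner = cur
        # Weights are only updated once the inner loop has converged.
        weights = [labels.count(j) / len(xs) for j in range(k)]
        cur = classification_ll(xs, labels, weights, mus)
        if cur - prev_outer <= tol:
            break
        prev_outer = cur
    return weights, mus, labels
```

Note that a cluster that becomes empty ends up with weight zero and is never selected again, which is the "mild solution" behaviour rather than a guarantee against empty clusters.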

11.3.3 k-MLE with Hartigan Method

In this section, a different optimization of the classification maximum likelihood is
presented. A drawback of the previous methods is that they can produce empty clusters
without any means of control. This occurs especially when observations are in a
high-dimensional space. A mild solution is to discard empty clusters by setting their
weights to zero and their parameters to ∅. A better approach, detailed in the following,
is to rewrite k-MLE following the same principle as the Hartigan method for k-means
[15]. Moreover, this heuristic is preferred to Lloyd’s since it generally provides
better local maxima [16].
The Hartigan method is generally summarized by the sentence “Pick an observation,
say $x_c$ in cluster $C_c$, and optimally reassign it to another cluster.” Let us first consider
as “optimal” the assignment of $x_c$ to its most probable cluster, say $C_{j^*}$:

$$j^*=\arg\max_j \log \hat w_j^{(t)}\,p_{F_j}(x_c;\hat\theta_j^{(t)}),$$

where $\hat w_j^{(t)}$ and $\hat\theta_j^{(t)}$ denote the weight and the parameters of the j-th component at
some iteration. Then, parameters of the two components are updated with MLE:

$$\hat\theta_c^{(t+1)}=\arg\max_{\theta_c\in\Theta_c}\sum_{x\in C_c^{(t)}\setminus\{x_c\}}\log p_{F_c}(x;\theta_c) \tag{11.18}$$

$$\hat\theta_{j^*}^{(t+1)}=\arg\max_{\theta_{j^*}\in\Theta_{j^*}}\sum_{x\in C_{j^*}^{(t)}\cup\{x_c\}}\log p_{F_{j^*}}(x;\theta_{j^*}). \tag{11.19}$$

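The mixture weights $\hat w_c$ and $\hat w_{j^*}$ remain unchanged in this step (see line 9 of
Algorithm 1). Consequently, $\tilde L_c$ increases by $\Phi^{(t)}(x_c, C_c, C_{j^*})$ where $\Phi^{(t)}(x_c, C_c, C_j)$
is more generally defined as

$$\begin{aligned}
\Phi^{(t)}(x_c,C_c,C_j)={}&\sum_{x\in C_c^{(t)}\setminus\{x_c\}}\log p_{F_c}(x;\hat\theta_c^{(t+1)})-\sum_{x\in C_c^{(t)}\cup\{x_c\}}\log p_{F_c}(x;\hat\theta_c^{(t)})-\log\frac{\hat w_c}{\hat w_j}\\
&+\sum_{x\in C_j^{(t)}\cup\{x_c\}}\log p_{F_j}(x;\hat\theta_j^{(t+1)})-\sum_{x\in C_j^{(t)}\setminus\{x_c\}}\log p_{F_j}(x;\hat\theta_j^{(t)}).
\end{aligned} \tag{11.20}$$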

This procedure is nothing more than a partial assignment (C-step) in the Lloyd
method for k-MLE. This is an indirect way to reach our initial goal, which is the
maximization of $\tilde L_c$.
Following Telgarsky and Vattani [16], a better approach is to consider as “optimal”
the assignment to the cluster $C_j$ which maximizes $\Phi^{(t)}(x_c, C_c, C_j)$:

$$j^*=\arg\max_j \Phi^{(t)}(x_c,C_c,C_j). \tag{11.21}$$

Since $\Phi^{(t)}(x_c, C_c, C_c) = 0$, such an assignment satisfies $\Phi^{(t)}(x_c, C_c, C_{j^*}) \ge 0$ and
therefore never decreases $\tilde L_c$. As the optimization space is finite (partitions of $\{x_1, ..., x_N\}$),
this procedure converges to a local maximum of $\tilde L_c$. There is no guarantee that $C_{j^*}$
coincides with the MAP assignment for $x_c$.
For the k-means loss function, the Hartigan method avoids empty clusters since any
assignment to one of those empty clusters necessarily decreases it [16]. An analogous
property will now be studied for k-MLE through the formulation of $\tilde L_c$ with
η-coordinates:

$$\tilde L_c(\{(w_j,\eta_j)\}_j;\{C_j^{(t)}\}_j)=\sum_{j=1}^{K}\sum_{x\in C_j^{(t)}}\Big[F_j^*(\eta_j)+k_j(x)+\langle t_j(x)-\eta_j,\nabla F_j^*(\eta_j)\rangle+\log w_j\Big]. \tag{11.22}$$

Recalling that the MLE satisfies $\hat\eta_j^{(t)} = |C_j^{(t)}|^{-1}\sum_{x\in C_j^{(t)}} t_j(x)$, the dot product vanishes
when $\eta_j = \hat\eta_j^{(t)}$ and it follows after small calculations

$$\begin{aligned}
\Phi^{(t)}(x_c,C_c,C_j)={}&(|C_c^{(t)}|-1)\,F_c^*(\hat\eta_c^{(t+1)})-|C_c^{(t)}|\,F_c^*(\hat\eta_c^{(t)})\\
&+(|C_j^{(t)}|+1)\,F_j^*(\hat\eta_j^{(t+1)})-|C_j^{(t)}|\,F_j^*(\hat\eta_j^{(t)})\\
&+k_j(x_c)-k_c(x_c)-\log\frac{\hat w_c}{\hat w_j}.
\end{aligned} \tag{11.23}$$

As long as $F_j^*$ is known in closed form, this criterion is faster to compute than
Eq. (11.20) since the updates of the component parameters are immediate:

$$\hat\eta_c^{(t+1)}=\frac{|C_c^{(t)}|\,\hat\eta_c^{(t)}-t_c(x_c)}{|C_c^{(t)}|-1},\qquad \hat\eta_j^{(t+1)}=\frac{|C_j^{(t)}|\,\hat\eta_j^{(t)}+t_j(x_c)}{|C_j^{(t)}|+1}. \tag{11.24}$$

When $C_c^{(t)} = \{x_c\}$, there is no particular reason for $\Phi^{(t)}(x_c, \{x_c\}, C_j)$ to be always
negative. Simplifications occurring for k-means in the Euclidean case (e.g. $k_j(x_c) = 0$,
clusters have equal weight $w_j = K^{-1}$, etc.) do not exist in this more general case.
Thus, in order to avoid emptying a cluster, it is mandatory to reject every outgoing
transfer for a singleton cluster (cf. line 6).
Algorithm 2 details the k-MLE algorithm with the Hartigan method when the $F_j^*$ are
available. When only the $F_j$ are known, $\Phi^{(t)}(x_c, C_c, C_j)$ can be computed with Eq. (11.20).
In this case, the computation of the MLE $\hat\theta_j$ is much slower and is an issue for a singleton
cluster. Its existence and possible solutions will be discussed later for the Wishart
distribution. Further remarks on Algorithm 2:
(line 1) When all $F_j^* = F^*$ are identical, this partitioning can be understood as a
         geometric split in the expectation parameter space induced by the divergence
         $B_{F^*}$ and the additive weight $-\log w_j$ (weighted Bregman Voronoi diagram
         [17]).
(line 5) This permutation avoids the same ordering in each loop.
(line 7) A weaker choice may be made here: any cluster $C_j$ (for instance the first)
         which satisfies $\Phi^{(t)}(x_i, C_{\bar z_i}, C_j) > 0$ is a possible candidate still
         guaranteeing convergence of the algorithm. Among such clusters, it may also be
         advisable to select the $C_j$ with maximum $\hat z_{ij}^{(t)}$.
(line 13) Obviously, this criterion is equivalent to local convergence of $\tilde L_c$.
As said before, this algorithm is faster when the component parameters $\eta_j$ can be
updated in the expectation parameter space $H_j$. But the price to pay is the memory
needed to keep all sufficient statistics $t_j(x_i)$ for each observation $x_i$.
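The transfer criterion can be checked numerically on a small case. The sketch below assumes both clusters use the 1-D unit-variance Gaussian family, for which $t(x) = x$, $F^*(\eta) = \eta^2/2$ and $\nabla F^*(\eta) = \eta$; it evaluates Eq. (11.23) with the incremental updates of Eq. (11.24) and compares it with the direct definition of Eq. (11.20). Function names are hypothetical.

```python
import math

def f_star(eta):
    """Dual log-normalizer F*(eta) = eta^2 / 2 of the unit-variance Gaussian."""
    return 0.5 * eta * eta

def carrier(x):
    """Carrier term k(x) of the same family."""
    return -0.5 * x * x - 0.5 * math.log(2.0 * math.pi)

def log_pdf(x, eta):
    """Summand of Eq. (11.22) without the weight term; here grad F*(eta) = eta."""
    return f_star(eta) + carrier(x) + (x - eta) * eta

def transfer_gain(x, n_c, eta_c, w_c, n_j, eta_j, w_j):
    """Phi of Eq. (11.23) for moving x from cluster c (size n_c) to cluster j."""
    if n_c <= 1:
        raise ValueError("outgoing transfers from a singleton cluster are rejected")
    new_eta_c = (n_c * eta_c - x) / (n_c - 1)  # Eq. (11.24), source cluster
    new_eta_j = (n_j * eta_j + x) / (n_j + 1)  # Eq. (11.24), target cluster
    gain = ((n_c - 1) * f_star(new_eta_c) - n_c * f_star(eta_c)
            + (n_j + 1) * f_star(new_eta_j) - n_j * f_star(eta_j)
            # k_j(x) - k_c(x) vanishes because both clusters share one family
            - math.log(w_c / w_j))
    return gain, new_eta_c, new_eta_j

def transfer_gain_direct(x, cluster_c, w_c, cluster_j, w_j):
    """Same quantity from Eq. (11.20), recomputing each MLE from scratch."""
    def best_ll(points):
        eta = sum(points) / len(points)
        return sum(log_pdf(p, eta) for p in points)
    rest = list(cluster_c)
    rest.remove(x)
    return (best_ll(rest) - best_ll(cluster_c)
            + best_ll(cluster_j + [x]) - best_ll(cluster_j)
            - math.log(w_c / w_j))
```

The incremental version only needs cluster sizes and current expectation parameters, which is why the text stresses the benefit of a closed-form $F_j^*$.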

11.3.4 Initialization with DP-k-MLE++

To complete the description of k-MLE, there remains the problem of the initialization
of the algorithm: choice of the exponential family for each component, initial values
of $\{(\hat w_j^{(0)}, \eta_j^{(0)})\}_j$, number of components K. Ideally, a good initialization would be
fast, select automatically the number of components (unknown for most applications)
and provide initial mixture parameters not too far from a good local minimum of the
clustering criterion. The choice of model complexity (i.e. the number K of groups) is a
recurrent problem in clustering since a compromise has to be made between genericity
and goodness of fit. Since the likelihood increases with K, many criteria such as
BIC or NEC are based on the penalization of the likelihood by a function of the degrees
of freedom of the model. Other approaches include the MDL principle, Bayes factors
or simply a visual inspection of some plots (e.g. silhouette graph, dendrogram
for hierarchical clustering, Gram matrix, etc.). The reader interested in this topic
may refer to section M in the survey of Xu and Wunsch [18]. The proposed method,
inspired by the algorithms k-MLE++ [3] and DP-means [19], is described in
the following.
At the beginning of a clustering, there is no particular reason for favoring one
particular cluster among all others. Assuming uniform weighting for components,
$\tilde L_c$ simplifies to

Algorithm 2: k-MLE (Hartigan method)
Input: Sample $\chi = \{x_1, .., x_N\}$, initial mixture parameters $\{(\hat w_j^{(0)}, \hat\eta_j^{(0)})\}_{j=1,..,K}$, $\{(t_j, F_j^*)\}_j$
       sufficient statistics and dual log-normalizers of exponential families
Output: Ending values for $\{(\hat w_j^{(t)}, \hat\eta_j^{(t)})\}_j$ are estimates of mixture parameters, $C_j^{(t)}$ a
        partition of $\chi$
   // Partition χ in K disjoint subsets with MAP assignment
1  foreach $x_i \in \chi$ do $\bar z_i^{(0)} = \arg\min_j \big(B_{F_j^*}(t_j(x_i) : \hat\eta_j^{(0)}) - \log \hat w_j^{(0)}\big)$;
2  foreach $j \in 1, ..., K$ do $C_j^{(0)} = \{x_i \in \chi \mid \bar z_i^{(0)} = j\}$;
3  repeat
4    done_transfer = False;
5    Random permute $(x_1, \bar z_1^{(t)}), ..., (x_N, \bar z_N^{(t)})$;
6    foreach $x_i \in \chi$ such that $|C^{(t)}_{\bar z_i^{(t)}}| > 1$ do
       // Test optimal transfer for $x_i$ (see Eqs. 11.23 or 11.20)
7      $j^* = \arg\max_j \Phi^{(t)}(x_i, C_{\bar z_i^{(t)}}, C_j)$;
8      if $\Phi^{(t)}(x_i, C_{\bar z_i^{(t)}}, C_{j^*}) > 0$ then
         // Update clusters and membership of $x_i$
9        $C^{(t+1)}_{\bar z_i^{(t)}} = C^{(t)}_{\bar z_i^{(t)}} \setminus \{x_i\}$, $C^{(t+1)}_{j^*} = C^{(t)}_{j^*} \cup \{x_i\}$, $\bar z_i^{(t+1)} = j^*$;
         // Update only $\eta_{\bar z_i}$, $\eta_{j^*}$ with MLE ($\{w_j\}_j$ unchanged)
10       $\hat\eta^{(t+1)}_{\bar z_i^{(t)}} = \dfrac{|C^{(t)}_{\bar z_i^{(t)}}|\,\hat\eta^{(t)}_{\bar z_i^{(t)}} - t_{\bar z_i^{(t)}}(x_i)}{|C^{(t+1)}_{\bar z_i^{(t)}}|}$, $\hat\eta^{(t+1)}_{\bar z_i^{(t+1)}} = \dfrac{|C^{(t)}_{\bar z_i^{(t+1)}}|\,\hat\eta^{(t)}_{\bar z_i^{(t+1)}} + t_{\bar z_i^{(t+1)}}(x_i)}{|C^{(t+1)}_{\bar z_i^{(t+1)}}|}$;
         done_transfer = True; t = t + 1;
11   if done_transfer is True then
       // Update mixture weights $\{w_j\}_j$
12     foreach $j \in 1, ..., K$ do $\hat w_j^{(t)} = N^{-1} |C_j^{(t)}|$;
13 until done_transfer is False;

$$\mathring L_c(\{\theta_j\}_j;\{C_j^{(t)}\}_j)=\sum_{j=1}^{K}\sum_{x\in C_j^{(t)}}\log p_{F_j}(x;\theta_j) \tag{11.25}$$

or equivalently to

$$\mathring L_c(\{\eta_j\}_j;\{C_j^{(t)}\}_j)=\sum_{j=1}^{K}\sum_{x\in C_j^{(t)}}\Big[F_j^*(\eta_j)+k_j(x)+\langle t_j(x)-\eta_j,\nabla F_j^*(\eta_j)\rangle\Big]. \tag{11.26}$$

When all $F_j^* = F^*$ are identical and the partition $\{C_j^{(t)}\}_j$ corresponds to the MAP
assignment, $\mathring L_c$ is exactly the objective function $\breve L$ for the Bregman k-means [2].
Rewriting $\breve L$ as an equivalent criterion to be minimized, it follows

$$\breve L(\{\eta_j\}_j)=\sum_{i=1}^{N}\min_{j=1}^{K}B_{F^*}(t(x_i):\eta_j). \tag{11.27}$$

Bregman k-means++ [20, 21] provides initial centers $\{\eta_j^{(0)}\}_j$ which guarantee to find
a clustering that is O(log K)-competitive to the optimal Bregman k-means clustering.
The k-MLE++ algorithm amounts to using Bregman k-means++ on the dual
log-normalizer $F^*$ (see Algorithm 3).

Algorithm 3: k-MLE++
Input: A sample $\chi = \{x_1, ..., x_N\}$, $t$ the sufficient statistics and $F^*$ the dual log-normalizer
       of an exponential family, K the number of clusters
Output: Initial mixture parameters $\{(w_j^{(0)}, \eta_j^{(0)})\}_j$
1 $w_1^{(0)} = 1/K$;
2 Choose first seed $\eta_1^{(0)} = t(x_i)$ for i uniformly random in $\{1, 2, \ldots, N\}$;
3 for j = 2, ..., K do
4   $w_j^{(0)} = 1/K$;
    // Compute relative contributions to $\breve L(\{\eta_1, ..., \eta_{j-1}\})$
5   foreach $x_i \in \chi$ do $p_i = \dfrac{\min_{j'=1}^{j-1} B_{F^*}(t(x_i) : \eta_{j'})}{\sum_{i'=1}^{N} \min_{j'=1}^{j-1} B_{F^*}(t(x_{i'}) : \eta_{j'})}$;
6   Choose $\eta_j^{(0)} \in \{t(x_1), ..., t(x_N)\}$ with probability $p_i$;

When K is unknown, the same strategy can still be applied but a stopping criterion
has to be set. The probability $p_i$ in Algorithm 3 is the relative contribution of observation $x_i$
through $t(x_i)$ to $\breve L(\{\eta_1, ..., \eta_K\})$, where K is the number of already selected centers.
A high $p_i$ indicates that $x_i$ is relatively far from these centers, and thus is atypical of the
mixture $\{(w_1^{(0)}, \eta_1^{(0)}), ..., (w_K^{(0)}, \eta_K^{(0)})\}$ for $w_j^{(0)} = w^{(0)}$ an arbitrary constant. When
selecting a new center, $p_i$ necessarily decreases in the next iteration. A good covering
of $\chi$ is obtained when all $p_i$ are lower than some threshold $\lambda \in [0, 1]$. Algorithm 4
describes the resulting initialization, named DP-k-MLE++.
The higher the threshold λ, the lower the number of generated centers. In particular,
the value 1/N should be considered as a reasonable minimum setting for λ. For
λ = 1, the algorithm will simply return one center. Since $p_i = 0$ for already selected
centers, this method guarantees all centers to be distinct.
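The seeding loop of DP-k-MLE++ can be sketched as below. As an assumption for illustration, the sufficient statistics are positive scalars and the divergence is the one induced by $F^*(\eta) = \eta\log\eta - \eta$ (Poisson family); the function names and the optional `rng` argument are hypothetical.

```python
import math
import random

def bregman_poisson(a, b):
    """B_{F*}(a : b) for F*(eta) = eta*log(eta) - eta, with a, b > 0."""
    return a * math.log(a / b) - a + b

def dp_k_mle_pp(stats, threshold, divergence=bregman_poisson, rng=None):
    """Pick centers among `stats` until every relative contribution p_i <= threshold."""
    rng = rng or random.Random(0)
    centers = [rng.choice(stats)]  # first seed, uniformly at random
    while True:
        # Cost of each point w.r.t. its closest center (clipped against rounding).
        costs = [max(0.0, min(divergence(s, c) for c in centers)) for s in stats]
        total = sum(costs)
        if total == 0.0:  # every point coincides with a center
            break
        probs = [c / total for c in costs]
        if max(probs) <= threshold:
            break
        # Next seed drawn with probability p_i; selected points have p_i = 0.
        centers.append(rng.choices(stats, weights=probs, k=1)[0])
    weights = [1.0 / len(centers)] * len(centers)
    return weights, centers
```

With `threshold=1.0` a single center is returned, while very small thresholds keep adding centers until the sample is covered, which matches the behaviour described above.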

11.3.5 Initialization with DP-comp-k-MLE

Although k-MLE can be used with component-wise exponential families, the previous
initialization methods yield components of the same exponential family. The component
distribution may be chosen simultaneously with the center selection when additional

Algorithm 4: DP-k-MLE++
Input: A sample $\chi = \{x_1, ..., x_N\}$, $t$ the sufficient statistics and $F^*$ the dual log-normalizer
       of an exponential family, $\lambda \in [0, 1]$
Output: Initial mixture parameters $\{w_j^{(0)}, \eta_j^{(0)}\}_j$, K the number of clusters
1 Choose first seed $\eta_1^{(0)} = t(x_i)$ for i uniformly random in $\{1, 2, \ldots, N\}$;
2 K = 1;
3 repeat
    // Compute relative contributions to $\breve L(\{\eta_1, ..., \eta_K\})$
4   foreach $x_i \in \chi$ do $p_i = \dfrac{\min_{j=1}^{K} B_{F^*}(t(x_i) : \eta_j)}{\sum_{i'=1}^{N} \min_{j'=1}^{K} B_{F^*}(t(x_{i'}) : \eta_{j'})}$;
5   if $\exists\, p_i > \lambda$ then
6     K = K + 1;
      // Select next seed
7     Choose $\eta_K^{(0)} \in \{t(x_1), ..., t(x_N)\}$ with probability $p_i$;
8 until all $p_i \le \lambda$;
9 for j = 1, ..., K do $w_j^{(0)} = 1/K$;

knowledge $\xi_i$ about $x_i$ is available (see Sect. 11.5 for an example). Given such a choice
function H, Algorithm 5, called “DP-comp-k-MLE”, describes this new flexible
initialization method. DP-comp-k-MLE is clearly a generalization of DP-k-MLE++,
which is recovered when H always returns the same exponential family. However, in the
general case, it remains to be proved whether a DP-comp-k-MLE clustering is O(log K)-competitive
to the optimal k-MLE clustering (with equal weights). Pending this difficult theoretical
study, the suffix “++” is deliberately omitted from the name DP-comp-k-MLE.
To conclude this section, let us recall that all we need to know to use the
proposed algorithms is the MLE for the considered exponential family, whether it is
available in closed form or not. For many exponential families, all details (canonical
decomposition, $F$, $\nabla F$, $F^*$, $\nabla F^* = (\nabla F)^{-1}$) are already known [10]. The next
section focuses on the case of the Wishart distribution.

11.4 Learning Mixtures of Wishart with k-MLE

This section recalls the definition of Wishart distribution and proposes a maximum
likelihood estimator for its parameters. Some known facts such as the Kullback-
Leibler divergence between two Wishart densities are recalled. Its use with the above
algorithms is also discussed.

11.4.1 Wishart Distribution

The Wishart distribution [5] is the multidimensional version of the chi-square
distribution and it characterizes the empirical scatter matrix estimator for the multivariate
Gaussian distribution. Let $\mathbf{X}$ be an n-sample consisting of independent realizations of

Algorithm 5: DP-comp-k-MLE
Input: A sample $\chi = \{x_1, ..., x_N\}$ with extra knowledge $\xi = \{\xi_1, ..., \xi_N\}$, H a choice
       function of an exponential family, $\lambda \in [0, 1]$
Output: Initial mixture parameters $\{(w_j^{(0)}, \eta_j^{(0)})\}_j$, $\{(t_j, F_j^*)\}_j$ sufficient statistics and dual
        log-normalizers of exponential families, K the number of clusters
   // Select first seed and exponential family
1  for i uniformly random in $\{1, 2, \ldots, N\}$ do
2    Obtain $t_1$, $F_1^*$ from $H(x_i, \xi_i)$;
3    Select first seed $\eta_1^{(0)} = t_1(x_i)$;
4  K = 1;
5  repeat
6    foreach $x_i \in \chi$ do $p_i = \dfrac{\min_{j=1}^{K} B_{F_j^*}(t_j(x_i) : \eta_j)}{\sum_{i'=1}^{N} \min_{j'=1}^{K} B_{F_{j'}^*}(t_{j'}(x_{i'}) : \eta_{j'})}$;
7    if $\exists\, p_i > \lambda$ then
8      K = K + 1;
       // Select next seed and exponential family
9      for i with probability $p_i$ in $\{1, 2, \ldots, N\}$ do
10       Obtain $t_K$, $F_K^*$ from $H(x_i, \xi_i)$;
11       Select next seed $\eta_K^{(0)} = t_K(x_i)$;
12 until all $p_i \le \lambda$;
13 for j = 1, ..., K do $w_j^{(0)} = 1/K$;

a random Gaussian vector with d dimensions, zero mean and covariance matrix S.
Then the scatter matrix $X = {}^t\mathbf{X}\mathbf{X}$ follows a central Wishart distribution with scale matrix
S and n degrees of freedom, denoted by $X \sim W_d(n, S)$. Its density function is

$$W_d(X;n,S)=\frac{|X|^{\frac{n-d-1}{2}}\exp\{-\frac12\mathrm{tr}(S^{-1}X)\}}{2^{\frac{nd}{2}}\,|S|^{\frac n2}\,\Gamma_d\!\left(\frac n2\right)},$$

where for $y > 0$, $\Gamma_d(y) = \pi^{\frac{d(d-1)}{4}} \prod_{j=1}^{d} \Gamma\!\left(y - \frac{j-1}{2}\right)$ is the multivariate gamma
function. Let us remark immediately that this definition implies that n is constrained
to be strictly greater than d − 1.
The Wishart distribution is an exponential family since

$$W_d(X;\theta_n,\theta_S)=\exp\Big\{\langle\theta_n,\log|X|\rangle_{\mathbb R}+\langle\theta_S,-\tfrac12X\rangle_{HS}+k(X)-F(\theta_n,\theta_S)\Big\},$$

where $(\theta_n, \theta_S) = (\frac{n-d-1}{2}, S^{-1})$, $t(X) = (\log|X|, -\frac12 X)$, $\langle\cdot,\cdot\rangle_{HS}$ denotes the
Hilbert-Schmidt inner product and

$$F(\theta_n,\theta_S)=\Big(\theta_n+\frac{d+1}{2}\Big)\big(d\log(2)-\log|\theta_S|\big)+\log\Gamma_d\Big(\theta_n+\frac{d+1}{2}\Big). \tag{11.28}$$
Note that this decomposition is not unique (see another one in [22]). Refer to
Appendix A.1 for detailed calculations.
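To make the canonical decomposition concrete, the following sketch evaluates the Wishart log-density both in its usual $(n, S)$ form and through the natural parameters and the log-normalizer of Eq. (11.28). As a simplifying assumption it is restricted to d = 2, so that determinants and inverses can be written by hand with the standard library only; all helper names are hypothetical.

```python
import math

D = 2  # the helpers below are written for 2 x 2 matrices only

def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def inv2(m):
    det = det2(m)
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]

def trace_prod2(a, b):
    """tr(a b) for 2 x 2 matrices."""
    return sum(a[i][k] * b[k][i] for i in range(2) for k in range(2))

def multigammaln(y, d=D):
    """Log of the multivariate gamma function Gamma_d(y)."""
    return d * (d - 1) / 4.0 * math.log(math.pi) + sum(
        math.lgamma(y - j / 2.0) for j in range(d))

def to_natural(n, s, d=D):
    """(theta_n, theta_S) = ((n - d - 1)/2, S^-1)."""
    return (n - d - 1) / 2.0, inv2(s)

def log_normalizer(theta_n, theta_s, d=D):
    """F(theta_n, theta_S) of Eq. (11.28)."""
    a = theta_n + (d + 1) / 2.0
    return a * (d * math.log(2.0) - math.log(det2(theta_s))) + multigammaln(a, d)

def wishart_logpdf_natural(x, theta_n, theta_s):
    """<theta, t(X)> - F(theta), with t(X) = (log|X|, -X/2) and k(X) = 0."""
    return (theta_n * math.log(det2(x)) - 0.5 * trace_prod2(theta_s, x)
            - log_normalizer(theta_n, theta_s))

def wishart_logpdf(x, n, s, d=D):
    """Log of the density written with the source parameters (n, S)."""
    return ((n - d - 1) / 2.0 * math.log(det2(x)) - 0.5 * trace_prod2(inv2(s), x)
            - n * d / 2.0 * math.log(2.0) - n / 2.0 * math.log(det2(s))
            - multigammaln(n / 2.0, d))
```

Both functions should agree up to rounding, which gives a quick consistency check of the decomposition.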

11.4.2 MLE for Wishart Distribution

Let us recall (see Sect. 11.2.1) that the MLE is obtained by mapping $(\nabla F)^{-1}$ on the
average of sufficient statistics. Finding $(\nabla F)^{-1}$ amounts here to solving the following
system (see Eqs. 11.5 and 11.28):

$$\begin{cases}
d\log(2)-\log|\theta_S|+\Psi_d\Big(\theta_n+\frac{d+1}{2}\Big)=\eta_n,\\[4pt]
-\Big(\theta_n+\frac{d+1}{2}\Big)\theta_S^{-1}=\eta_S,
\end{cases} \tag{11.29}$$

with $\eta_n = E[\log|X|]$ and $\eta_S = E[-\frac12 X]$ the expectation parameters and $\Psi_d$ the
derivative of $\log\Gamma_d$. Unfortunately, the variables $\theta_n$ and $\theta_S$ are not separable so
that no closed-form solution is known. Instead, as pointed out in [23], it is possible
to adopt an iterative scheme that alternately yields the maximum likelihood estimate of one
parameter when the other parameter is fixed. This is equivalent to considering two sub-families
$W_{d,n}$ and $W_{d,S}$ of the Wishart distribution $W_d$ which are also exponential families.
For the sake of simplicity, the natural parameterizations and sufficient statistics of the
decomposition in the general case are kept (see Appendices A.2 and A.3 for more
details).
Distribution $W_{d,n}$ ($n = 2\theta_n + d + 1$): $k_n(X) = \frac{n-d-1}{2}\log|X|$ and

$$F_n(\theta_S)=\frac{nd}{2}\log(2)-\frac n2\log|\theta_S|+\log\Gamma_d\Big(\frac n2\Big). \tag{11.30}$$

Using classical results for matrix derivatives, Eq. (11.5) can be easily solved:

$$-\frac n2\,\hat\theta_S^{-1}=-\frac1N\sum_{i=1}^{N}\frac12X_i\;\Longrightarrow\;\hat\theta_S=N\,n\Big(\sum_{i=1}^{N}X_i\Big)^{-1}. \tag{11.31}$$

Distribution $W_{d,S}$ ($S = \theta_S^{-1}$): $k_S(X) = -\frac12\mathrm{tr}(S^{-1}X)$ and

$$F_S(\theta_n)=\Big(\theta_n+\frac{d+1}{2}\Big)\log|2S|+\log\Gamma_d\Big(\theta_n+\frac{d+1}{2}\Big) \tag{11.32}$$

Again, Eq. (11.5) can be numerically solved:

$$\hat\theta_n=\Psi_d^{-1}\Big(\frac1N\sum_{i=1}^{N}\log|X_i|-\log|2S|\Big)-\frac{d+1}{2},\qquad \hat\theta_n>-1 \tag{11.33}$$

with $\Psi_d^{-1}$ the inverse function of $\Psi_d$. The latter can be computed with any
optimization method on a bounded domain (e.g. Brent method [24]). Let us mention
that the notation is simplified here since $\hat\theta_S$ and $\hat\theta_n$ should have been indexed by their
corresponding family. Algorithm 6 summarizes the estimate $\hat\theta$ for the parameters of
the Wishart distribution. By precomputing $N\big(\sum_{i=1}^{N} X_i\big)^{-1}$ and $N^{-1}\sum_{i=1}^{N}\log|X_i|$,
much computation time can be saved. The computation of $\Psi_d^{-1}$ remains an
expensive part of the algorithm.
Let us now prove the convergence and the consistency of this method. Maximizing
$\bar l$ amounts equivalently to minimizing $E(\theta) = F(\theta) - \langle \frac1N\sum_{i=1}^{N} t(X_i), \theta\rangle$. The
following properties are satisfied by E:
• The Hessian $\nabla^2 E = \nabla^2 F$ of E is positive definite on Θ since F is convex.
• Its unique minimizer on Θ is the MLE $\hat\theta = \nabla F^*\big(\frac1N\sum_{i=1}^{N} t(X_i)\big)$ whenever it
  exists (although $F^*$ is not known for Wishart, and F is not separable).

Algorithm 6: MLE for the parameters of a Wishart distribution
Input: A sample $\chi = \{X_1, X_2, \ldots, X_N\}$ of $S_{++}^d$ with N > 1
Output: Estimate $\hat\theta$ is the terminal values of MLE sequences $\{\hat\theta_n^{(t)}\}$ and $\{\hat\theta_S^{(t)}\}$
  // Initialization of the $\{\hat\theta_n^{(t)}\}$ sequence
1 $\hat\theta_n^{(0)} = 1$; t = 0;
2 repeat
    // Compute MLE in $W_{d,n}$ using Eq. 11.31
    $\hat\theta_S^{(t+1)} = N\,n\big(\sum_{i=1}^{N} X_i\big)^{-1}$ with $n = 2\hat\theta_n^{(t)} + d + 1$
    // Compute MLE in $W_{d,S}$ using Eq. 11.33
    $\hat\theta_n^{(t+1)} = \Psi_d^{-1}\big(\frac1N\sum_{i=1}^{N}\log|X_i| - \log|2S|\big) - \frac{d+1}{2}$ with $S = (\hat\theta_S^{(t+1)})^{-1}$
    t = t + 1;
3 until convergence of the likelihood;

Therefore, Algorithm 6 is an instance of the group coordinate descent algorithm of
Bezdek et al. (Theorem 2.2 in [25]) for $\theta = (\theta_n, \theta_S)$:

$$\hat\theta_S^{(t+1)}=\arg\min_{\theta_S}E(\hat\theta_n^{(t)},\theta_S) \tag{11.34}$$

$$\hat\theta_n^{(t+1)}=\arg\min_{\theta_n}E(\theta_n,\hat\theta_S^{(t+1)}) \tag{11.35}$$

The resulting sequences $\{\hat\theta_n^{(t)}\}$ and $\{\hat\theta_S^{(t)}\}$ are shown to converge linearly to the coordinates
of $\hat\theta$.
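A minimal sketch of this alternating scheme is given below, under several assumptions that are not in the text: it is restricted to 2 x 2 matrices, the digamma function is approximated by a recurrence plus an asymptotic series, $\Psi_d^{-1}$ is computed by plain bisection instead of Brent's method, and the loop stops on the change of $\hat\theta_n$ rather than on the likelihood. All names are hypothetical.

```python
import math

def digamma(x):
    """Digamma for x > 0: upward recurrence, then an asymptotic series."""
    acc = 0.0
    while x < 6.0:
        acc -= 1.0 / x
        x += 1.0
    inv_sq = 1.0 / (x * x)
    return acc + math.log(x) - 0.5 / x - inv_sq * (
        1.0 / 12 - inv_sq * (1.0 / 120 - inv_sq / 252))

def multi_digamma(y, d=2):
    """Psi_d(y), the derivative of log Gamma_d."""
    return sum(digamma(y - j / 2.0) for j in range(d))

def multi_digamma_inv(target, d=2):
    """Solve Psi_d(y) = target by bisection on y > (d - 1)/2."""
    lo = (d - 1) / 2.0 + 1e-12
    hi = lo + 1.0
    while multi_digamma(hi, d) < target:
        hi = lo + 2.0 * (hi - lo)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if multi_digamma(mid, d) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def wishart_mle(samples, d=2, tol=1e-10, max_iter=10000):
    """Alternate the two sub-family MLEs; returns (n, S) for 2 x 2 samples."""
    n_obs = len(samples)
    # Precomputed statistics: mean matrix and mean log-determinant.
    mean = [[sum(x[r][c] for x in samples) / n_obs for c in range(2)] for r in range(2)]
    mean_logdet = sum(math.log(det2(x)) for x in samples) / n_obs
    theta_n = 1.0
    for _ in range(max_iter):
        n = 2.0 * theta_n + d + 1.0
        scale = [[v / n for v in row] for row in mean]  # S = theta_S^-1, Eq. (11.31)
        log_det_2s = d * math.log(2.0) + math.log(det2(scale))
        new_theta_n = multi_digamma_inv(mean_logdet - log_det_2s, d) - (d + 1) / 2.0
        converged = abs(new_theta_n - theta_n) < tol  # Eq. (11.33) reached a fixed point
        theta_n = new_theta_n
        if converged:
            break
    n = 2.0 * theta_n + d + 1.0
    return n, [[v / n for v in row] for row in mean]
```

At a fixed point both equations of the system (11.29) hold for the empirical averages, which is what the residual check in a test would verify.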
By looking carefully at the previous algorithms, let us remark that the initialization
methods require being able to compute the divergence $B_{F^*}$ between two elements $\eta_1$ and
$\eta_2$ in the expectation space H. Whereas $F^*$ is known for $W_{d,n}$ and $W_{d,S}$, Eq. (11.9)
gives a potential solution for $W_d$ by considering $B_F$ on the natural parameters $\theta_2$ and $\theta_1$
in Θ. Searching for the correspondence $H \leftrightarrow \Theta$ is analogous to computing the MLE for
a single observation...
The previous MLE procedure does not converge with a single observation $X_1$.
Bogdan and Bogdan [26] proved that the MLE exists and is unique in an exponential
family iff the affine envelope of the N points $t(X_1), ..., t(X_N)$ is of dimension D, the
order of this exponential family. Since the affine envelope of $t(X_1)$ is of dimension
d × d (instead of D = d × d + 1), the MLE does not exist and the likelihood function
goes to infinity.² Unboundedness of the likelihood function is a well-known problem that
can be tackled by adding a penalty term to it [27]. A simpler solution is to take the
MLE in the family $W_{d,n}$ for some n (known or arbitrarily fixed above d − 1) instead
of $W_d$.

11.4.3 Divergences for Wishart Distributions

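For two Wishart distributions $W_d^1 = W_d(X; n_1, S_1)$ and $W_d^2 = W_d(X; n_2, S_2)$, the
KL divergence is known [22] (even if $F^*$ is unknown):

$$\mathrm{KL}(W_d^1\|W_d^2)=-\log\left(\frac{\Gamma_d\!\left(\frac{n_1}{2}\right)}{\Gamma_d\!\left(\frac{n_2}{2}\right)}\right)+\frac{n_1-n_2}{2}\,\Psi_d\Big(\frac{n_1}{2}\Big)+\frac{n_1}{2}\left(-\log\frac{|S_1|}{|S_2|}+\mathrm{tr}(S_2^{-1}S_1)-d\right) \tag{11.36}$$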

Looking at the KL divergences of the two Wishart sub-families $W_{d,n}$ and $W_{d,S}$ gives
an interesting perspective on this formula. Applying Eqs. 11.9 and 11.8, it follows

$$\mathrm{KL}(W_{d,n}^1\|W_{d,n}^2)=\frac n2\left(-\log\frac{|S_1|}{|S_2|}+\mathrm{tr}(S_2^{-1}S_1)-d\right) \tag{11.37}$$

$$\mathrm{KL}(W_{d,S}^1\|W_{d,S}^2)=-\log\left(\frac{\Gamma_d\!\left(\frac{n_1}{2}\right)}{\Gamma_d\!\left(\frac{n_2}{2}\right)}\right)+\frac{n_1-n_2}{2}\,\Psi_d\Big(\frac{n_1}{2}\Big) \tag{11.38}$$

Detailed calculations can be found in the Appendix. Notice that $\mathrm{KL}(W_d^1\|W_d^2)$ is
simply the sum of these two divergences

2 The product $\hat\theta_n^{(t)}\hat\theta_S^{(t)}$ is constant through iterations.

Fig. 11.1 20 random matrices from $W_d(.; n, S)$ for n = 5 (left), n = 50 (right)

$$\mathrm{KL}(W_d^1\|W_d^2)=\mathrm{KL}(W_{d,S_1}^1\|W_{d,S_1}^2)+\mathrm{KL}(W_{d,n_1}^1\|W_{d,n_1}^2) \tag{11.39}$$

and that $\mathrm{KL}(W_{d,S}^1\|W_{d,S}^2)$ does not depend on S.
The divergence $\mathrm{KL}(W_{d,n}^1\|W_{d,n}^2)$, commonly used as a dissimilarity measure between
covariance matrices, is sometimes referred to as the log-Det divergence due to the form
of $\varphi(S) = F_n(S) \propto \log|S|$ (see Eq. 11.30). However, the dependency on the term n
should be neglected only when the two empirical covariance matrices come from
samples of the same size. In this case, the log-Det divergence between two covariance
matrices is the KL divergence in the sub-family $W_{d,n}$.
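The two sub-family divergences of Eqs. (11.37) and (11.38) are easy to evaluate. The sketch below does so for d = 2 with the standard library only; the digamma approximation (recurrence plus asymptotic series) and every function name are assumptions made for illustration.

```python
import math

def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def inv2(m):
    det = det2(m)
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]

def trace_prod2(a, b):
    """tr(a b) for 2 x 2 matrices."""
    return sum(a[i][k] * b[k][i] for i in range(2) for k in range(2))

def multigammaln(y, d=2):
    return d * (d - 1) / 4.0 * math.log(math.pi) + sum(
        math.lgamma(y - j / 2.0) for j in range(d))

def digamma(x):
    """Digamma for x > 0: upward recurrence, then an asymptotic series."""
    acc = 0.0
    while x < 6.0:
        acc -= 1.0 / x
        x += 1.0
    inv_sq = 1.0 / (x * x)
    return acc + math.log(x) - 0.5 / x - inv_sq * (
        1.0 / 12 - inv_sq * (1.0 / 120 - inv_sq / 252))

def multi_digamma(y, d=2):
    return sum(digamma(y - j / 2.0) for j in range(d))

def kl_fixed_n(n, s1, s2, d=2):
    """Eq. (11.37): same n, different scale matrices (log-Det form scaled by n/2)."""
    ratio = det2(s1) / det2(s2)
    return n / 2.0 * (-math.log(ratio) + trace_prod2(inv2(s2), s1) - d)

def kl_fixed_s(n1, n2, d=2):
    """Eq. (11.38): same S, different degrees of freedom; S does not appear."""
    return (-(multigammaln(n1 / 2.0, d) - multigammaln(n2 / 2.0, d))
            + (n1 - n2) / 2.0 * multi_digamma(n1 / 2.0, d))
```

The factor n/2 in `kl_fixed_n` is the dependency on the sample size mentioned above: it can be dropped only when both matrices come from samples of the same size.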

11.4.4 Toy Examples

In this part, some simple simulations are given for d = 2. Since the observations are
positive semi-definite matrices, it is possible to visualize them with ellipses
parametrized by their eigendecompositions. For example, Fig. 11.1 shows 20 matrices
generated from $W_d(.; n, S)$ for n = 5 and for n = 50, with S having eigenvalues
{2, 1}. This visualization highlights the difficulty of estimating the parameters
(even for small d) when n is small.
Then, a dataset of 60 matrices is generated from a three-component mixture
with parameters $W_d(.; 10, S_1)$, $W_d(.; 20, S_2)$, $W_d(.; 30, S_3)$ and equal weights
$w_1 = w_2 = w_3 = 1/3$. The respective eigenvalues of $S_1$, $S_2$, $S_3$ are
{2, 1}, {2, 0.5}, {1, 1}. Figure 11.2 illustrates this dataset. To study the influence of
a good initialization for k-MLE, the Normalized Mutual Information (NMI) [28]
is computed between the final partition and the ground-truth partition for different
initializations. This value between 0 and 1 is higher when the two partitions are more
similar. The following table gives the average and standard deviation of the NMI over 30 runs:

Method                 NMI
Rand. Init/Lloyd       0.229 ± 0.279
Rand. Init/Hartigan    0.243 ± 0.276
k-MLE++/Hartigan       0.67 ± 0.083
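For reference, an NMI score between two partitions can be computed as sketched below. The normalization by the geometric mean of the two entropies is one common convention and is an assumption here, since the text does not state which variant of [28] was used; the function name is hypothetical.

```python
import math
from collections import Counter

def nmi(labels_a, labels_b):
    """Mutual information normalized by sqrt(H(a) * H(b)); value in [0, 1]."""
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mutual = sum(c / n * math.log(c * n / (count_a[a] * count_b[b]))
                 for (a, b), c in joint.items())
    h_a = -sum(c / n * math.log(c / n) for c in count_a.values())
    h_b = -sum(c / n * math.log(c / n) for c in count_b.values())
    if h_a == 0.0 or h_b == 0.0:
        # Degenerate case: at least one partition has a single cluster.
        return 1.0 if h_a == h_b else 0.0
    return mutual / math.sqrt(h_a * h_b)
```

The score is invariant to a relabeling of the clusters, which is what makes it suitable for comparing a learned partition with the ground truth.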

Fig. 11.2 Dataset: 60 matrices generated from a three-component mixture of $W_2$

Fig. 11.3 DP-k-MLE++: Influence of λ on K and on the average log-likelihood

From this small experiment, we can easily verify the importance of a good
initialization. Also, the partitions having the highest NMI are reported in Fig. 11.4 for each
method. Let us mention that the Hartigan method almost always gives a better partition
than Lloyd’s for the same initial mixture.
A last simulation indicates that the initialization with DP-k-MLE++ is very
sensitive to its parameter λ. Again with the same set of matrices, Fig. 11.3 shows how
the number of generated clusters K and the average log-likelihood evolve with λ.
Not surprisingly, both quantities decrease when λ increases.

Fig. 11.4 Best partitions with Rand. Init/Lloyd (left), Rand. Init/Hartigan (middle),
k-MLE++/Hartigan (right)

11.5 Application to Motion Retrieval

In this section, a potential application to motion retrieval is proposed following our
previous work [23]. A raw motion-captured movement can be identified with an $n_i \times d$
matrix $\mathbf{X}_i$ where each row corresponds to the captured locations of a set of sensors.

11.5.1 Movement Representation

When the aim is to provide a taxonomy of a set of movements, it is difficult to
compare varying-size matrices. The cross-product matrix $X_i = {}^t\mathbf{X}_i\mathbf{X}_i$ is a possible
descriptor³ of $\mathbf{X}_i$. Denoting by N the number of movements, the set $\{X_1, ..., X_N\}$ of d × d
matrices is exactly the input of k-MLE. Note that d can easily be large
when the number of sensors is large.
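The descriptor can be computed as in the following sketch, which column-centers the $n_i \times d$ motion matrix (as in the footnote on translation invariance) before forming the d × d cross-product matrix. Matrices are plain nested lists and the function name is hypothetical.

```python
def scatter_descriptor(motion):
    """Column-center an n x d motion matrix M and return the d x d matrix t(M) M."""
    n, d = len(motion), len(motion[0])
    means = [sum(row[c] for row in motion) / n for c in range(d)]
    centered = [[row[c] - means[c] for c in range(d)] for row in motion]
    return [[sum(r[a] * r[b] for r in centered) for b in range(d)] for a in range(d)]
```

Movements of different lengths are thus mapped to matrices of the same size, and shifting every captured location by a constant offset leaves the descriptor unchanged.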
The simplest way to initialize k-MLE in this setting is to apply DP-k-MLE++
for $W_d$. But when the $n_i$ are known, it is better not to estimate them. In this case,
DP-comp-k-MLE is appropriate for a function H selecting $W_{d,n_i}$ given $\xi_i = n_i$. When
the learning algorithm is fast enough, it is common practice to restart it for different
initializations and to keep the best output (mixture parameters).
To enrich the description of a single movement, it is possible to define a mixture
m i per movement Xi . For example, several subsets of successive observations with
different sizes can be extracted and their cross-product matrices used as inputs for
k-MLE (and DP-comp-k-MLE). Mixture m i can be viewed as a sparse representation
of local dynamics of Xi through their local second-order moments.
While these two representations are of different kind, it is possible to encompass
both in a common framework for Xi described by a mixture of a single component
{(wi,1 = 1, ηi,1 = t (X i ))}. Algorithm k-MLE applied on such input for all move-
ments (i.e. {t (X i )}i ) provides then another set of mixture parameters {(ŵ j , η̂ j )} j .
Note that the general treatment of arbitrary mixtures of mixtures of Wishart is not
claimed to be addressed here.

3 For translation invariance, the $\mathbf{X}_i$ are column-centered beforehand.



11.5.2 Querying with Cauchy-Schwarz Divergence

Let us consider a movement $\mathbf{X}$ (an n × d matrix) and its mixture representation m.
Without loss of generality, let us denote by $\{(w_j, \theta_j)\}_{j=1..K}$ the mixture parameters of
m. The problem of comparing two movements amounts to computing an appropriate
dissimilarity between m and another mixture m' of the same kind with parameters
$\{(w'_j, \theta'_j)\}_{j=1..K'}$.
When both mixtures have a single component (K = K' = 1), an immediate solution
is to consider the Kullback-Leibler divergence KL(m : m') for two members of
the same exponential family. Since it is the Bregman divergence on the swapped
natural parameters $B_F(\theta' : \theta)$, a closed form is always available from Eq. (11.7). It
is important to mention that this formula holds for θ and θ' viewed as parameters for
$W_d$ even if they are estimated in sub-families $W_{d,n}$ and $W_{d,n'}$.
For general mixtures of the same exponential family (K > 1 or K' > 1), the KL
divergence no longer admits a closed form and has to be approximated with numerical
methods. Recently, other divergences such as the Cauchy-Schwarz divergence (CS)
[29] were shown to be available in closed form:

$$\mathrm{CS}(m:m')=-\log\frac{\int m(x)\,m'(x)\,dx}{\sqrt{\int m(x)^2dx\int m'(x)^2dx}}. \tag{11.40}$$

Within the same exponential family $p_F$, the integral of the product of mixtures is

$$\int m(x)\,m'(x)\,dx=\sum_{j=1}^{K}\sum_{j'=1}^{K'}w_j\,w'_{j'}\int p_F(x;\theta_j)\,p_F(x;\theta'_{j'})\,dx. \tag{11.41}$$

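When the carrier measure k(X) = 0, as it is for $W_d$ but not for $W_{d,n}$ and $W_{d,S}$, the
integral can be further expanded as

$$\begin{aligned}
\int p_F(X;\theta_j)\,p_F(X;\theta'_{j'})\,dX&=\int e^{\langle\theta_j,t(X)\rangle-F(\theta_j)}\,e^{\langle\theta'_{j'},t(X)\rangle-F(\theta'_{j'})}\,dX\\
&=\int e^{\langle\theta_j+\theta'_{j'},t(X)\rangle-F(\theta_j)-F(\theta'_{j'})}\,dX\\
&=e^{F(\theta_j+\theta'_{j'})-F(\theta_j)-F(\theta'_{j'})}\underbrace{\int e^{\langle\theta_j+\theta'_{j'},t(X)\rangle-F(\theta_j+\theta'_{j'})}\,dX}_{=1}.
\end{aligned}$$

Note that $\theta_j + \theta'_{j'}$ must be in the natural parameter space Θ to ensure that $F(\theta_j + \theta'_{j'})$
is finite. A sufficient condition for this to hold for every pair is that Θ is a convex cone.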
When $p_F = W_d$, the space $\Theta = ]-1; +\infty[ \times S_{++}^p$ is not a convex cone since
$\theta_{n_j} + \theta'_{n_{j'}}$ can be smaller than −1 when $n_j$ and $n'_{j'}$ are smaller than d + 1. Practically,
this constraint is tested for each pair of parameters before going on with the computation of the CS
divergence. A possible fix, not developed here, would be to constrain n to be greater
than d + 1 (or equivalently $\theta_n > 0$). Such a constraint amounts to taking a convex
subset $]0; +\infty[ \times S_{++}^p$ of Θ. Denoting $\Delta(\theta_j, \theta'_{j'}) = F(\theta_j + \theta'_{j'}) - F(\theta_j) - F(\theta'_{j'})$,
the CS divergence is also

$$\begin{aligned}
\mathrm{CS}(m:m')={}&\frac12\log\Big(\sum_{j=1}^{K}\sum_{j'=1}^{K}w_j\,w_{j'}\,e^{\Delta(\theta_j,\theta_{j'})}\Big)&&\text{(within } m\text{)}\\
&+\frac12\log\Big(\sum_{j=1}^{K'}\sum_{j'=1}^{K'}w'_j\,w'_{j'}\,e^{\Delta(\theta'_j,\theta'_{j'})}\Big)&&\text{(within } m'\text{)}\\
&-\log\Big(\sum_{j=1}^{K}\sum_{j'=1}^{K'}w_j\,w'_{j'}\,e^{\Delta(\theta_j,\theta'_{j'})}\Big)&&\text{(between } m \text{ and } m'\text{)}
\end{aligned} \tag{11.42}$$

Note that CS divergence is symmetric since Δ(θ j , θ→j → ) is. A numeric value of
Δ(θ j , θ→j → ) can be computed for Wd from Eq. 11.28 (see Eq. 11.45 or 11.46 in the
Appendix).
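Equation (11.42) translates almost directly into code. The sketch below is restricted to d = 2 and represents a mixture as a list of `(weight, (theta_n, theta_S))` pairs; it uses the log-normalizer of Eq. (11.28), rejects pairs whose summed natural parameter leaves Θ, and evaluates the double sums with a log-sum-exp. Names and data layout are assumptions for illustration.

```python
import math

def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def inv2(m):
    det = det2(m)
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]

def multigammaln(y, d=2):
    return d * (d - 1) / 4.0 * math.log(math.pi) + sum(
        math.lgamma(y - j / 2.0) for j in range(d))

def to_natural(n, s, d=2):
    """(theta_n, theta_S) = ((n - d - 1)/2, S^-1)."""
    return (n - d - 1) / 2.0, inv2(s)

def log_normalizer(theta, d=2):
    """F(theta_n, theta_S) of Eq. (11.28) for 2 x 2 matrices."""
    theta_n, theta_s = theta
    a = theta_n + (d + 1) / 2.0
    return a * (d * math.log(2.0) - math.log(det2(theta_s))) + multigammaln(a, d)

def delta(t1, t2):
    """Delta(theta, theta') = F(theta + theta') - F(theta) - F(theta')."""
    theta_n = t1[0] + t2[0]
    if theta_n <= -1.0:
        raise ValueError("theta + theta' falls outside the natural parameter space")
    theta_s = [[t1[1][r][c] + t2[1][r][c] for c in range(2)] for r in range(2)]
    return log_normalizer((theta_n, theta_s)) - log_normalizer(t1) - log_normalizer(t2)

def log_cross(m1, m2):
    """log of sum_j sum_j' w_j w'_j' exp(Delta(theta_j, theta'_j'))."""
    terms = [math.log(w1) + math.log(w2) + delta(t1, t2)
             for w1, t1 in m1 for w2, t2 in m2]
    top = max(terms)
    return top + math.log(sum(math.exp(v - top) for v in terms))

def cs_divergence(m1, m2):
    """Cauchy-Schwarz divergence of Eq. (11.42)."""
    return 0.5 * log_cross(m1, m1) + 0.5 * log_cross(m2, m2) - log_cross(m1, m2)
```

Working in the log domain avoids overflow when the values of Δ are large, which can happen for high degrees of freedom.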

11.5.3 Summary of Proposed Motion Retrieval System

To conclude this section, let us recall the elements of our proposal for a motion
retrieval system. A movement is represented by a Wishart mixture model learned by
k-MLE initialized by DP-k-MLE++ or DP-comp-k-MLE. In the case of a mixture
with a single component, a simple application of the MLE for $W_{d,n}$ is sufficient. Although a
Wishart distribution appears to be an inadequate model for the scatter matrix X of a movement,
it has been shown that this crude assumption provides good classification rates
on a real data set [23]. Learning the representations of the movements may be performed
offline since it is computationally demanding. Using the CS divergence as a dissimilarity, we
can then extract a taxonomy of movements with any spectral clustering algorithm. For
a query movement, its representation by a mixture has to be computed first. Then it is
possible to search the database for the most similar movements according to the CS
divergence or to predict its type by a majority vote among them. More details of the
implementation and results for the real dataset will be given in a forthcoming technical report.
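
The query step described above can be sketched as follows, assuming the CS divergences between the query mixture and every database mixture have already been computed. The function name `retrieve`, the parameter `k` and the label strings are hypothetical and serve only as an illustration.

```python
from collections import Counter

def retrieve(dissimilarities, labels, k=3):
    """Return the indices of the k most similar database entries
    (smallest dissimilarity first) and the majority label among them."""
    order = sorted(range(len(dissimilarities)), key=lambda i: dissimilarities[i])
    nearest = order[:k]
    votes = Counter(labels[i] for i in nearest)
    return nearest, votes.most_common(1)[0][0]
```

For instance, `retrieve([0.9, 0.1, 0.4, 0.2, 1.5], ["walk", "run", "walk", "run", "jump"], k=3)` selects entries 1, 3 and 2 and predicts the type "run".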

11.6 Conclusions and Perspectives

Hartigan’s swap clustering method for k-MLE was studied for the general case of
an exponential family. Unlike for k-means, this method is not guaranteed to avoid
empty clusters, but it generally achieves better performance than Lloyd’s heuristic.
Two methods, DP-k-MLE and DP-comp-k-MLE, are proposed to initialize k-MLE
automatically by setting the number of clusters. While the former shares the good
properties of k-MLE, the latter selects the component distributions given some extra
knowledge. A small experiment indicates that these methods appear to be quite sensitive
to their only parameter.
We recalled the definition and some properties of the Wishart distribution $W_d$,
especially its canonical decomposition as a member of an exponential family. By
fixing either one of its two parameters n and S, two other (nested) exponential (sub-)
families $W_{d,n}$ and $W_{d,S}$ may be defined. From their respective MLEs, it is possible
to define an iterative process which provably converges to the MLE for $W_d$. For a
single observation, the MLE does not exist. A crude solution is then to replace the
MLE in $W_d$ by the MLE in one of the two sub-families.
The MLE is an example of a point estimator among many others (e.g. method of
moments, minimax estimators, Bayesian point estimators). This suggests as future
work many other learning algorithms such as k-MoM, k-Minimax [30], k-MAP
following the same algorithmic scheme as k-MLE.
Finally, an application to the retrieval of motion-captured motions is proposed. Each
motion is described by a Wishart mixture model and the Cauchy-Schwarz divergence
is used as a dissimilarity measure between two mixture models. As the CS divergence
is always available in closed form, it is fast to compute compared to
stochastic integration estimation schemes. This divergence can be used in spectral
clustering methods and for the visualization of a set of motions in a Euclidean embedding.
Another perspective is the connection between the closed-form divergences between
mixtures and kernels based on divergences [31]: the CS divergence looks
similar to the Normalized Correlation Kernel [32]. This could lead to a broader class
of methods (e.g., SVM) using these divergences.

Appendix A

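This Appendix details some calculations for the distributions $W_d$, $W_{d,n}$ and $W_{d,S}$.

A.1 Wishart Distribution Wd

$$\begin{aligned}
W_d(X; n, S) &= \frac{|X|^{\frac{n-d-1}{2}} \exp\{-\frac{1}{2} \mathrm{tr}(S^{-1} X)\}}{2^{\frac{nd}{2}} |S|^{\frac{n}{2}} \Gamma_d(\frac{n}{2})} \\
&= \exp\Big\{ \frac{n-d-1}{2} \log|X| - \frac{1}{2} \mathrm{tr}(S^{-1} X) - \frac{nd}{2} \log(2) - \frac{n}{2} \log|S| - \log \Gamma_d\Big(\frac{n}{2}\Big) \Big\}
\end{aligned}$$

Letting $(\theta_n, \theta_S) = (\frac{n-d-1}{2}, S^{-1}) \iff (n, S) = (2\theta_n + d + 1, \theta_S^{-1})$,

$$\begin{aligned}
W_d(X; \theta_n, \theta_S) &= \exp\Big\{ \frac{2\theta_n + d + 1 - d - 1}{2} \log|X| - \frac{1}{2} \mathrm{tr}(\theta_S X) - \frac{(2\theta_n + d + 1)d}{2} \log(2) \\
&\qquad\quad - \frac{2\theta_n + d + 1}{2} \log|\theta_S^{-1}| - \log \Gamma_d\Big(\frac{2\theta_n + d + 1}{2}\Big) \Big\} \\
&= \exp\Big\{ \theta_n \log|X| - \frac{1}{2} \mathrm{tr}(\theta_S X) - \Big(\theta_n + \frac{d+1}{2}\Big)(d \log(2) - \log|\theta_S|) - \log \Gamma_d\Big(\theta_n + \frac{d+1}{2}\Big) \Big\} \\
&= \exp\Big\{ \langle \theta_n, \log|X| \rangle_{\mathbb{R}} + \Big\langle \theta_S, -\frac{1}{2} X \Big\rangle_{HS} - F(\Theta) \Big\} \\
&= \exp\{ \langle \Theta, t(X) \rangle - F(\Theta) + k(X) \}
\end{aligned}$$

with $t(X) = (\log|X|, -\frac{1}{2} X)$, $k(X) = 0$ and

$$F(\Theta) = \Big(\theta_n + \frac{d+1}{2}\Big)(d \log(2) - \log|\theta_S|) + \log \Gamma_d\Big(\theta_n + \frac{d+1}{2}\Big).$$

The partial derivatives of F are

$$\frac{\partial F}{\partial \theta_n}(\theta_n, \theta_S) = d \log(2) - \log|\theta_S| + \Psi_d\Big(\theta_n + \frac{d+1}{2}\Big) \tag{11.43}$$

where $\Psi_d$ is the multivariate Digamma function (or multivariate polygamma of order 0), and

$$\frac{\partial F}{\partial \theta_S}(\theta_n, \theta_S) = -\Big(\theta_n + \frac{d+1}{2}\Big) \theta_S^{-1}. \tag{11.44}$$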

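Dissimilarity $\Delta(\theta, \theta')$ between natural parameters $\theta = (\theta_n, \theta_S)$ and $\theta' = (\theta'_n, \theta'_S)$ is

$$\begin{aligned}
\Delta(\theta, \theta') ={}& F(\theta + \theta') - (F(\theta) + F(\theta')) \\
={}& \Big(\theta_n + \theta'_n + \frac{d+1}{2}\Big)\big(d \log(2) - \log|\theta_S + \theta'_S|\big) \\
&- \Big(\theta_n + \frac{d+1}{2}\Big)(d \log(2) - \log|\theta_S|) - \Big(\theta'_n + \frac{d+1}{2}\Big)\big(d \log(2) - \log|\theta'_S|\big) \\
&+ \log \Gamma_d\Big(\theta_n + \theta'_n + \frac{d+1}{2}\Big) - \log \Gamma_d\Big(\theta_n + \frac{d+1}{2}\Big) - \log \Gamma_d\Big(\theta'_n + \frac{d+1}{2}\Big) \\
={}& -\frac{d+1}{2}\, d \log(2) + \Big(\theta_n + \frac{d+1}{2}\Big) \log|\theta_S| + \Big(\theta'_n + \frac{d+1}{2}\Big) \log|\theta'_S| \\
&- \Big(\theta_n + \theta'_n + \frac{d+1}{2}\Big) \log|\theta_S + \theta'_S| + \log \frac{\Gamma_d\big(\theta_n + \theta'_n + \frac{d+1}{2}\big)}{\Gamma_d\big(\theta_n + \frac{d+1}{2}\big)\, \Gamma_d\big(\theta'_n + \frac{d+1}{2}\big)}
\end{aligned} \tag{11.45}$$

Remark: $\Delta(\theta, \theta) \neq 0$. The same quantity with source parameters $\lambda = (n, S)$ and $\lambda' = (n', S')$ is

$$\begin{aligned}
\Delta(\lambda, \lambda') ={}& -\frac{d+1}{2}\, d \log(2) - \frac{n}{2} \log|S| - \frac{n'}{2} \log|S'| - \frac{n + n' - d - 1}{2} \log|S^{-1} + S'^{-1}| \\
&+ \log \frac{\Gamma_d\big(\frac{n + n' - d - 1}{2}\big)}{\Gamma_d\big(\frac{n}{2}\big)\, \Gamma_d\big(\frac{n'}{2}\big)}
\end{aligned} \tag{11.46}$$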

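A.2 Distribution Wd,n

$$\begin{aligned}
W_d(X; n, S) &= \frac{|X|^{\frac{n-d-1}{2}} \exp\{-\frac{1}{2} \mathrm{tr}(S^{-1} X)\}}{2^{\frac{nd}{2}} |S|^{\frac{n}{2}} \Gamma_d(\frac{n}{2})} \\
&= \exp\Big\{ \frac{n-d-1}{2} \log|X| - \frac{1}{2} \mathrm{tr}(S^{-1} X) - \frac{nd}{2} \log(2) - \frac{n}{2} \log|S| - \log \Gamma_d\Big(\frac{n}{2}\Big) \Big\}
\end{aligned}$$

Letting $\theta_S = S^{-1}$,

$$\begin{aligned}
W_d(X; n, \theta_S) &= \exp\Big\{ -\frac{1}{2} \mathrm{tr}(\theta_S X) + \frac{n-d-1}{2} \log|X| - \frac{nd}{2} \log(2) - \frac{n}{2} \log|\theta_S^{-1}| - \log \Gamma_d\Big(\frac{n}{2}\Big) \Big\} \\
&= \exp\Big\{ \Big\langle \theta_S, -\frac{1}{2} X \Big\rangle_{HS} + k_n(X) - F_n(\theta_S) \Big\}
\end{aligned}$$

with $F_n(\theta_S) = \frac{nd}{2} \log(2) - \frac{n}{2} \log|\theta_S| + \log \Gamma_d(\frac{n}{2})$ and $k_n(X) = \frac{n-d-1}{2} \log|X|$.

Using the rule $\frac{\partial \log|X|}{\partial X} = {}^t(X^{-1})$ [33] and the symmetry of $\theta_S$, we get

$$\nabla_{\theta_S} F_n(\theta_S) = -\frac{n}{2} \theta_S^{-1}.$$

The correspondence between natural parameter $\theta_S$ and expectation parameter $\eta_S$ is

$$\eta_S = \nabla_{\theta_S} F_n(\theta_S) = -\frac{n}{2} \theta_S^{-1} \iff \theta_S = \nabla_{\eta_S} F_n^*(\eta_S) = (\nabla_{\theta_S} F_n)^{-1}(\eta_S) = -\frac{n}{2} \eta_S^{-1}.$$

Finally, we obtain the MLE for $\theta_S$ in this sub-family:

$$\hat{\theta}_S = -\frac{n}{2} \Big( \frac{1}{N} \sum_{i=1}^{N} -\frac{1}{2} X_i \Big)^{-1} = nN \Big( \sum_{i=1}^{N} X_i \Big)^{-1}$$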
Same formulation with source parameter S:

$$\hat{S} = \hat{\theta}_S^{-1} = \Big( nN \Big( \sum_{i=1}^{N} X_i \Big)^{-1} \Big)^{-1} = \frac{\sum_{i=1}^{N} X_i}{nN}$$
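
The closed-form estimator above translates directly into a few lines of code. The sketch below is only an illustration: the function name `mle_scale_fixed_n` is hypothetical, and the observations are assumed to be stacked in an array of shape (N, d, d).

```python
import numpy as np

def mle_scale_fixed_n(X, n):
    """MLE in the sub-family W_{d,n} (n known).

    X: array of shape (N, d, d) holding the observed matrices X_i.
    Returns (theta_S_hat, S_hat) with S_hat = sum_i X_i / (n N)
    and theta_S_hat = n N (sum_i X_i)^{-1}.
    """
    X = np.asarray(X, dtype=float)
    N = X.shape[0]
    total = X.sum(axis=0)
    theta_S_hat = n * N * np.linalg.inv(total)
    S_hat = total / (n * N)
    return theta_S_hat, S_hat
```

Since the expectation of a Wishart matrix is nS, the estimate is simply the sample mean of the observations divided by n.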

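Dual log-normalizer $F_n^*$ for $W_{d,n}$ is

$$\begin{aligned}
F_n^*(\eta_S) &= \langle (\nabla F_n)^{-1}(\eta_S), \eta_S \rangle - F_n((\nabla F_n)^{-1}(\eta_S)) \\
&= \Big\langle -\frac{n}{2} \eta_S^{-1}, \eta_S \Big\rangle - F_n\Big(-\frac{n}{2} \eta_S^{-1}\Big) \\
&= -\frac{n}{2} \mathrm{tr}(\eta_S^{-1} \eta_S) - \frac{nd}{2} \log(2) + \frac{n}{2} \log\Big( \Big(\frac{n}{2}\Big)^d |-\eta_S^{-1}| \Big) - \log \Gamma_d\Big(\frac{n}{2}\Big) \\
&= -\frac{nd}{2} (1 + \log(2) - \log n + \log 2) + \frac{n}{2} \log|-\eta_S^{-1}| - \log \Gamma_d\Big(\frac{n}{2}\Big) \\
&= \frac{nd}{2} \log\Big(\frac{n}{4e}\Big) + \frac{n}{2} \log|-\eta_S^{-1}| - \log \Gamma_d\Big(\frac{n}{2}\Big)
\end{aligned}$$

$$\begin{aligned}
\mathrm{KL}(W^1_{d,n} \| W^2_{d,n}) &= B_{F_n}(\theta_{S_2} : \theta_{S_1}) \\
&= F_n(\theta_{S_2}) - F_n(\theta_{S_1}) - \langle \theta_{S_2} - \theta_{S_1}, \nabla_{\theta_S} F_n(\theta_{S_1}) \rangle \\
&= \frac{n}{2} \big( \log|\theta_{S_1}| - \log|\theta_{S_2}| \big) + \frac{n}{2} \mathrm{tr}\big((\theta_{S_2} - \theta_{S_1}) \theta_{S_1}^{-1}\big) \\
&= \frac{n}{2} \Big( \log \frac{|\theta_{S_1}|}{|\theta_{S_2}|} + \mathrm{tr}(\theta_{S_2} \theta_{S_1}^{-1}) - d \Big) \\
&= \frac{n}{2} \Big( -\log \frac{|\theta_{S_2}|}{|\theta_{S_1}|} + \mathrm{tr}(\theta_{S_2} \theta_{S_1}^{-1}) - d \Big)
\end{aligned}$$

also with source parameter

$$\mathrm{KL}(W^1_{d,n} \| W^2_{d,n}) = \frac{n}{2} \Big( -\log \frac{|S_1|}{|S_2|} + \mathrm{tr}(S_2^{-1} S_1) - d \Big)$$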

Let’s remark that KL divergence depends now on n.
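
A small numerical sketch of the source-parameter form of this KL divergence is given below. The function name `kl_wishart_fixed_n` is an assumption made for this example. Log-determinants are computed with `numpy.linalg.slogdet`, and the trace term uses a linear solve instead of an explicit inverse.

```python
import numpy as np

def kl_wishart_fixed_n(n, S1, S2):
    """KL(W_{d,n}(S1) || W_{d,n}(S2)) for two Wishart laws sharing the same n:
    n/2 * ( -log(|S1| / |S2|) + tr(S2^{-1} S1) - d )."""
    S1 = np.asarray(S1, dtype=float)
    S2 = np.asarray(S2, dtype=float)
    d = S1.shape[0]
    _, logdet1 = np.linalg.slogdet(S1)
    _, logdet2 = np.linalg.slogdet(S2)
    trace_term = np.trace(np.linalg.solve(S2, S1))  # tr(S2^{-1} S1)
    return 0.5 * n * (logdet2 - logdet1 + trace_term - d)
```

The value is zero when S1 = S2 and scales linearly with n, which is the dependence on n remarked above.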

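$$\begin{aligned}
B_{F_n^*}(\eta_{S_1} : \eta_{S_2}) &= F_n^*(\eta_{S_1}) - F_n^*(\eta_{S_2}) - \langle \eta_{S_1} - \eta_{S_2}, \nabla F_n^*(\eta_{S_2}) \rangle_{HS} \\
&= \frac{n}{2} \big( \log|-\eta_{S_1}^{-1}| - \log|-\eta_{S_2}^{-1}| \big) - \Big\langle \eta_{S_1} - \eta_{S_2}, -\frac{n}{2} \eta_{S_2}^{-1} \Big\rangle_{HS} \\
&= \frac{n}{2} \Big( \log \frac{|-\eta_{S_1}^{-1}|}{|-\eta_{S_2}^{-1}|} + \mathrm{tr}(\eta_{S_1} \eta_{S_2}^{-1}) - d \Big)
\end{aligned}$$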

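A.3 Distribution Wd,S

For fixed S, the p.d.f. of $W_{d,S}$ can be rewritten (see footnote 4) as

$$\begin{aligned}
W_d(X; n, S) &= \frac{|X|^{\frac{n-d-1}{2}} \exp\{-\frac{1}{2} \mathrm{tr}(S^{-1} X)\}}{|2S|^{\frac{n}{2}} \Gamma_d(\frac{n}{2})} \\
&= \exp\Big\{ \frac{n-d-1}{2} \log|X| - \frac{1}{2} \mathrm{tr}(S^{-1} X) - \frac{n}{2} \log|2S| - \log \Gamma_d\Big(\frac{n}{2}\Big) \Big\}
\end{aligned}$$

Letting $\theta_n = \frac{n-d-1}{2}$ ($n = 2\theta_n + d + 1$),

$$\begin{aligned}
W_d(X; \theta_n, S) &= \exp\Big\{ \theta_n \log|X| - \frac{1}{2} \mathrm{tr}(S^{-1} X) - \Big(\theta_n + \frac{d+1}{2}\Big) \log|2S| - \log \Gamma_d\Big(\theta_n + \frac{d+1}{2}\Big) \Big\} \\
&= \exp\{ \langle \theta_n, \log|X| \rangle + k_S(X) - F_S(\theta_n) \}
\end{aligned}$$

with $F_S(\theta_n) = \big(\theta_n + \frac{d+1}{2}\big) \log|2S| + \log \Gamma_d\big(\theta_n + \frac{d+1}{2}\big)$ and $k_S(X) = -\frac{1}{2} \mathrm{tr}(S^{-1} X)$.

The correspondence between natural parameter $\theta_n$ and expectation parameter $\eta_n$ is

$$\begin{aligned}
\eta_n = \nabla_{\theta_n} F_S(\theta_n) &= \log|2S| + \Psi_d\Big(\theta_n + \frac{d+1}{2}\Big) \\
\iff \Psi_d\Big(\theta_n + \frac{d+1}{2}\Big) &= \eta_n - \log|2S| \\
\iff \theta_n + \frac{d+1}{2} &= \Psi_d^{-1}\big(\eta_n - \log|2S|\big) \\
\iff \theta_n &= \Psi_d^{-1}\big(\eta_n - \log|2S|\big) - \frac{d+1}{2} = (\nabla F_S)^{-1}(\eta_n) = \nabla F_S^*(\eta_n)
\end{aligned}$$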

Finally, we obtain the MLE for $\theta_n$ in this sub-family:

$$\hat{\theta}_n = \Psi_d^{-1}\Big( \frac{1}{N} \sum_{i=1}^{N} \log|X_i| - \log|2S| \Big) - \frac{d+1}{2}$$

Same formulation with source parameter n:

$$\frac{\hat{n} - d - 1}{2} = \Psi_d^{-1}\Big( \frac{1}{N} \sum_{i=1}^{N} \log|X_i| - \log|2S| \Big) - \frac{d+1}{2}$$

$$\hat{n} = 2 \Psi_d^{-1}\Big( \frac{1}{N} \sum_{i=1}^{N} \log|X_i| - \log|2S| \Big)$$

Footnote 4: Since $|2S| = 2^d |S|$, the factor $2^{\frac{nd}{2}} |S|^{\frac{n}{2}}$ is equal to $|2S|^{\frac{n}{2}}$.
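
The estimator of n requires inverting the multivariate digamma function, which has no closed form. The following sketch is one possible numerical approach and not the method used by the authors: it assumes that $\Psi_d(a) = \sum_{i=0}^{d-1} \psi(a - i/2)$ is inverted by bisection (it is increasing on $a > (d-1)/2$), and that the scalar digamma is approximated by the usual recurrence and asymptotic series. All function names are hypothetical.

```python
import math
import numpy as np

def digamma(x):
    """Scalar digamma for x > 0: recurrence up to x >= 6, then asymptotic series."""
    if x <= 0.0:
        raise ValueError("digamma sketch only supports x > 0")
    shift = 0.0
    while x < 6.0:
        shift -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return shift + math.log(x) - 0.5 / x - series

def multidigamma(a, d):
    """Multivariate digamma Psi_d(a), defined for a > (d - 1) / 2."""
    return sum(digamma(a - i / 2.0) for i in range(d))

def inv_multidigamma(y, d, tol=1e-12):
    """Solve Psi_d(a) = y for a by bisection."""
    lo = (d - 1) / 2.0            # Psi_d tends to -infinity at this bound
    hi = lo + 1.0
    while multidigamma(hi, d) < y:
        hi = lo + 2.0 * (hi - lo)  # enlarge the bracket until it contains the root
    for _ in range(200):
        if hi - lo <= tol * max(1.0, hi):
            break
        mid = 0.5 * (lo + hi)
        if multidigamma(mid, d) < y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def mle_dof_fixed_S(X, S):
    """MLE of n in W_{d,S}: n_hat = 2 Psi_d^{-1}(mean_i log|X_i| - log|2S|)."""
    S = np.asarray(S, dtype=float)
    d = S.shape[0]
    mean_logdet = np.mean([np.linalg.slogdet(np.asarray(Xi, dtype=float))[1] for Xi in X])
    logdet_2S = np.linalg.slogdet(2.0 * S)[1]
    return 2.0 * inv_multidigamma(mean_logdet - logdet_2S, d)
```

A root-finding routine from a numerical library could replace the bisection; it is written out here only to keep the example self-contained.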

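Dual log-normalizer $F_S^*$ for $W_{d,S}$ is

$$\begin{aligned}
F_S^*(\eta_n) &= \langle (\nabla F_S)^{-1}(\eta_n), \eta_n \rangle - F_S((\nabla F_S)^{-1}(\eta_n)) \\
&= \Big\langle \Psi_d^{-1}\big(\eta_n - \log|2S|\big) - \frac{d+1}{2}, \eta_n \Big\rangle - \Psi_d^{-1}\big(\eta_n - \log|2S|\big) \log|2S| - \log \Gamma_d\big( \Psi_d^{-1}(\eta_n - \log|2S|) \big) \\
&= \Psi_d^{-1}\big(\eta_n - \log|2S|\big) \big(\eta_n - \log|2S|\big) - \frac{d+1}{2} \eta_n - \log \Gamma_d\big( \Psi_d^{-1}(\eta_n - \log|2S|) \big)
\end{aligned}$$

$$\begin{aligned}
\mathrm{KL}(W^1_{d,S} \| W^2_{d,S}) ={}& B_{F_S}(\theta_{n_2} : \theta_{n_1}) = F_S(\theta_{n_2}) - F_S(\theta_{n_1}) - \langle \theta_{n_2} - \theta_{n_1}, \nabla F_S(\theta_{n_1}) \rangle \\
={}& \Big(\theta_{n_2} + \frac{d+1}{2}\Big) \log|2S| + \log \Gamma_d\Big(\theta_{n_2} + \frac{d+1}{2}\Big) \\
&- \Big(\theta_{n_1} + \frac{d+1}{2}\Big) \log|2S| - \log \Gamma_d\Big(\theta_{n_1} + \frac{d+1}{2}\Big) \\
&- \Big\langle \theta_{n_2} - \theta_{n_1}, \log|2S| + \Psi_d\Big(\theta_{n_1} + \frac{d+1}{2}\Big) \Big\rangle \\
={}& \log \frac{\Gamma_d\big(\theta_{n_2} + \frac{d+1}{2}\big)}{\Gamma_d\big(\theta_{n_1} + \frac{d+1}{2}\big)} - (\theta_{n_2} - \theta_{n_1}) \Psi_d\Big(\theta_{n_1} + \frac{d+1}{2}\Big) \\
={}& -\log \frac{\Gamma_d\big(\theta_{n_1} + \frac{d+1}{2}\big)}{\Gamma_d\big(\theta_{n_2} + \frac{d+1}{2}\big)} + (\theta_{n_1} - \theta_{n_2}) \Psi_d\Big(\theta_{n_1} + \frac{d+1}{2}\Big)
\end{aligned}$$

also with source parameter

$$\mathrm{KL}(W^1_{d,S} \| W^2_{d,S}) = -\log \frac{\Gamma_d\big(\frac{n_1}{2}\big)}{\Gamma_d\big(\frac{n_2}{2}\big)} + \frac{n_1 - n_2}{2} \Psi_d\Big(\frac{n_1}{2}\Big)$$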

Let us remark that this quantity does not depend on S.

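$$\begin{aligned}
B_{F_S^*}(\eta_{n_1} : \eta_{n_2}) ={}& F_S^*(\eta_{n_1}) - F_S^*(\eta_{n_2}) - \langle \eta_{n_1} - \eta_{n_2}, \nabla F_S^*(\eta_{n_2}) \rangle_{HS} \\
={}& \Psi_d^{-1}\big(\eta_{n_1} - \log|2S|\big) \big(\eta_{n_1} - \log|2S|\big) - \frac{d+1}{2} \eta_{n_1} - \log \Gamma_d\big( \Psi_d^{-1}(\eta_{n_1} - \log|2S|) \big) \\
&- \Psi_d^{-1}\big(\eta_{n_2} - \log|2S|\big) \big(\eta_{n_2} - \log|2S|\big) + \frac{d+1}{2} \eta_{n_2} + \log \Gamma_d\big( \Psi_d^{-1}(\eta_{n_2} - \log|2S|) \big) \\
&- \Big\langle \eta_{n_1} - \eta_{n_2}, \Psi_d^{-1}\big(\eta_{n_2} - \log|2S|\big) - \frac{d+1}{2} \Big\rangle_{HS} \\
={}& \log \frac{\Gamma_d\big( \Psi_d^{-1}(\eta_{n_2} - \log|2S|) \big)}{\Gamma_d\big( \Psi_d^{-1}(\eta_{n_1} - \log|2S|) \big)} \\
&- \Big( \Psi_d^{-1}\big(\eta_{n_2} - \log|2S|\big) - \Psi_d^{-1}\big(\eta_{n_1} - \log|2S|\big) \Big) \big(\eta_{n_1} - \log|2S|\big)
\end{aligned}$$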

References

1. McLachlan, G.J., Krishnan, T.: The EM Algorithm and Extensions, 2nd edn. Wiley Series in
Probability and Statistics. Wiley-Interscience, New York (2008)
2. Banerjee, A., Merugu, S., Dhillon, I.S., Ghosh, J.: Clustering with Bregman divergences. J.
Mach. Learn. Res. 6, 1705–1749 (2005)
3. Nielsen, F.: k-MLE: a fast algorithm for learning statistical mixture models. In: International
Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 869–872 (2012). Long
version as arXiv:1203.5181
4. Jain, A.K.: Data clustering: 50 years beyond K -means. Pattern Recogn. Lett. 31, 651–666
(2010)
5. Wishart, J.: The generalised product moment distribution in samples from a Normal multivariate
population. Biometrika 20(1/2), 32–52 (1928)
6. Tsai, M.-T.: Maximum likelihood estimation of Wishart mean matrices under Löwner order
restrictions. J. Multivar. Anal. 98(5), 932–944 (2007)
7. Formont, P., Pascal, T., Vasile, G., Ovarlez, J.-P., Ferro-Famil, L.: Statistical classification for
heterogeneous polarimetric SAR images. IEEE J. Sel. Top. Sign. Proces. 5(3), 567–576 (2011)
8. Jian, B., Vemuri, B.: Multi-fiber reconstruction from diffusion MRI using mixture of wisharts
and sparse deconvolution. In: Information Processing in Medical Imaging, pp. 384–395,
Springer, Berlin (2007)
9. Cherian, A., Morellas, V., Papanikolopoulos, N., Bedros, S.: Dirichlet process mixture mod-
els on symmetric positive definite matrices for appearance clustering in video surveillance
applications. In: Computer Vision and Pattern Recognition (CVPR), pp. 3417–3424 (2011)
10. Nielsen, F., Garcia, V.: Statistical exponential families: a digest with flash cards.
http://arxiv.org/abs/0911.4863. Accessed Nov 2009
11. Rockafellar, R.T.: Convex Analysis, vol. 28. Princeton University Press, Princeton (1997)
12. Wainwright, M.J., Jordan, M.I.: Graphical models, exponential families, and variational
inference. Found. Trends Mach. Learn. 1(1–2), 1–305 (2008)
13. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the
EM algorithm. J. Roy. Stat. Soc. B (Methodological) 39(1), 1–38 (1977)
14. Celeux, G., Govaert, G.: A classification EM algorithm for clustering and two stochastic ver-
sions. Comput. Stat. Data Anal. 14(3), 315–332 (1992)
15. Hartigan, J.A., Wong, M.A.: Algorithm AS 136: A k-means clustering algorithm. J. Roy. Stat.
Soc. C (Applied Statistics). 28(1), 100–108 (1979)
16. Telgarsky, M., Vattani, A.: Hartigan’s method: k-means clustering without Voronoi. In: Pro-
ceedings of International Conference on Artificial Intelligence and Statistics (AISTATS), pp.
820–827 (2010)
17. Nielsen, F., Boissonnat, J.D., Nock, R.: On Bregman Voronoi diagrams. In: ACM-SIAM Sym-
posium on Discrete Algorithms, pp. 746–755 (2007)
18. Xu, R., Wunsch, D.: Survey of clustering algorithms. IEEE Trans. Neural Networks 16(3),
645–678 (2005)
19. Kulis, B., Jordan, M.I.: Revisiting k-means: new algorithms via Bayesian nonparametrics. In:
International Conference on Machine Learning (ICML) (2012)
20. Ackermann, M.R.: Algorithms for the Bregman K -median problem. PhD thesis. Paderborn
University (2009)
21. Arthur, D., Vassilvitskii, S.: k-means++: the advantages of careful seeding. In: Proceedings
of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1027–1035
(2007)
22. Ji, S., Krishnapuram, B., Carin, L.: Variational Bayes for continuous hidden Markov models
and its application to active learning. IEEE Trans. Pattern Anal. Mach. Intell. 28(4), 522–532
(2006)
23. Hidot, S., Saint-Jean, C.: An Expectation-Maximization algorithm for the Wishart mixture
model: application to movement clustering. Pattern Recogn. Lett. 31(14), 2318–2324 (2010)
24. Brent, R.P.: Algorithms for Minimization Without Derivatives. Courier Dover Publications,
Mineola (1973)
25. Bezdek, J.C., Hathaway, R.J., Howard, R.E., Wilson, C.A., Windham, M.P.: Local convergence
analysis of a grouped variable version of coordinate descent. J. Optim. Theory Appl. 54(3),
471–477 (1987)
26. Bogdan, K., Bogdan, M.: On existence of maximum likelihood estimators in exponential fam-
ilies. Statistics 34(2), 137–149 (2000)
27. Ciuperca, G., Ridolfi, A., Idier, J.: Penalized maximum likelihood estimator for normal mix-
tures. Scand. J. Stat. 30(1), 45–59 (2003)
28. Vinh, N.X., Epps, J., Bailey, J.: Information theoretic measures for clusterings comparison:
variants, properties, normalization and correction for chance. J. Mach. Learn. Res. 11, 2837–
2854 (2010)
29. Nielsen, F.: Closed-form information-theoretic divergences for statistical mixtures. In: Inter-
national Conference on Pattern Recognition (ICPR), pp. 1723–1726 (2012)
30. Haff, L.R., Kim, P.T., Koo, J.-Y., Richards, D.: Minimax estimation for mixtures of Wishart
distributions. Ann. Stat. 39(6), 3417–3440 (2011)
31. Jebara, T., Kondor, R., Howard, A.: Probability product kernels. J. Mach. Learn. Res. 5, 819–
844 (2004)
32. Moreno, P.J., Ho, P., Vasconcelos, N.: A Kullback-Leibler divergence based kernel for SVM
classification in multimedia applications. In: Advances in Neural Information Processing Sys-
tems (2003)
33. Petersen, K.B., Pedersen, M.S.: The matrix cookbook. http://www2.imm.dtu.dk/pubdb/p.php?3274.
Accessed Nov 2012
Chapter 12
Morphological Processing of Univariate
Gaussian Distribution-Valued Images Based
on Poincaré Upper-Half Plane Representation

Jesús Angulo and Santiago Velasco-Forero

Abstract Mathematical morphology is a nonlinear image processing methodology
based on the application of complete lattice theory to spatial structures. Let us
consider an image model where a univariate Gaussian distribution is given at each
pixel. This model is useful for representing, at each pixel, the measured mean
intensity as well as the variance (or uncertainty) of such a measurement. The aim of
this work is to formulate morphological operators for these images by embedding
Gaussian distribution pixel values in the Poincaré upper-half plane. More precisely,
we explore how to endow this classical hyperbolic space with various families
of partial orderings which lead to a complete lattice structure. Properties of order
invariance are explored and the application to morphological processing of univariate
Gaussian distribution-valued images is illustrated.

Keywords Ordered Poincaré half-plane · Hyperbolic partial ordering · Hyperbolic


complete lattice · Mathematical morphology · Gaussian-distribution valued image ·
Information geometry image filtering

12.1 Introduction

This work is motivated by the exploration of a mathematical image model f where,
instead of having a scalar intensity $t \in \mathbb{R}$ at each pixel p, i.e., f(p) = t, we have a
univariate Gaussian probability distribution of intensities $N(\mu, \sigma^2) \in \mathcal{N}$, i.e., image
f is defined as the function

J. Angulo (B)
CMM-Centre de Morphologie Mathématique, Mathématiques et Systèmes,
MINES ParisTech, Paris, France
S. Velasco-Forero
Department of Mathematics, National University of Singapore, Buona Vista, Singapore

F. Nielsen (ed.), Geometric Theory of Information, 331


Signals and Communication Technology, DOI: 10.1007/978-3-319-05317-2_12,
© Springer International Publishing Switzerland 2014
$$f : \begin{cases} \Delta \to \mathcal{N} \\ p \mapsto N(\mu, \sigma^2) \end{cases}$$

where Δ is the support space of pixels p (e.g., for 2D images $\Delta \subset \mathbb{Z}^2$) and $\mathcal{N}$ denotes
the family of univariate Gaussian probability distribution functions (pdf). Nowadays
most imaging sensors only produce single scalar values, since CCD (charge-coupled
device) cameras typically integrate the light (arriving photons) during a
given exposure time ∂. To increase the signal-to-noise ratio (SNR), the exposure time is
increased to ∂′ = τ∂, τ > 1. Supposing that τ is a positive integer, this is
equivalent to a multiple acquisition of τ frames, with exposure time ∂ for each frame (i.e., a kind of
temporal oversampling). The standard approach only considers the sum (or average)
of the multiple intensities [28], without taking into account the variance, which is
a basic estimator of the noise useful for probabilistic image processing. Another
example of such a representation from a gray scale image consists in considering that
each pixel is described by the mean and the variance of the intensity distribution from
its centered neighboring patch. This model has for instance been used recently in [10]
for computing local estimators which can be interpreted as pseudo-morphological
operators.
Let us consider the example of gray scale image parameterized by the mean and
the standard deviation of patches given in Fig. 12.1. We observe that the underlying
geometry of this space of patches is not Euclidean, e.g., the geodesics are clearly
curves. In fact, as we discuss in the paper, this parametrization corresponds to one
of the models of hyperbolic geometry.
Henceforth, the corresponding image processing operators should be able to deal
with Gaussian distribution-valued pixels. In particular, morphological operators for
images $f \in \mathcal{F}(\Delta, \mathcal{N})$ require that the space of Gaussian distributions $\mathcal{N}$ be
endowed with a partial ordering leading to a complete lattice structure. In practice, it
means that given a set of Gaussian pdfs, as in the example given in Fig. 12.2, we need
to be able to define a Gaussian pdf which corresponds to the infimum (inf) of the
set and another one to the supremum (sup). Mathematical morphology is a nonlinear
image processing methodology based on the computation of sup/inf-convolution
filters (i.e., dilation/erosion operators) in local neighborhoods [31]. Mathematical
morphology is theoretically formulated in the framework of complete lattices and
operators defined on them [21, 29]. When only the supremum or the infimum
is well defined, other morphological operators can be formulated in the framework
of complete semilattices [22, 23]. Both cases are considered here for images
$f \in \mathcal{F}(\Delta, \mathcal{N})$.
A possible way to deal with the partial ordering problem of N can be founded
on stochastic ordering (or stochastic dominance) [30] which is basically defined in
terms of majorization of cumulative distribution functions.
Fig. 12.1 Parametrization of a gray scale image where each pixel is described by the mean and
the variance of the intensity distribution from its centered neighboring patch of 5 × 5 pixels: a Left,
original gray scale image; center, image of mean of each patch; right, image of standard deviation
of each patch. b Visualization of the patch according to their coordinates in the space mean/std.dev

Fig. 12.2 a Example of a set of nine univariate Gaussian pdfs, $N_k(\mu_k, \sigma_k^2)$, $1 \le k \le 9$ (axes: x, $f_{\mu,\sigma^2}(x)$). b Same
set of Gaussian pdfs represented as points of coordinates $(x_k = \mu_k/\sqrt{2}, y_k = \sigma_k)$ in the upper-half
plane

However, we prefer to adopt here an information geometry approach [4], which
is based on considering that the univariate Gaussian pdfs are points in a hyperbolic
space [3, 9]. More generally, Fisher geometry amounts to hyperbolic geometry
of constant curvature for other location-scale families of probability distributions
(Cauchy, Laplace, elliptical) $p(x; \mu, \sigma) = \frac{1}{\sigma} f\big(\frac{x - \mu}{\sigma}\big)$, where the curvature depends on
the dimension and the density profile [3, 14, 15]. For a deeper look at hyperbolic
geometry see [12]. There are several models representing the hyperbolic space in
$\mathbb{R}^d$, d > 1, such as the three following ones: the (Poincaré) upper half-space model
$\mathcal{H}^d$, the Poincaré disk model $\mathcal{P}^d$ and the Klein disk model $\mathcal{K}^d$.
1. The (Poincaré) upper half-space model is the domain $\mathcal{H}^d = \{(x_1, \ldots, x_d) \in \mathbb{R}^d \mid x_d > 0\}$ with the Riemannian metric $ds^2 = \frac{dx_1^2 + \cdots + dx_d^2}{x_d^2}$;
2. The Poincaré disk model is the domain $\mathcal{P}^d = \{(x_1, \ldots, x_d) \in \mathbb{R}^d \mid x_1^2 + \cdots + x_d^2 < 1\}$ with the Riemannian metric $ds^2 = 4\, \frac{dx_1^2 + \cdots + dx_d^2}{(1 - x_1^2 - \cdots - x_d^2)^2}$;
3. The Klein disk model is the space $\mathcal{K}^d = \{(x_1, \ldots, x_d) \in \mathbb{R}^d \mid x_1^2 + \cdots + x_d^2 < 1\}$ with the Riemannian metric $ds^2 = \frac{dx_1^2 + \cdots + dx_d^2}{1 - x_1^2 - \cdots - x_d^2} + \frac{(x_1 dx_1 + \cdots + x_d dx_d)^2}{(1 - x_1^2 - \cdots - x_d^2)^2}$.

These models are isomorphic between them in the sense that one-to-one correspon-
dences can be set up between the points and lines in one model to the points and
lines in the other so as to preserve the relations of incidence, betweenness and con-
gruence. In particular, there exists an isometric mapping between any pair among
these models and analytical transformations to convert from one to another are well
known [12].
The Klein disk model has been considered for instance in computational information
geometry (Voronoï diagrams, clustering, etc.) [26] and the Poincaré disk model in
information geometric radar processing [5–7]. In this paper, we focus on the Poincaré
half-plane model, $\mathcal{H}^2$, which is sufficient for our practical purposes of manipulating
Gaussian pdfs. Figure 12.2b illustrates the example of a set of nine Gaussian pdfs
$N_k(\mu_k, \sigma_k^2)$ represented as points of coordinates $(\mu_k/\sqrt{2}, \sigma_k)$ in the upper-half plane
as follows:

$$(\mu, \sigma) \mapsto z = \frac{\mu}{\sqrt{2}} + i\sigma.$$

The rationale behind the scaling factor $\sqrt{2}$ is given in Sect. 12.3.
In summary, from a theoretical viewpoint, the aim of this paper is to endow
$\mathcal{H}^2$ with partial orderings which lead to useful invariance properties in order to
formulate appropriate morphological operators for images $f : \Delta \to \mathcal{H}^2$. This work is
an extension of the conference paper [1]. The rest of the paper is organized as follows.
Section 12.2 recalls the basics of the geometry of the Poincaré half-plane model. The
connection between the Poincaré half-plane model of hyperbolic geometry and the Fisher
information geometry of Gaussian distributions is briefly recalled in Sect. 12.3. Then,
various partial orderings on $\mathcal{H}^2$ are studied in Sect. 12.4. Based on the corresponding
complete lattice structure of $\mathcal{H}^2$, Sect. 12.5 presents the definition of morphological
operators for images in $\mathcal{F}(\Delta, \mathcal{H}^2)$ and its application to morphological processing of
univariate Gaussian distribution-valued images. Section 12.6 concludes the paper
with the perspectives of the present work.

12.2 Geometry of Poincaré Upper-Half Plane H2

In complex analysis, the upper-half plane is the set of complex numbers with positive
imaginary part:

$$\mathcal{H}^2 = \{z = x + iy \in \mathbb{C} \mid y > 0\}. \tag{12.1}$$

We also use the notation $x = \Re(z)$ and $y = \Im(z)$. The boundary of the upper-half plane
(sometimes called the circle at infinity) is the real axis together with the point at infinity, i.e.,
$\partial\mathcal{H}^2 = \mathbb{R} \cup \infty = \{z = x + iy \mid y = 0, x = \pm\infty, y = \infty\}$.

12.2.1 Riemannian Metric, Angle and Distance

In hyperbolic geometry, the Poincaré upper-half plane model (originated with
Beltrami and also known as Lobachevskii space in Soviet scientific literature) is
the space $\mathcal{H}^2$ together with the Poincaré metric

$$(g_{kl})_{k,l=1,2} = \begin{pmatrix} \frac{1}{y^2} & 0 \\ 0 & \frac{1}{y^2} \end{pmatrix} \tag{12.2}$$

such that the hyperbolic arc length is given, writing $(x^1, x^2) = (x, y)$, by

$$ds^2 = \sum_{k,l=1,2} g_{kl}\, dx^k dx^l = \frac{dx^2 + dy^2}{y^2} = \frac{|dz|^2}{y^2} = y^{-1} dz\; y^{-1} dz^*. \tag{12.3}$$

With this metric, the Poincaré upper-half plane is a complete Riemannian manifold
of constant sectional curvature K equal to −1. We can consider a continuum of
other hyperbolic spaces by multiplying the hyperbolic arc length (12.3) by a positive
constant k, which leads to a metric of constant Gaussian curvature $K = -1/k^2$. The
tangent space to $\mathcal{H}^2$ at a point z is defined as the space of tangent vectors at z. It
has the structure of a 2-dimensional real vector space, $T_z\mathcal{H}^2 \cong \mathbb{R}^2$. The Riemannian
metric (12.3) is induced by the following inner product on $T_z\mathcal{H}^2$: for $\psi_1, \psi_2 \in T_z\mathcal{H}^2$,
with $\psi_k = (\varphi_k, \beta_k)$, we put

$$\langle \psi_1, \psi_2 \rangle_z = \frac{(\psi_1, \psi_2)}{\Im(z)^2} \tag{12.4}$$

which is a scalar multiple of the Euclidean inner product $(\psi_1, \psi_2) = \varphi_1 \varphi_2 + \beta_1 \beta_2$.
The angle γ between two geodesics in $\mathcal{H}^2$ at their intersection point z is defined
as the angle between their tangent vectors in $T_z\mathcal{H}^2$, i.e.,

$$\cos \gamma = \frac{\langle \psi_1, \psi_2 \rangle_z}{\|\psi_1\|_z \|\psi_2\|_z} = \frac{(\psi_1, \psi_2)}{\sqrt{(\psi_1, \psi_1)} \sqrt{(\psi_2, \psi_2)}}. \tag{12.5}$$

We see that this notion of angle measure coincides with the Euclidean angle measure.
Consequently, the Poincaré upper-half plane is a conformal model.
The distance between two points $z_1 = x_1 + iy_1$ and $z_2 = x_2 + iy_2$ in $(\mathcal{H}^2, ds^2)$ is
the function

$$\mathrm{dist}_{\mathcal{H}^2}(z_1, z_2) = \cosh^{-1}\Big( 1 + \frac{(x_1 - x_2)^2 + (y_1 - y_2)^2}{2 y_1 y_2} \Big) \tag{12.6}$$

Distance (12.6) is derived from the logarithm of the cross-ratio between these two
points and the points at infinity, i.e., $\mathrm{dist}_{\mathcal{H}^2}(z_1, z_2) = \log D(z_1^\infty, z_1, z_2, z_2^\infty)$
where $D(z_1^\infty, z_1, z_2, z_2^\infty) = \frac{z_1 - z_2^\infty}{z_1 - z_1^\infty} \cdot \frac{z_2 - z_1^\infty}{z_2 - z_2^\infty}$. To obtain their equivalence, we recall
that $\cosh^{-1}(x) = \log(x + \sqrt{x^2 - 1})$. From this formulation it is easy to check that
for two points with $x_1 = x_2$ the distance is $\mathrm{dist}_{\mathcal{H}^2}(z_1, z_2) = \big| \log \frac{y_2}{y_1} \big|$.
To see that $\mathrm{dist}_{\mathcal{H}^2}(z_1, z_2)$ is a metric distance in $\mathcal{H}^2$, we first notice that the argument
of $\cosh^{-1}$ always lies in $[1, \infty)$ and that $\cosh(x) = \frac{e^x + e^{-x}}{2}$, so cosh is increasing and
convex on $[0, \infty)$. Thus $\cosh^{-1}(1) = 0$ and $\cosh^{-1}$ is increasing and concave
down on $[1, \infty)$, growing logarithmically. The properties required to be a metric
(non-negativity, symmetry and triangle inequality) are proven using the cross-ratio
formulation of the distance.
We note that the distance from any point $z \in \mathcal{H}^2$ to $\partial\mathcal{H}^2$ is infinity.
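
Formula (12.6) is straightforward to evaluate numerically. The sketch below represents points of the upper-half plane as Python complex numbers; the function names `gaussian_to_halfplane` and `dist_h2` are assumptions made for this illustration, and the embedding of a Gaussian uses the map $(\mu, \sigma) \mapsto \mu/\sqrt{2} + i\sigma$ introduced in Sect. 12.1.

```python
import math

def gaussian_to_halfplane(mu, sigma):
    """Embed N(mu, sigma^2) as the point z = mu / sqrt(2) + i sigma."""
    return complex(mu / math.sqrt(2.0), sigma)

def dist_h2(z1, z2):
    """Hyperbolic distance (12.6) between two points with positive imaginary part."""
    squared_gap = (z1.real - z2.real) ** 2 + (z1.imag - z2.imag) ** 2
    return math.acosh(1.0 + squared_gap / (2.0 * z1.imag * z2.imag))
```

For two points on the same vertical line, e.g. `dist_h2(1 + 2j, 1 + 6j)`, the value reduces to |log(6/2)| = log 3, in agreement with the special case stated above.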

12.2.2 Geodesics

The geodesics of $\mathcal{H}^2$ are the vertical lines, $VL(a) = \{z \in \mathcal{H}^2 \mid \Re(z) = a\}$, and the
semi-circles in $\mathcal{H}^2$ which meet the horizontal axis $\Im(z) = 0$ orthogonally, $SC_r(a) =
\{z \in \mathcal{H}^2 \mid |z - z'| = r;\ \Re(z') = a \text{ and } \Im(z') = 0\}$; see Fig. 12.3a. Thus given any pair
$z_1, z_2 \in \mathcal{H}^2$, there is a unique geodesic connecting them, parameterized for instance
in polar coordinates by the angle, i.e.,

$$\omega(t) = a + r e^{it} = (a + r \cos t) + i(r \sin t), \quad \gamma_1 \le t \le \gamma_2 \tag{12.7}$$

where $z_1 = a + r e^{i\gamma_1}$ and $z_2 = a + r e^{i\gamma_2}$.
A more useful expression of the geodesics involves explicitly the Cartesian coordinates
of the pair of points. First, given the pair $z_1, z_2 \in \mathcal{H}^2$, $x_1 \neq x_2$, the semi-circle
orthogonal to the x-axis connecting them has a center $c = (a_{1\frown 2}, 0)$ and a radius $r_{1\frown 2}$,
$(z_1, z_2) \mapsto SC_{r_{1\frown 2}}(a_{1\frown 2})$, where

$$a_{1\frown 2} = \frac{x_2^2 - x_1^2 + y_2^2 - y_1^2}{2(x_2 - x_1)}; \quad r_{1\frown 2} = \sqrt{(x_1 - a_{1\frown 2})^2 + y_1^2} = \sqrt{(x_2 - a_{1\frown 2})^2 + y_2^2}. \tag{12.8}$$

Fig. 12.3 a Geodesics of H2 : z1 and z2 are connected by a unique semi-circle; the geodesic between
z2 and z3 is a segment of vertical line. b Hyperbolic polar coordinates

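Then, the unique geodesic parameterized by the length, $t \mapsto \omega(z_1, z_2; t)$, $\omega : [0, 1] \to \mathcal{H}^2$,
joining two points $z_1 = x_1 + iy_1$ and $z_2 = x_2 + iy_2$ such that $\omega(z_1, z_2; 0) = z_1$ and
$\omega(z_1, z_2; 1) = z_2$ is given by

$$\omega(z_1, z_2; t) = \begin{cases} x_1 + i e^{\varphi t + t_0} & \text{if } x_1 = x_2 \\ \big[ r \tanh(\varphi t + t_0) + a \big] + i \dfrac{r}{\cosh(\varphi t + t_0)} & \text{if } x_1 \neq x_2 \end{cases} \tag{12.9}$$

with a and r given in (12.8) and where for $x_1 = x_2$, $t_0 = \log(y_1)$, $\varphi = \log \frac{y_2}{y_1}$, and for
$x_1 \neq x_2$

$$t_0 = \cosh^{-1}\Big(\frac{r}{y_1}\Big) = \sinh^{-1}\Big(\frac{x_1 - a}{y_1}\Big), \quad \varphi = \log\Bigg( \frac{y_1}{y_2} \cdot \frac{r + \sqrt{r^2 - y_2^2}}{r + \sqrt{r^2 - y_1^2}} \Bigg).$$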

If we take the parameterized smooth curve $\omega(t) = x(t) + iy(t)$, where x(t) and
y(t) are continuously differentiable for $a \le t \le b$, then the hyperbolic length along
the curve is determined by integrating the metric (12.3) as:

$$L(\omega) = \int_\omega ds = \int_a^b |\dot{\omega}(t)|_\omega\, dt = \int_a^b \frac{\sqrt{\dot{x}(t)^2 + \dot{y}(t)^2}}{y(t)}\, dt = \int_a^b \frac{|\dot{z}(t)|}{y(t)}\, dt.$$

Note that this expression is independent of the choice of parameter. Hence, using
the polar angle parametrization (12.7), we obtain an alternative expression of the
geodesic distance given by

$$\mathrm{dist}_{\mathcal{H}^2}(z_1, z_2) = \inf_\omega L(\omega) = \int_{\gamma_1}^{\gamma_2} \frac{r}{r \sin t}\, dt = \Big| \log \cot\frac{\gamma_2}{2} - \log \cot\frac{\gamma_1}{2} \Big|$$

which is independent of r and consequently, as described below, dilation is an
isometry.

Fig. 12.4 a Example of interpolation of 5 points in $\mathcal{H}^2$ between points $z_1 = -6.3 + i2.6$ (in red)
and $z_2 = 3.5 + i0.95$ (in blue) using their geodesic $t \mapsto \omega(z_1, z_2; t)$, with t = 0.2, 0.4, 0.5, 0.6,
0.8 (axes: x = Re(z), y = Im(z)). The average point (in green) corresponds just to $\omega(z_1, z_2; 0.5) = 0.89 + i4.6$. b Original (in
red and blue) univariate Gaussian pdfs and corresponding interpolated ones (axes: x, $f_{\mu,\sigma^2}(x)$)

Remark Interpolation between two univariate normal distributions. Using the closed-form
expression of geodesics $t \mapsto \omega(z_1, z_2; t)$, given in (12.9), it is possible to
compute the average univariate Gaussian pdf between $N(\mu_1, \sigma_1^2)$ and $N(\mu_2, \sigma_2^2)$,
with $(\mu_k = \sqrt{2} x_k, \sigma_k = y_k)$, by taking t = 0.5. More generally, we can interpolate a
series of distributions between them by discretizing t between 0 and 1. An example of
such a method is given in Fig. 12.4. We note in particular that the average Gaussian pdf
can have a variance bigger than $\sigma_1^2$ and $\sigma_2^2$. We note also that, due to the “logarithmic
scale” of the imaginary axis, equally spaced points in t do not have equal Euclidean
arc-length in the semi-circle.
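
The interpolation described in this remark can be sketched as follows. The code is an illustration under stated assumptions rather than the authors' implementation: the names `geodesic_point` and `interpolate_gaussians` are hypothetical, and the arc-length coordinate of each endpoint on the semi-circle is obtained with $\sinh^{-1}((x - a)/y)$, which matches the second expression of $t_0$ in (12.9) and is valid on both sides of the center.

```python
import math

def geodesic_point(z1, z2, t):
    """Point at fraction t in [0, 1] of the geodesic of H^2 from z1 to z2."""
    x1, y1, x2, y2 = z1.real, z1.imag, z2.real, z2.imag
    if x1 == x2:
        # vertical line: the imaginary part is interpolated on a logarithmic scale
        return complex(x1, y1 * (y2 / y1) ** t)
    # center a and radius r of the semi-circle through z1 and z2, as in (12.8)
    a = (x2 ** 2 - x1 ** 2 + y2 ** 2 - y1 ** 2) / (2.0 * (x2 - x1))
    r = math.hypot(x1 - a, y1)
    s1 = math.asinh((x1 - a) / y1)  # arc-length coordinate of z1
    s2 = math.asinh((x2 - a) / y2)  # arc-length coordinate of z2
    s = (1.0 - t) * s1 + t * s2
    return complex(a + r * math.tanh(s), r / math.cosh(s))

def interpolate_gaussians(mu1, sigma1, mu2, sigma2, t):
    """Interpolated Gaussian (mu, sigma) using z = mu / sqrt(2) + i sigma."""
    z = geodesic_point(complex(mu1 / math.sqrt(2.0), sigma1),
                       complex(mu2 / math.sqrt(2.0), sigma2), t)
    return math.sqrt(2.0) * z.real, z.imag
```

Applied to the two points of Fig. 12.4, `geodesic_point(-6.3 + 2.6j, 3.5 + 0.95j, 0.5)` gives a point close to the reported average 0.89 + i4.6, whose imaginary part (standard deviation) exceeds those of both endpoints.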

12.2.3 Hyperbolic Polar Coordinates

The position of point z = x + iy in H2 can be given either in terms of Cartesian coor-


dinates (x, y) or by means of polar hyperbolic coordinates (β, ξ), where β represents
the distance of the point from the origin OH2 = (0, 1) and ξ represents the slope of
the tangent in OH2 to the geodesic (i.e., semi-circle) joining the point (x, y) with the
origin. The formulas which relate the hyperbolic coordinates (β, ξ) to the Cartesian
ones (x, y) are [11]
⎫ ⎫
sinh β cos ξ
x= cosh β−sinh β sin ξ , β>0 β = distH2 (OH2 , z)
x 2 +y2 −1 (12.10)
y= cosh β−sinh β sin ξ ,
1
− η2 < ξ < η
2 ξ = arctan 2x

We notice that the center of the geodesic passing through (x, y) from OH2 has
Cartesian coordinates given by (tan ξ, 0); see Fig. 12.3b.
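A minimal sketch of the two conversions in (12.10) follows. The function names are hypothetical, and the sketch is restricted to points with x > 0: with β > 0 and cos ξ > 0 the forward formulas always return a positive abscissa, so only such points round-trip as written.

```python
import math

def cartesian_to_polar(x, y):
    """(x, y) with x > 0, y > 0  ->  (beta, xi) following (12.10)."""
    # cosh(dist(i, z)) = (x^2 + y^2 + 1) / (2 y) in the upper half-plane.
    beta = math.acosh((x * x + y * y + 1.0) / (2.0 * y))
    xi = math.atan((x * x + y * y - 1.0) / (2.0 * x))
    return beta, xi

def polar_to_cartesian(beta, xi):
    """(beta, xi)  ->  (x, y) following (12.10)."""
    denom = math.cosh(beta) - math.sinh(beta) * math.sin(xi)
    return math.sinh(beta) * math.cos(xi) / denom, 1.0 / denom

beta, xi = cartesian_to_polar(0.5, 2.0)
x_back, y_back = polar_to_cartesian(beta, xi)
geodesic_centre = math.tan(xi)   # abscissa of the centre of the geodesic through i and z
```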
12 Morphological Processing of Univariate Gaussian Distribution-Valued Images 339

12.2.4 Invariance and Isometric Symmetry

Let the projective special linear group be defined by $PSL(2, \mathbb{R}) = SL(2, \mathbb{R})/\{\pm I\}$, where
the special linear group $SL(2, \mathbb{R})$ consists of 2 × 2 matrices with real entries whose
determinant equals +1, i.e.,

\[
g \in SL(2, \mathbb{R}) : \quad g = \begin{pmatrix} a & b \\ c & d \end{pmatrix}, \quad ad - bc = 1;
\]

and I denotes the identity matrix. This defines the group of Möbius transformations
$M_g : \mathbb{H}^2 \to \mathbb{H}^2$ by setting for each $g \in SL(2, \mathbb{R})$,

\[
z \mapsto M_g(z) = \begin{pmatrix} a & b \\ c & d \end{pmatrix} \cdot z = \frac{az + b}{cz + d}
= \frac{ac|z|^2 + bd + (ad + bc)\,\Re(z) + i\,\Im(z)}{|cz + d|^2},
\]

such that $\Im\big(M_g(z)\big) = y(ad - bc) / \big((cx + d)^2 + (cy)^2\big) > 0$. The inverse map is
easily computed, i.e., $z \mapsto M_g^{-1}(z) = (dz - b)/(-cz + a)$. Hence Möbius transformations
are well defined on $\mathbb{H}^2$ and map $\mathbb{H}^2$ to $\mathbb{H}^2$ homeomorphically.
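The Lie group $PSL(2, \mathbb{R})$ acts on the upper half-plane by preserving the hyperbolic
distance, i.e.,

\[
\mathrm{dist}_{\mathbb{H}^2}(M_g(z_1), M_g(z_2)) = \mathrm{dist}_{\mathbb{H}^2}(z_1, z_2), \quad \forall g \in SL(2, \mathbb{R}),\ \forall z_1, z_2 \in \mathbb{H}^2.
\]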
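The invariance of the hyperbolic distance can be checked numerically. The sketch below uses an arbitrarily chosen matrix with unit determinant; the helper names are hypothetical, and the distance uses the standard closed form for the upper half-plane.

```python
import math

def mobius(g, z):
    """Apply g = (a, b, c, d), with ad - bc = 1, to a point z of the upper half-plane."""
    a, b, c, d = g
    return (a * z + b) / (c * z + d)

def mobius_inverse(g, z):
    """Inverse map z -> (d z - b) / (-c z + a)."""
    a, b, c, d = g
    return (d * z - b) / (-c * z + a)

def hyperbolic_distance(z1, z2):
    return math.acosh(1.0 + abs(z1 - z2) ** 2 / (2.0 * z1.imag * z2.imag))

g = (2.0, 1.0, 3.0, 2.0)                      # determinant 2*2 - 1*3 = 1
z1, z2 = complex(-1.0, 0.5), complex(2.0, 3.0)
w1, w2 = mobius(g, z1), mobius(g, z2)
before = hyperbolic_distance(z1, z2)
after = hyperbolic_distance(w1, w2)           # equal to `before` up to rounding
```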

This includes three basic types of transformations: (1) translations $z \mapsto z + \tau$,
$\tau \in \mathbb{R}$; (2) scaling $z \mapsto \kappa z$, $\kappa \in \mathbb{R} \setminus \{0\}$; (3) inversion $z \mapsto z^{-1}$. More precisely,
any generic Möbius transformation $M_g(z)$, $c \neq 0$, can be decomposed
into the following four maps: $f_1(z) = z + d/c$, $f_2(z) = -1/z$, $f_3(z) = z(ad - bc)/c^2$,
$f_4(z) = z + a/c$, applied in this order, i.e., $M_g(z) = (f_4 \circ f_3 \circ f_2 \circ f_1)(z)$. If c = 0, we have
$M_g(z) = (h_3 \circ h_2 \circ h_1)(z)$, where $h_1(z) = az$, $h_2(z) = z + b$, $h_3(z) = z/d$. It can be
proved that Möbius transformations take circles to circles. Hence, given a circle
in the complex plane $\mathbb{C}$ of radius r and center c, denoted by $C_r(c)$, we have
the following mappings [27]: a translation $z \mapsto z + \tau$, such as the functions $f_1$,
$f_4$ and $h_2$, maps $C_r(c)$ to $C_r(c + \tau)$; a scaling $z \mapsto \kappa z$, such as the functions
$f_3$, $h_1$ and $h_3$, maps $C_r(c)$ to $C_{\kappa r}(\kappa c)$; for inversion $z \mapsto z^{-1}$, $C_r(c)$ maps to
$C_{r/|cz|}(-1/c)$.
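As a quick sanity check of the decomposition for c ≠ 0, the following sketch applies the four elementary maps one after the other, f1 first and f4 last, and compares the result with the direct formula. The numerical values are arbitrary and the function names are hypothetical.

```python
def mobius(a, b, c, d, z):
    return (a * z + b) / (c * z + d)

def decomposed(a, b, c, d, z):
    """Apply f1, then f2, then f3, then f4 (requires c != 0)."""
    z = z + d / c                       # f1: translation by d/c
    z = -1.0 / z                        # f2: inversion
    z = z * (a * d - b * c) / c ** 2    # f3: scaling by (ad - bc)/c^2
    return z + a / c                    # f4: translation by a/c

a, b, c, d = 2.0, 1.0, 3.0, 2.0
z = complex(0.3, 1.7)
direct = mobius(a, b, c, d, z)
stepwise = decomposed(a, b, c, d, z)
```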
Let $H \subset \mathbb{H}^2$ be a geodesic of the upper half-plane, which is described uniquely
by its endpoints in $\partial\mathbb{H}^2$. Then there exists a Möbius transformation $M_g$ such that $M_g$ maps
H bijectively to the imaginary axis, i.e., VL(0). If H is the vertical line VL(a), the
transformation is the translation $z \mapsto M_g(z) = z - a$. If H is the semi-circle $SC_r(a)$
with endpoints on the real axis $\psi_-, \psi_+ \in \mathbb{R}$, where $\psi_- = a - r$ and $\psi_+ = a + r$,
the map is given by $M_g(z) = \dfrac{z - \psi_-}{\psi_+ - z}$, such that $M_g(\psi_-) = 0$, $M_g(\psi_+) = \infty$ and
$M_g(a + ir) = i$.

The unit-speed geodesic going up vertically through the point z = i is given by

\[
\omega(t) = \begin{pmatrix} e^{t/2} & 0 \\ 0 & e^{-t/2} \end{pmatrix} \cdot i = i e^{t}.
\]

Because $PSL(2, \mathbb{R})$ acts transitively by isometries of the upper half-plane, this geodesic
is mapped into other geodesics through the action of $PSL(2, \mathbb{R})$. Thus, the
general unit-speed geodesic is given by

\[
\omega(t) = \begin{pmatrix} a & b \\ c & d \end{pmatrix} \begin{pmatrix} e^{t/2} & 0 \\ 0 & e^{-t/2} \end{pmatrix} \cdot i = \frac{a i e^{t} + b}{c i e^{t} + d}. \tag{12.11}
\]

12.2.5 Hyperbolic Circles and Balls

Let us consider a Euclidean circle of center $c = (x_c, y_c) \in \mathbb{H}^2$ and radius r in the
upper-half plane, defined as $C_r(c) = \{z = a + ib \in \mathbb{H}^2 \mid \sqrt{(x_c - a)^2 + (y_c - b)^2} = r\}$, such
that it is contained in the upper-half plane, i.e., $C_r(c) \subset \mathbb{H}^2$. The corresponding
hyperbolic circle $C_{\mathbb{H}^2, r_h}(c_h) = \{z \in \mathbb{H}^2 \mid \mathrm{dist}_{\mathbb{H}^2}(c_h, z) = r_h\}$ is geometrically equal to
$C_r(c)$ but its hyperbolic center $c_h = (x_h, y_h)$ and radius $r_h$ are given by

\[
c_h = \left(x_c, \sqrt{y_c^2 - r^2}\right); \qquad r_h = \tanh^{-1}\left(\frac{r}{y_c}\right).
\]

We note that the hyperbolic center is always below the Euclidean center. The inverse
equations are

\[
c = (x_c = x_h,\ y_c = y_h \cosh r_h); \qquad r = y_h \sinh r_h. \tag{12.12}
\]
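The correspondence between the Euclidean and hyperbolic descriptions of the same circle can be illustrated with a short sketch. The function names are hypothetical; the check samples a few points of the Euclidean circle and measures their hyperbolic distance to the computed hyperbolic center.

```python
import math

def euclid_to_hyperbolic(xc, yc, r):
    """Euclidean centre (xc, yc) and radius r < yc -> hyperbolic centre and radius."""
    return (xc, math.sqrt(yc * yc - r * r)), math.atanh(r / yc)

def hyperbolic_to_euclid(xh, yh, rh):
    """Inverse equations (12.12)."""
    return (xh, yh * math.cosh(rh)), yh * math.sinh(rh)

def hyperbolic_distance(z1, z2):
    return math.acosh(1.0 + abs(z1 - z2) ** 2 / (2.0 * z1.imag * z2.imag))

(xh, yh), rh = euclid_to_hyperbolic(1.0, 2.0, 1.0)
centre = complex(xh, yh)
# Points of the Euclidean circle C_1((1, 2)) at a few arbitrary angles.
samples = [complex(1.0 + math.cos(t), 2.0 + math.sin(t)) for t in (0.0, 1.0, 2.5, 4.0, 5.5)]
distances = [hyperbolic_distance(centre, z) for z in samples]   # all equal to rh
(xc_back, yc_back), r_back = hyperbolic_to_euclid(xh, yh, rh)
```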

Naturally, the hyperbolic ball of center $c_h$ and radius $r_h$ is defined by $B_{\mathbb{H}^2, r_h}(c_h) =
\{z \in \mathbb{H}^2 \mid \mathrm{dist}_{\mathbb{H}^2}(c_h, z) \leq r_h\}$. Let us consider a hyperbolic ball centered at the origin,
$B_{\mathbb{H}^2, r_h}(0, 1)$, parameterized by its boundary curve $\partial B$ in Euclidean coordinates:

\[
x = r \cos\gamma; \qquad y = b + r \sin\gamma
\]

where, using (12.12), we have $b = \cosh r_h$ and $r = \sinh r_h$. The length of the boundary
and the area of this ball are respectively given by [32]:

\[
L(\partial B) = \int_0^{2\pi} \frac{r}{b + r \sin\gamma}\, d\gamma = 2\pi \sinh r_h, \tag{12.13}
\]

\[
\mathrm{Area}(B) = \iint_B \frac{dx\,dy}{y^2} = \oint_{\partial B} \frac{dx}{y} = 2\pi(\cosh r_h - 1). \tag{12.14}
\]

Comparing with the values for a Euclidean ball, which has area $\pi r_h^2$ and boundary
length $2\pi r_h$, and considering the Taylor series
$\sinh r_h = r_h + \frac{r_h^3}{3!} + \frac{r_h^5}{5!} + \cdots$ and $\cosh r_h = 1 + \frac{r_h^2}{2!} + \frac{r_h^4}{4!} + \cdots$,
one can note that the hyperbolic space is much
larger than the Euclidean one. Curvature is defined through derivatives of the metric,
but the fact that infinitesimally the hyperbolic ball grows faster than the Euclidean
ball can be used as a measure of the curvature of the space at the origin (0, 1) [32]:

\[
K = \lim_{r_h \to 0} \frac{3\,[2\pi r_h - L(\partial B)]}{\pi r_h^3} = -1.
\]

Since there is an isometry that maps the neighborhood of any point to the neighborhood
of the origin, the curvature of hyperbolic space is constant and equal to −1.
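The boundary-length formula (12.13) and the limit defining K can be checked numerically. The sketch below is illustrative only: it integrates the integrand of (12.13) with a plain midpoint rule (a choice made here for simplicity) and evaluates the curvature quotient at a small but finite radius.

```python
import math

def boundary_length(rh, n=20000):
    """Midpoint-rule approximation of the integral in (12.13) for the ball of radius rh."""
    b, r = math.cosh(rh), math.sinh(rh)
    h = 2.0 * math.pi / n
    return sum(r / (b + r * math.sin((k + 0.5) * h)) for k in range(n)) * h

def curvature_estimate(rh):
    """Finite-radius value of 3 [2 pi rh - L] / (pi rh^3), with L = 2 pi sinh(rh)."""
    length = 2.0 * math.pi * math.sinh(rh)
    return 3.0 * (2.0 * math.pi * rh - length) / (math.pi * rh ** 3)

numeric = boundary_length(1.0)
closed_form = 2.0 * math.pi * math.sinh(1.0)
k_small = curvature_estimate(1e-2)     # close to -1, with a correction of order rh^2
```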
Remark Minimax center in $\mathbb{H}^2$. Finding the smallest circle that contains the whole
set of points $x_1, x_2, \ldots, x_N$ in the Euclidean plane is a classical problem in computational
geometry, called the minimum enclosing circle (MEC). It is also relevant
for statistical estimation, since the unique center of the circle $c^\infty$ (called
1-center or minimax center) is defined as the $L^\infty$ center of mass, i.e., for $\mathbb{R}^2$,
$c^\infty = \arg\min_{x \in \mathbb{R}^2} \max_{1 \leq i \leq N} \|x_i - x\|_2$. Computing the smallest enclosing sphere
in Euclidean spaces is intractable in high dimensions, but efficient approximation
algorithms have been proposed. The Bădoiu and Clarkson algorithm [8] leads to a
fast and simple approximation (of known precision $\nu$ after a given number of iterations
$\lceil 1/\nu^2 \rceil$, using the notion of core-set, but independent of the dimensionality n). The
computation of the minimax center is particularly relevant in information geometry
(smallest enclosing information disk [25]) and has been considered for hyperbolic
models such as the Klein disk, using a Riemannian extension of the Bădoiu and Clarkson
algorithm [2], which only requires a closed form of the geodesics. Figure 12.5 depicts
an example of minimax center computation using the Bădoiu and Clarkson algorithm for
a set of univariate Gaussian pdfs represented in $\mathbb{H}^2$. We note that, using the property
of circle preservation, the computation of the minimal enclosing hyperbolic circle of a
given set of points $Z = \{z_k\}_{1 \leq k \leq K}$, $z_k \in \mathbb{H}^2$, denoted $MEC_{\mathbb{H}^2}(Z)$, is equivalent to
computing the corresponding minimal enclosing circle MEC(Z) if and only if we
have $MEC(Z) \subset \mathbb{H}^2$. This is the case for the example given in Fig. 12.5.

12.3 Fisher Information Metric and α-Order Entropy Metric of Univariate Normal Distributions

In information geometry, the Fisher information metric is a particular Riemannian
metric which can be associated to a smooth manifold whose points are probability
measures defined on a common probability space [3, 4]. It can be obtained as the
infinitesimal form of the Kullback–Leibler divergence (relative entropy). An alternative
formulation is obtained by computing the negative of the Hessian of the Shannon
entropy.

[Figure 12.5: panel (a) plots y = Im(z) against x = Re(z); panel (b) plots f_{μ,σ²}(x) against x.]

Fig. 12.5 a Example of minimax center $(x_h, y_h)$ (red ×) of a set of nine points $Z = \{z_k\}_{1 \leq k \leq 9}$ in
$\mathbb{H}^2$ (original points ∗ in black); the minimal enclosing circle $MEC_{\mathbb{H}^2}(Z)$ is also depicted (in red).
b Corresponding minimax center Gaussian $N(\mu = \sqrt{2}\,x_h, \alpha^2 = y_h^2)$ of the set of nine univariate Gaussian
pdfs, $N_k(\mu_k, \alpha_k^2)$, $1 \leq k \leq 9$

Given a univariate probability distribution $p(x|\gamma)$, $x \in X$, it can be viewed as a
point on a statistical manifold with coordinates given by $\gamma = (\gamma_1, \gamma_2, \ldots, \gamma_n)$. The
Fisher information matrix then takes the form:

\[
\big(g_{kl}(\gamma)\big)_{k,l=1,2} = \int_X p(x, \gamma)\, \frac{\partial \log p(x, \gamma)}{\partial \gamma_k}\, \frac{\partial \log p(x, \gamma)}{\partial \gamma_l}\, dx.
\]

The corresponding positive definite form

\[
ds^2(\gamma) = \sum_{k,l=1}^{n} g_{kl}(\gamma)\, d\gamma_k\, d\gamma_l
\]

is defined as the Fisher information metric. In the univariate Gaussian distributed case
$p(x|\gamma) \sim N(\mu, \alpha^2)$, we have in particular $\gamma = (\mu, \alpha)$ and it can be easily deduced
that the Fisher information matrix is

\[
\big(g_{kl}(\mu, \alpha)\big) = \begin{pmatrix} \dfrac{1}{\alpha^2} & 0 \\ 0 & \dfrac{2}{\alpha^2} \end{pmatrix} \tag{12.15}
\]

and the corresponding metric is

\[
ds^2\big((\mu, \alpha)\big) = \frac{d\mu^2 + 2\,d\alpha^2}{\alpha^2} = 2\alpha^{-2} \left( \left(\frac{d\mu}{\sqrt{2}}\right)^2 + d\alpha^2 \right). \tag{12.16}
\]

Therefore, the Fisher information geometry of univariate normal distributions is essentially
the geometry of the Poincaré upper-half plane with the following change of
variables:

\[
x = \mu/\sqrt{2}, \qquad y = \alpha.
\]

Hence, given two univariate Gaussian pdfs $N(\mu_1, \alpha_1^2)$ and $N(\mu_2, \alpha_2^2)$, the Fisher
distance between them, $\mathrm{dist}_{Fisher} : N \times N \to \mathbb{R}_+$, defined from the Fisher information
metric, is given by [9, 14]:

\[
\mathrm{dist}_{Fisher}\big((\mu_1, \alpha_1^2), (\mu_2, \alpha_2^2)\big) = \sqrt{2}\, \mathrm{dist}_{\mathbb{H}^2}\left( \frac{\mu_1}{\sqrt{2}} + i\alpha_1,\ \frac{\mu_2}{\sqrt{2}} + i\alpha_2 \right). \tag{12.17}
\]

The change of variable also implies that the geodesics in the hyperbolic Fisher
space of normal distributions are half-lines and half-ellipses orthogonal at $\alpha = 0$,
with eccentricity $1/\sqrt{2}$.
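Equation (12.17) translates directly into a few lines of code. The sketch below is a minimal illustration with hypothetical function names; the arguments `s1` and `s2` stand for the standard deviations, and the hyperbolic distance uses the standard closed form for the upper half-plane.

```python
import math

def hyperbolic_distance(z1, z2):
    return math.acosh(1.0 + abs(z1 - z2) ** 2 / (2.0 * z1.imag * z2.imag))

def fisher_distance(mu1, s1, mu2, s2):
    """Fisher distance (12.17) between N(mu1, s1^2) and N(mu2, s2^2)."""
    z1 = complex(mu1 / math.sqrt(2.0), s1)
    z2 = complex(mu2 / math.sqrt(2.0), s2)
    return math.sqrt(2.0) * hyperbolic_distance(z1, z2)

same_mean = fisher_distance(0.0, 1.0, 0.0, 3.0)     # vertical geodesic: sqrt(2) * log(3)
shifted = fisher_distance(0.0, 1.0, 2.0, 1.0)
rescaled = fisher_distance(0.0, 2.0, 4.0, 2.0)      # same pair with mean and deviation doubled
```

Multiplying both means and both standard deviations by the same positive factor is a scaling of the half-plane, so `shifted` and `rescaled` coincide.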
The canonical approach can be generalized according to the Burbea and Rao geometric
framework [9], which is based on replacing the Shannon entropy by the notion of
τ-order entropy, whose associated Hessian metric leads to an extended large class of
information metric geometries. Focusing on the particular case of univariate normal
distributions, $p(x|\gamma) \sim N(\mu, \alpha^2)$, we consider again points in the upper half-plane,
$z = x + iy \in \mathbb{H}^2$, and for a given $\tau > 0$ the τ-order entropy metric is given by [9]:

\[
\begin{aligned}
& x = [A(\tau)]^{-1/2} \mu, \qquad y = \alpha;\\
& ds_\tau^2 = B(\tau)\, y^{-(\tau+1)} (dx^2 + dy^2);
\end{aligned}
\tag{12.18}
\]

where

\[
A(\tau) = (\tau^{1/2} - \tau^{-1/2})^2 + 2\tau^{-1}; \qquad B(\tau) = \tau^{-3/2} (2\pi)^{(1-\tau)/2} A(\tau); \qquad \tau > 0. \tag{12.19}
\]

The metric in (12.18) constitutes a Kähler metric on $\mathbb{H}^2$ and when $\tau = 1$
reduces to the Poincaré metric (12.3). Its Gaussian curvature is $K_\tau(z) = -(\tau + 1)\,[B(\tau)]^{-1}\, y^{\tau-1}$,
being always negative (hyperbolic geometry). In particular, for $\tau = 1$
we recover the particular constant case $K_1(z) = -1$.
The geodesics of the Burbea-Rao τ-order entropy metric can be written in
parametric polar form as [9]:

\[
\omega(\gamma) = x(\gamma) + i\,y(\gamma), \qquad 0 < \gamma < \pi, \tag{12.20}
\]

with

\[
\begin{aligned}
x(\gamma) &= a + r^{1/\kappa} F_{1/\kappa}(\gamma),\\
y(\gamma) &= r^{1/\kappa} \sin^{1/\kappa} \gamma,
\end{aligned}
\tag{12.21}
\]

where

\[
\kappa = (\tau + 1)/2, \quad r > 0, \quad a \in \mathbb{R}, \qquad
F_\omega(\gamma) = -\omega \int_{\pi/2}^{\gamma} \sin^{\omega} t\, dt.
\]

Figure 12.6 shows examples of the geodesics from the Burbea-Rao τ-order entropy
metric for a = 0, r = 1 and τ = 0.01, 0.5, 1, 5 and 20.

[Figure 12.6: plot of the Burbea-Rao entropy geodesics in the upper half-plane, one curve per order (0.01, 0.5, 1, 5 and 20).]

Fig. 12.6 Examples of the geodesics from the Burbea-Rao τ-order entropy metric
obtained using (12.20) and (12.21) for a = 0, r = 1 and τ = 0.01, 0.5, 1, 5 and 20

By integration of the metric, one obtains the Burbea-Rao τ-order entropy
geodesic distance for $z_1, z_2 \in \mathbb{H}^2$ [9]:

\[
\mathrm{dist}_{\mathbb{H}^2}(z_1, z_2; \tau) = \frac{2\sqrt{B(\tau)}}{|1 - \tau|}
\left| \frac{x_1 - x_2}{r} + y_1^{(1-\tau)/2} \sqrt{1 - r^{-2} y_1^{\tau+1}} - y_2^{(1-\tau)/2} \sqrt{1 - r^{-2} y_2^{\tau+1}} \right|,
\tag{12.22}
\]

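which unfortunately depends on the value of r. This quantity should be determined
by solving a system of three nonlinear equations for the unknown variables $\gamma_1$, $\gamma_2$
and r:

\[
\begin{cases}
x_1 - x_2 = r^{1/\kappa} \left( F_{1/\kappa}(\gamma_1) - F_{1/\kappa}(\gamma_2) \right),\\
y_1 = r^{1/\kappa} \sin^{1/\kappa} \gamma_1,\\
y_2 = r^{1/\kappa} \sin^{1/\kappa} \gamma_2.
\end{cases}
\]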

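An alternative solution to compute a closed-form distance between two univariate
normal distributions $N(\mu_1, \alpha_1^2)$ and $N(\mu_2, \alpha_2^2)$ according to the Burbea-Rao
τ-deformed geometry is based on the τ-order Hellinger distance [9]:

\[
\mathrm{dist}_{Hellinger}\big((\mu_1, \alpha_1^2), (\mu_2, \alpha_2^2); \tau\big) =
\frac{2 (2\pi)^{(1-\tau)/4}}{\tau^{5/4}}
\left[ \left( \alpha_1^{(1-\tau)/2} - \alpha_2^{(1-\tau)/2} \right)^2
+ 2 (\alpha_1 \alpha_2)^{(1-\tau)/2}
\left( 1 - \left( \frac{2\alpha_1\alpha_2}{\alpha_1^2 + \alpha_2^2} \right)^{1/2}
\exp\left( \frac{-\tau(\mu_1 - \mu_2)^2}{4(\alpha_1^2 + \alpha_2^2)} \right) \right) \right]^{1/2}.
\tag{12.23}
\]

In particular, when $\tau = 1$ this formula reduces to

\[
\mathrm{dist}_{Hellinger}\big((\mu_1, \alpha_1^2), (\mu_2, \alpha_2^2)\big) =
2^{3/2} \left( 1 - \left( \frac{2\alpha_1\alpha_2}{\alpha_1^2 + \alpha_2^2} \right)^{1/2}
\exp\left( \frac{-(\mu_1 - \mu_2)^2}{4(\alpha_1^2 + \alpha_2^2)} \right) \right)^{1/2}.
\tag{12.24}
\]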
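Since the τ-order Hellinger distance is available in closed form, it is easy to evaluate. The sketch below implements (12.23) and (12.24) as two separate functions (hypothetical names; `s1` and `s2` denote the standard deviations) and checks that the general formula agrees with the special case at τ = 1.

```python
import math

def hellinger_tau(mu1, s1, mu2, s2, tau):
    """tau-order Hellinger distance, following (12.23)."""
    e = (1.0 - tau) / 2.0
    var_sum = s1 * s1 + s2 * s2
    overlap = math.sqrt(2.0 * s1 * s2 / var_sum) * math.exp(
        -tau * (mu1 - mu2) ** 2 / (4.0 * var_sum))
    inner = (s1 ** e - s2 ** e) ** 2 + 2.0 * (s1 * s2) ** e * (1.0 - overlap)
    prefactor = 2.0 * (2.0 * math.pi) ** ((1.0 - tau) / 4.0) / tau ** 1.25
    return prefactor * math.sqrt(inner)

def hellinger(mu1, s1, mu2, s2):
    """Special case tau = 1, following (12.24)."""
    var_sum = s1 * s1 + s2 * s2
    overlap = math.sqrt(2.0 * s1 * s2 / var_sum) * math.exp(
        -(mu1 - mu2) ** 2 / (4.0 * var_sum))
    return 2.0 ** 1.5 * math.sqrt(1.0 - overlap)

d_general = hellinger_tau(0.0, 1.0, 1.0, 2.0, 1.0)
d_special = hellinger(0.0, 1.0, 1.0, 2.0)
```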

12.4 Endowing H2 with Partial Ordering and Its Complete (inf-semi) Lattice Structures

The notion of ordering invariance in the Poincaré upper-half plane was considered
in the Soviet literature [19, 20]. Ordering invariance with respect to a simple transitive
subgroup T of the group of motions was studied, i.e., the group T consists of
transformations t of the form:

\[
z = x + iy \mapsto z' = (\lambda x + \tau) + i\lambda y,
\]

where $\lambda > 0$ and $\tau$ are real numbers. We call T the Guts group. We note that T
is just the composition of a translation and a scaling in $\mathbb{H}^2$, and consequently, T is
an isometric group (see Sect. 12.2.4).
Nevertheless, to the best of our knowledge, the formulation of partial orders
on the Poincaré upper-half plane has not been widely studied. We introduce here partial
orders in $\mathbb{H}^2$ and study invariance properties to transformations of the Guts group or to
other subgroups of $SL(2, \mathbb{R})$ (Möbius transformations).

12.4.1 Upper Half-Plane Product Ordering

A real vector space E on which a partial order $\leq$ is given (reflexive, transitive,
antisymmetric) is called an ordered vector space if (i) $x, y, z \in E$ and $x \leq y$ implies
$x + z \leq y + z$; (ii) $x, y \in E$, $0 \leq \lambda \in \mathbb{R}$, and $x \leq y$ implies $\lambda x \leq \lambda y$. Let us consider
that the partial order $\leq$ is the product order. An element $x \in E$ with $x \geq 0$ (it means
that all the vector components are positive) is said to be positive. The set $E_+ =
\{x \in E \mid x \geq 0\}$ of all positive elements is called the cone of positive elements. It
turns out that the order of an ordered vector space is determined by the set of positive
elements. Let E be a vector space and $C \subset E$ a cone. Then, $x \leq y$ if $y - x \in C$
defines an order on E such that E is an ordered vector space with $E_+ = C$. The
notion of partially ordered vector space is naturally extended to partially ordered
groups [18]. An ordered vector space E is called a vector lattice $(E, \leq)$ if $\forall x, y \in E$
there exist the join (supremum or least upper bound) $x \vee y = \sup(x, y) \in E$ and
the meet (infimum or greatest lower bound) $x \wedge y = \inf(x, y) \in E$. A vector lattice
is also called a Riesz space.
Thus, we can introduce a similar order structure in $\mathbb{H}^2$ as a product order of $\mathbb{R} \times \mathbb{R}_+$.
To achieve this goal, we need to define, on the one hand, the equivalent of an
order-preserving linear combination. More precisely, given three points $z_1, z_2, z_3 \in \mathbb{H}^2$
and a nonnegative scalar $0 \leq \lambda \in \mathbb{R}$, we say that

\[
z_1 \leq_{\mathbb{H}^2} z_2 \quad \text{implies} \quad \lambda \boxdot z_1 \boxplus z_3 \leq_{\mathbb{H}^2} \lambda \boxdot z_2 \boxplus z_3,
\]

where we have introduced the following pair of operations in $\mathbb{H}^2$:

\[
\lambda \boxdot z = \lambda x + i y^{\lambda} \quad \text{and} \quad z_1 \boxplus z_2 = (x_1 + x_2) + i (y_1 y_2).
\]

On the other hand, the corresponding partial ordering $\leq_{\mathbb{H}^2}$ will be determined by
the positive cone in $\mathbb{H}^2$ defined by $\mathbb{H}^2_+ = \{z \in \mathbb{H}^2 \mid x \geq 0 \text{ and } y \geq 1\}$, i.e.,

\[
z_1 \leq_{\mathbb{H}^2} z_2 \iff z_2 \boxminus z_1 \in \mathbb{H}^2_+, \tag{12.25}
\]

with $z_2 \boxminus z_1 = (x_2 - x_1) + i (y_2 y_1^{-1})$. According to this partial ordering, the corresponding
supremum and infimum for any pair of points $z_1$ and $z_2$ in $\mathbb{H}^2$ are formulated
as follows

\[
z_1 \vee_{\mathbb{H}^2} z_2 = (x_1 \vee x_2) + i \exp\left( \log(y_1) \vee \log(y_2) \right), \tag{12.26}
\]

\[
z_1 \wedge_{\mathbb{H}^2} z_2 = (x_1 \wedge x_2) + i \exp\left( \log(y_1) \wedge \log(y_2) \right). \tag{12.27}
\]

Therefore $\mathbb{H}^2$ endowed with partial ordering (12.25) is a complete lattice, but it is
not bounded since the greatest (or top) and least (or bottom) elements are in the
boundary $\partial\mathbb{H}^2$. We also have a duality between supremum and infimum, i.e.,

\[
z_1 \vee_{\mathbb{H}^2} z_2 = \complement\left( \complement z_1 \wedge_{\mathbb{H}^2} \complement z_2 \right); \qquad
z_1 \wedge_{\mathbb{H}^2} z_2 = \complement\left( \complement z_1 \vee_{\mathbb{H}^2} \complement z_2 \right),
\]

with respect to the following involution

\[
z \mapsto \complement z = (-1) \boxdot z = -x + i y^{-1}. \tag{12.28}
\]

We easily note that, in fact, $\exp(\log(y_1) \vee \log(y_2)) = y_1 \vee y_2$ and similarly for the
infimum, since the logarithm is an isotone mapping (i.e., monotone increasing) and
therefore order-preserving. Therefore, the partial ordering $\leq_{\mathbb{H}^2}$ does not involve any
particular structure for $\mathbb{H}^2$ and does not take into account the Riemannian nature of
the upper half-plane. Accordingly, we note also that the partial ordering $\leq_{\mathbb{H}^2}$ is
invariant to the Guts group of transforms, i.e.,

\[
z_1 \leq_{\mathbb{H}^2} z_2 \iff T(z_1) \leq_{\mathbb{H}^2} T(z_2).
\]
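The product ordering, its supremum and infimum, and the two properties stated above (duality under the involution (12.28) and invariance under Guts-group maps) can be illustrated with a short sketch. All function names are hypothetical, and points are represented as Python complex numbers with positive imaginary part.

```python
def leq(z1, z2):
    """Product ordering: compare real and imaginary parts separately."""
    return z1.real <= z2.real and z1.imag <= z2.imag

def lattice_sup(z1, z2):
    return complex(max(z1.real, z2.real), max(z1.imag, z2.imag))

def lattice_inf(z1, z2):
    return complex(min(z1.real, z2.real), min(z1.imag, z2.imag))

def involution(z):
    """Involution (12.28): z -> -x + i / y."""
    return complex(-z.real, 1.0 / z.imag)

def guts(z, lam, tau):
    """Guts-group map z -> (lam * x + tau) + i * lam * y, with lam > 0."""
    return complex(lam * z.real + tau, lam * z.imag)

z1, z2 = complex(-1.0, 2.0), complex(0.5, 0.25)
s = lattice_sup(z1, z2)
dual_s = involution(lattice_inf(involution(z1), involution(z2)))   # equals s
moved_s = lattice_sup(guts(z1, 2.0, 3.0), guts(z2, 2.0, 3.0))      # equals guts(s, 2, 3)
```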

12.4.2 Upper Half-Plane Symmetric Ordering

Let us consider a symmetrization of the product ordering with respect to the origin
in the upper half-plane. Given any pair of points $z_1, z_2 \in \mathbb{H}^2$, we define the upper
half-plane symmetric ordering as

\[
z_1 \preceq_{\mathbb{H}^2} z_2 \iff
\begin{cases}
0 \leq x_1 \leq x_2 \ \text{and} \ 0 \leq \log(y_1) \leq \log(y_2), & \text{or}\\
x_2 \leq x_1 \leq 0 \ \text{and} \ 0 \leq \log(y_1) \leq \log(y_2), & \text{or}\\
x_2 \leq x_1 \leq 0 \ \text{and} \ \log(y_2) \leq \log(y_1) \leq 0, & \text{or}\\
0 \leq x_1 \leq x_2 \ \text{and} \ \log(y_2) \leq \log(y_1) \leq 0.
\end{cases}
\tag{12.29}
\]

The four conditions of this partial ordering entail that only points belonging to
the same quadrant of $\mathbb{H}^2$ can be ordered, where the four quadrants
$\{\mathbb{H}^2_{++}, \mathbb{H}^2_{-+}, \mathbb{H}^2_{--}, \mathbb{H}^2_{+-}\}$ are defined with respect to the origin $O_{\mathbb{H}^2} = (0, 1)$, which corresponds
to the pure imaginary complex $z_0 = i$. In other words, we can summarize the partial
ordering (12.29) by saying that if $z_1$ and $z_2$ belong to the same O-quadrant of $\mathbb{H}^2$
we have $z_1 \preceq_{\mathbb{H}^2} z_2 \iff |x_1| \leq |x_2|$ and $|\log(y_1)| \leq |\log(y_2)|$. Endowed with the
partial ordering (12.29), $\mathbb{H}^2$ becomes a partially ordered set (poset) where the bottom
element is $z_0$, but we notice that there is no top element. In addition, for any pair of
points $z_1$ and $z_2$, the infimum $\curlywedge_{\mathbb{H}^2}$ is given by

\[
z_1 \curlywedge_{\mathbb{H}^2} z_2 =
\begin{cases}
(x_1 \wedge x_2) + i(y_1 \wedge y_2) & \text{if } z_1, z_2 \in \mathbb{H}^2_{++}\\
(x_1 \vee x_2) + i(y_1 \wedge y_2) & \text{if } z_1, z_2 \in \mathbb{H}^2_{-+}\\
(x_1 \vee x_2) + i(y_1 \vee y_2) & \text{if } z_1, z_2 \in \mathbb{H}^2_{--}\\
(x_1 \wedge x_2) + i(y_1 \vee y_2) & \text{if } z_1, z_2 \in \mathbb{H}^2_{+-}\\
z_0 & \text{otherwise}
\end{cases}
\tag{12.30}
\]

The infimum (12.30) extends naturally to any finite set of points in $\mathbb{H}^2$, $Z =
\{z_k\}_{1 \leq k \leq K}$, and will be denoted by $\curlywedge_{\mathbb{H}^2} Z$. However, the supremum $z_1 \curlyvee_{\mathbb{H}^2} z_2$ is
not always defined; more precisely, it is defined if and only if $z_1$ and $z_2$ belong to the
same quadrant, i.e., similarly to (12.30), mutatis mutandis $\wedge$ by $\vee$, with the "otherwise"
case as "non existent". Consequently, the poset $(\mathbb{H}^2, \preceq_{\mathbb{H}^2})$ is only a complete
inf-semilattice. The fundamental property of such infimum (12.30) is its self-duality
with respect to the involution $\complement$ of (12.28), i.e.,

\[
z_1 \curlywedge_{\mathbb{H}^2} z_2 = \complement\left( \complement z_1 \curlywedge_{\mathbb{H}^2} \complement z_2 \right). \tag{12.31}
\]

Due to the strong dependency of the partial ordering $\preceq_{\mathbb{H}^2}$ on $O_{\mathbb{H}^2}$, it is
easy to see that such ordering is only invariant to transformations that do not move
points from one quadrant to another one. This is typically the case for mappings such as
$z \mapsto \lambda \boxdot z = \lambda x + i y^{\lambda}$, $\lambda > 0$.
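The quadrant-wise infimum (12.30) is simple to implement. The following sketch uses hypothetical names and represents points as complex numbers; quadrants are encoded by a pair of signs relative to the origin (0, 1), and points lying on a quadrant boundary are accepted by every quadrant that contains them, following the non-strict inequalities of (12.29).

```python
QUADRANTS = [(+1, +1), (-1, +1), (-1, -1), (+1, -1)]   # signs of (x, log y)

def in_quadrant(z, sx, sy):
    # log(y) >= 0 is equivalent to y >= 1, so no logarithm is needed.
    return sx * z.real >= 0.0 and sy * (z.imag - 1.0) >= 0.0

def symmetric_inf(z1, z2):
    """Infimum (12.30): the point closest to the origin, coordinate by coordinate."""
    for sx, sy in QUADRANTS:
        if in_quadrant(z1, sx, sy) and in_quadrant(z2, sx, sy):
            x = min(z1.real, z2.real) if sx > 0 else max(z1.real, z2.real)
            y = min(z1.imag, z2.imag) if sy > 0 else max(z1.imag, z2.imag)
            return complex(x, y)
    return complex(0.0, 1.0)            # bottom element z0 = i

def involution(z):
    """Involution (12.28): z -> -x + i / y."""
    return complex(-z.real, 1.0 / z.imag)

a, b = complex(2.0, 4.0), complex(1.0, 8.0)
same_quadrant = symmetric_inf(a, b)
different_quadrants = symmetric_inf(complex(2.0, 3.0), complex(-1.0, 3.0))
self_dual = involution(symmetric_inf(involution(a), involution(b)))   # property (12.31)
```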

12.4.3 Upper Half-Plane Polar Ordering


The previous order $\preceq_{\mathbb{H}^2}$ is only a partial ordering, and consequently, given any pair of
points $z_1$ and $z_2$, the infimum $z_1 \curlywedge_{\mathbb{H}^2} z_2$ can be different from $z_1$ and $z_2$. In addition,
the supremum is not always defined. Let us introduce a total ordering in $\mathbb{H}^2$ based on
hyperbolic polar coordinates, which also takes into account an ordering relationship
with respect to $O_{\mathbb{H}^2}$. Thus, given two points $z_1, z_2 \in \mathbb{H}^2$, the upper half-plane polar
ordering states

\[
z_1 \leq^{pol}_{\mathbb{H}^2} z_2 \iff
\begin{cases}
\beta_1 < \beta_2 & \text{or}\\
\beta_1 = \beta_2 \ \text{and} \ \tan\xi_1 \leq \tan\xi_2
\end{cases}
\tag{12.32}
\]

where $(\beta, \xi)$ are defined in Eq. (12.10). The polar supremum $z_1 \vee^{pol}_{\mathbb{H}^2} z_2$ and infimum
$z_1 \wedge^{pol}_{\mathbb{H}^2} z_2$ are naturally obtained from the order (12.32), and extend to any subset of points Z,
denoted by $\bigvee^{pol}_{\mathbb{H}^2} Z$ and $\bigwedge^{pol}_{\mathbb{H}^2} Z$. The total order $\leq^{pol}_{\mathbb{H}^2}$ leads to a complete lattice, bounded
from the bottom (i.e., the origin $O_{\mathbb{H}^2}$) but not from the top. Furthermore, as $\leq^{pol}_{\mathbb{H}^2}$ is
a total ordering, the supremum and the infimum will be either $z_1$ or $z_2$.
Polar total order is invariant to any Möbius transformation $M_g$ which preserves
the distance to the origin (isometry group) and more generally to isotone maps in
distance, i.e., $\beta(z_1) \leq \beta(z_2) \iff \beta(M_g(z_1)) \leq \beta(M_g(z_2))$, but which also preserves
the orientation order, i.e., the order on the polar angle. This is for instance the case of
orientation group SO(2) and the scaling maps $z \mapsto M_g(z) = \lambda z$, $0 < \lambda \in \mathbb{R}$.
We note also that instead of considering $O_{\mathbb{H}^2}$ as the origin, the polar hyperbolic
coordinates can be defined with respect to a different origin $z_0$ and consequently, the
total order is adapted to the new origin (i.e., the bottom element is just $z_0$).
One can replace in the polar ordering the distance $\mathrm{dist}_{\mathbb{H}^2}(O_{\mathbb{H}^2}, z)$ by the τ-order
Hellinger distance to obtain the total ordering $\leq^{\tau\text{-}pol}_{\mathbb{H}^2}$ parametrized by τ:

\[
z_1 \leq^{\tau\text{-}pol}_{\mathbb{H}^2} z_2 \iff
\begin{cases}
\mathrm{dist}_{Hellinger}(O_{\mathbb{H}^2}, z_1; \tau) < \mathrm{dist}_{Hellinger}(O_{\mathbb{H}^2}, z_2; \tau) & \text{or}\\
\mathrm{dist}_{Hellinger}(O_{\mathbb{H}^2}, z_1; \tau) = \mathrm{dist}_{Hellinger}(O_{\mathbb{H}^2}, z_2; \tau) \ \text{and} \ \tan\xi_1 \leq \tan\xi_2
\end{cases}
\tag{12.33}
\]

As we illustrate in Sect. 12.5, the "deformation" of the distance driven by τ can
significantly change the supremum and infimum of a set of points Z. Obviously,
the properties of invariance of $\leq^{\tau\text{-}pol}_{\mathbb{H}^2}$ are related to the isometries of the τ-order
Hellinger distance.

12.4.4 Upper Half-Plane Geodesic Ordering

As discussed above, there is a unique hyperbolic geodesic joining any pair of points.
Given two points $z_1, z_2 \in \mathbb{H}^2$ such that $x_1 \neq x_2$, let $SC_{r_{1\Phi 2}}(a_{1\Phi 2})$ be the semi-circle
defining their geodesic, where the center $a_{1\Phi 2}$ and the radius $r_{1\Phi 2}$ are given by
Eq. (12.8). Let us denote by $z_{1\Phi 2}$ the point of $SC_{r_{1\Phi 2}}(a_{1\Phi 2})$ having the maximal
imaginary part, i.e., its imaginary part is equal to the radius: $z_{1\Phi 2} = a_{1\Phi 2} + i r_{1\Phi 2}$.
The upper half-plane geodesic ordering $\preceq^{geo}_{\mathbb{H}^2}$ defines an order for points being in
the same half of their geodesic semi-circle as follows,

\[
z_1 \preceq^{geo}_{\mathbb{H}^2} z_2 \iff
\begin{cases}
a_{1\Phi 2} \leq x_1 < x_2 & \text{or}\\
x_2 < x_1 \leq a_{1\Phi 2}
\end{cases}
\tag{12.34}
\]

The property of transitivity of this partial ordering, i.e., $z_1 \preceq^{geo}_{\mathbb{H}^2} z_2$, $z_2 \preceq^{geo}_{\mathbb{H}^2} z_3 \Rightarrow
z_1 \preceq^{geo}_{\mathbb{H}^2} z_3$, holds for points belonging to the same geodesic. For two points in a
geodesic vertical line, $x_1 = x_2$, we have $z_1 \preceq^{geo}_{\mathbb{H}^2} z_2 \iff y_2 \leq y_1$. We note that
considering the duality with respect to the involution $\complement$ of (12.28), one has

\[
z_1 \preceq^{geo}_{\mathbb{H}^2} z_2 \iff \complement z_1 \succeq^{geo}_{\mathbb{H}^2} \complement z_2.
\]

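According to this partial ordering, we define the geodesic infimum, denoted by
$\curlywedge^{geo}_{\mathbb{H}^2}$, as the point on the geodesic joining $z_1$ and $z_2$ with maximal imaginary part,
i.e., for any $z_1, z_2 \in \mathbb{H}^2$, with $x_1 \neq x_2$, we have

\[
z_1 \curlywedge^{geo}_{\mathbb{H}^2} z_2 =
\begin{cases}
(x_1 \vee x_2) + i(y_1 \vee y_2) & \text{if } x_1, x_2 \leq a_{1\Phi 2}\\
(x_1 \wedge x_2) + i(y_1 \vee y_2) & \text{if } x_1, x_2 \geq a_{1\Phi 2}\\
z_{1\Phi 2} & \text{otherwise}
\end{cases}
\tag{12.35}
\]

If $x_1 = x_2$, we have that $z_1 \curlywedge^{geo}_{\mathbb{H}^2} z_2 = x_1 + i(y_1 \vee y_2)$. In any case, we have that
$\mathrm{dist}_{\mathbb{H}^2}(z_1, z_2) = \mathrm{dist}_{\mathbb{H}^2}(z_1, z_1 \curlywedge^{geo}_{\mathbb{H}^2} z_2) + \mathrm{dist}_{\mathbb{H}^2}(z_1 \curlywedge^{geo}_{\mathbb{H}^2} z_2, z_2)$. Intuitively, we notice
that the geodesic infimum is the point of the geodesic farthest from the real line.
We observe that if one attempts to define the geodesic supremum from the partial
ordering $\preceq^{geo}_{\mathbb{H}^2}$, the supremum is not defined for every pair of points, i.e., the
supremum of $z_1$ and $z_2$ is defined if and only if both points are in the same
half of their semi-circle. To tackle this limitation, we propose to define the geodesic
supremum $z_1 \curlyvee^{geo}_{\mathbb{H}^2} z_2$ by duality with respect to the involution $\complement$ of (12.28), i.e.,

\[
z_1 \curlyvee^{geo}_{\mathbb{H}^2} z_2 = \complement\left( \complement z_1 \curlywedge^{geo}_{\mathbb{H}^2} \complement z_2 \right) =
\begin{cases}
(x_1 \wedge x_2) + i(y_1 \wedge y_2) & \text{if } x_1, x_2 \leq a_{1\Phi 2}\\
(x_1 \vee x_2) + i(y_1 \wedge y_2) & \text{if } x_1, x_2 \geq a_{1\Phi 2}\\
\complement z_{1\Phi 2} & \text{otherwise}
\end{cases}
\tag{12.36}
\]

where $\complement z_{1\Phi 2}$ is the dual point associated to the semi-circle defined by the dual points
$\complement z_1$ and $\complement z_2$.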
Nevertheless, in order to have a structure of complete lattice for $(\mathbb{H}^2, \preceq^{geo}_{\mathbb{H}^2})$, it
is required that the infimum and the supremum of any set of points $Z = \{z_k\}_{1 \leq k \leq K}$,
with $K \geq 2$, are well defined. Namely, according to (12.35), the geodesic infimum
of Z, denoted $\bigwedge^{geo}_{\mathbb{H}^2} Z$, corresponds to the point $z_{inf}$ with the maximal imaginary part
on all possible geodesics joining any pair of points $z_n, z_m \in Z$. In geometric terms, it
means that among all these geodesics, there exists one which gives $z_{inf}$. Instead of
computing all the geodesics, we propose to define the infimum $\bigwedge^{geo}_{\mathbb{H}^2} Z$ as the point
$z_{inf} = a_{inf} + i r_{inf}$, where $a_{inf}$ is the center of the smallest semi-circle in $\mathbb{H}^2$, of radius
$r_{inf}$, which encloses all the points in the set Z. We have the following property

\[
\bigwedge\nolimits^{geo}_{\mathbb{H}^2} Z = z_{inf} \preceq^{geo}_{\mathbb{H}^2} z_k, \quad 1 \leq k \leq K,
\]

which geometrically means that the geodesic connecting $z_{inf}$ to any point $z_k$ of Z lies
always in one of the halves of the semi-circle defined by $z_{inf}$ and $z_k$.
In practice, the minimal enclosing semi-circle defining $z_{inf}$ can be easily computed
by means of the following algorithm based on the minimum enclosing Euclidean
circle MEC of a set of points: (1) Working on $\mathbb{R}^2$, define a set of points given, on
the one hand, by Z and, on the other hand, by $Z^*$, which corresponds to the reflected
points with respect to the x-axis (complex conjugate), i.e., points $Z = \{(x_k, y_k)\}$ and
points $Z^* = \{(x_k, -y_k)\}$, $1 \leq k \leq K$; (2) Compute $MEC(Z \cup Z^*) \mapsto C_r(c)$, in
such a way that, by the symmetric point configuration, we necessarily have the center
on the x-axis, i.e., $c = (x_c, 0)$; (3) The infimum $\bigwedge^{geo}_{\mathbb{H}^2} Z = z_{inf}$ is given by $z_{inf} = x_c + ir$.
Figure 12.7a–b gives an example of computation of the geodesic infimum from a set
of points in $\mathbb{H}^2$.
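The three steps above can be sketched in a few lines. Note that the block below does not call a general minimum-enclosing-circle routine: since the center of MEC(Z ∪ Z*) is known to lie on the x-axis, it instead minimizes over the center abscissa the largest distance to the points of Z, which is a one-dimensional convex problem solved here by ternary search. This substitution, the function names and the sample points are assumptions made for illustration.

```python
import math

def enclosing_semicircle(points, iterations=200):
    """Smallest semi-circle centred on the real axis that contains all points.

    points: iterable of complex numbers with positive imaginary part.
    Returns (a, r): abscissa of the centre and radius.
    """
    pts = list(points)

    def radius(a):
        # Radius needed by a circle centred at (a, 0) to cover every point.
        return max(math.hypot(z.real - a, z.imag) for z in pts)

    lo = min(z.real for z in pts)
    hi = max(z.real for z in pts)
    for _ in range(iterations):          # ternary search on a convex function
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if radius(m1) <= radius(m2):
            hi = m2
        else:
            lo = m1
    a = 0.5 * (lo + hi)
    return a, radius(a)

def geodesic_infimum(points):
    """z_inf = a_inf + i r_inf, the top of the minimal enclosing semi-circle."""
    a, r = enclosing_semicircle(points)
    return complex(a, r)

Z = [complex(-1.0, 1.0), complex(1.0, 1.0), complex(0.2, 0.5)]
z_inf = geodesic_infimum(Z)              # close to 0 + i*sqrt(2) for this set
```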
As for the case of two points, the geodesic supremum of Z is defined by duality
with respect to the involution $\complement$ of (12.28), i.e.,

\[
z_{sup} = \bigvee\nolimits^{geo}_{\mathbb{H}^2} Z = \complement\left( \bigwedge\nolimits^{geo}_{\mathbb{H}^2} \complement Z \right) = a_{sup} + i r_{sup}, \tag{12.37}
\]

with $a_{sup} = -x_c^{dual}$ and $r_{sup} = 1/r^{dual}$, where $SC_{r^{dual}}(x_c^{dual})$ is the minimal enclosing
semi-circle of the dual set of points $\complement Z$. An example of computing the geodesic
supremum $z_{sup}$ is also given in Fig. 12.7a–b. It is easy to see that the geodesic infimum
and supremum have the following properties for any $Z \subset \mathbb{H}^2$:
1. $z_{inf} \preceq^{geo}_{\mathbb{H}^2} z_{sup}$;
2. $\Im(z_{inf}) \geq \Im(z_k)$ and $\Im(z_{sup}) \leq \Im(z_k)$, $\forall z_k \in Z$;
3. $\bigwedge_{1 \leq k \leq K} \Re(z_k) < \{\Re(z_{inf}), \Re(z_{sup})\} < \bigvee_{1 \leq k \leq K} \Re(z_k)$.
The proofs are straightforward from the notion of minimal enclosing semi-circle and
the fact that $z_{sup}$ lies inside the semi-circle defined by $z_{inf}$.
Geodesic infimum and supremum being defined by minimal enclosing semi-circles,
their invariance properties are related to translation and scaling of points
in set Z as defined in Sect. 12.2.4, but not to inversion. This invariance domain just
corresponds to the Guts group of transformations, i.e.,

\[
\bigwedge\nolimits^{geo}_{\mathbb{H}^2} \{T(z_k)\}_{1 \leq k \leq K} = T\left( \bigwedge\nolimits^{geo}_{\mathbb{H}^2} \{z_k\}_{1 \leq k \leq K} \right).
\]
As we discussed in Sect. 12.3, we do not have an explicit algorithm to compute
the Burbea-Rao τ-order entropy geodesic and consequently, our framework based
on computing the minimum enclosing geodesic to define the infimum cannot be
extended to this general case. We can nevertheless consider the example depicted
in Fig. 12.8, where we have computed such smallest Burbea-Rao τ-order geodesic
enclosing the set of points Z. Indeed, the example is useful to identify the limit cases
with respect to τ. In fact, we note that if $\tau \to 0$, the corresponding τ-geodesic
infimum will correspond to the $z_k$ having the largest imaginary part, and dually for
the supremum, i.e., the $z_k$ having the smallest imaginary part. In the case of large τ,
we note that the real part of both the τ-geodesic infimum and supremum equals
$\left( \bigvee_{1 \leq k \leq K} \Re(z_k) + \bigwedge_{1 \leq k \leq K} \Re(z_k) \right)/2$, and the imaginary part of the infimum goes to
$+\infty$ and that of the supremum to 0 when $\tau \to +\infty$.

[Figure 12.7: panels (a) and (b) plot y = Im(z) against x = Re(z); panel (c) plots f_{μ,σ²}(x) against x; panel (d) plots F_{μ,σ²}(x) against x.]

Fig. 12.7 a Set of nine points in $\mathbb{H}^2$, $Z = \{z_k\}_{1 \leq k \leq 9}$. b Computation of infimum $\bigwedge^{geo}_{\mathbb{H}^2} Z = z_{inf}$
(blue "×") and supremum $\bigvee^{geo}_{\mathbb{H}^2} Z = z_{sup}$ (red "×"). Black "∗" are the original points and green
"∗" the corresponding dual ones. c In black, set of Gaussian pdfs associated to Z, i.e.,
$N_k(\mu = \sqrt{2}\,x_k, \alpha^2 = y_k^2)$; in blue, infimum Gaussian pdf $N_{inf}(\mu = \sqrt{2}\,x_{inf}, \alpha^2 = y_{inf}^2)$; in red, supremum
Gaussian pdf $N_{sup}(\mu = \sqrt{2}\,x_{sup}, \alpha^2 = y_{sup}^2)$. d Cumulative distribution functions of Gaussian pdfs
from c

[Figure 12.8: panels (a) and (b) plot y = Im(z) against x = Re(z).]

Fig. 12.8 a Set of nine points in $\mathbb{H}^2$, $Z = \{z_k\}_{1 \leq k \leq 9}$. b Computation of the smallest Burbea-Rao
τ-order geodesic enclosing the set Z, for τ = 0.01 (in green), τ = 1 (in red), τ = 5 (in magenta),
τ = 20 (in blue)

12.4.5 Upper Half-Plane Asymmetric Geodesic Infimum/Supremum

According to the properties of geodesic infimum $z_{inf}$ and supremum $z_{sup}$ discussed
above, we note that their real parts $\Re(z_{inf})$ and $\Re(z_{sup})$ belong to the interval bounded
by the real parts of the points of set Z. Moreover, $\Re(z_{inf})$ and $\Re(z_{sup})$ are not ordered
between them. Therefore, the real part of the supremum can be smaller than that of
the infimum. For instance, in the extreme case of a set Z where all the imaginary
parts are equal, the real parts of its geodesic infimum and supremum are both equal to
the average of the real parts of points, i.e., given $Z = \{z_k\}_{1 \leq k \leq K}$, if $y_k = y$, $1 \leq k \leq K$,
then $\Re(z_{inf}) = \Re(z_{sup}) = \frac{1}{K} \sum_{k=1}^{K} x_k$. From the viewpoint of morphological image
filtering, it can be potentially interesting to impose an asymmetric behavior for the
infimum and supremum such that $\Re(z^{-\to+}_{inf}) \leq \Re(z_k) \leq \Re(z^{-\to+}_{sup})$, $1 \leq k \leq K$. Note
that the proposed notation $-\to+$ indicates a partially ordered set on the x-axis. In order
to fulfil these requirements, we can geometrically consider the rectangle bounding
the minimal enclosing semi-circle, which is just of dimensions $2r_{inf} \times r_{inf}$, and use it
to define the asymmetric infimum $z^{-\to+}_{inf}$ as the upper-left corner of the rectangle. The
asymmetric supremum $z^{-\to+}_{sup}$ is similarly defined from the bounding rectangle of the
dual minimal enclosing semi-circle. Mathematically, given the geodesic infimum $z_{inf}$
and supremum $z_{sup}$, we have the following definitions for the asymmetric geodesic
infimum and supremum (Fig. 12.9):

\[
\begin{cases}
z^{-\to+}_{inf} = \bigwedge\nolimits^{-\to+}_{\mathbb{H}^2} Z = (a_{inf} - r_{inf}) + i\,r_{inf};\\[4pt]
z^{-\to+}_{sup} = \bigvee\nolimits^{-\to+}_{\mathbb{H}^2} Z = -(x_c^{dual} - r^{dual}) + i\,\dfrac{1}{r^{dual}}.
\end{cases}
\tag{12.38}
\]
Remark Geodesic infimum and supremum of Gaussian distributions. Let us consider
their interpretation as infimum and supremum of a set of univariate Gaussian pdfs, see the
example depicted in Fig. 12.7. Given a set of K Gaussian pdfs $N_k(\mu = \sqrt{2}\,x_k, \alpha^2 = y_k^2)$,
$1 \leq k \leq K$, we observe that the Gaussian pdf associated to the geodesic infimum,
$N_{inf}(\mu = \sqrt{2}\,x_{inf}, \alpha^2 = y_{inf}^2)$, has a variance larger than any Gaussian of the set and its
mean is a kind of barycenter between the Gaussian pdfs having a larger variance. The
supremum Gaussian pdf $N_{sup}(\mu = \sqrt{2}\,x_{sup}, \alpha^2 = y_{sup}^2)$ has a smaller variance than
the K Gaussian pdfs and its mean is between the ones of smaller variance. In terms
of the corresponding cumulative distribution functions, we observe that the geodesic
supremum/infimum do not have a natural interpretation. In the case of the asymmetric
Gaussian geodesic infimum $N^{-\to+}_{inf}(\mu = \sqrt{2}\,x^{-\to+}_{inf}, \alpha^2 = (y^{-\to+}_{inf})^2)$ and Gaussian
supremum $N^{-\to+}_{sup}(\mu = \sqrt{2}\,x^{-\to+}_{sup}, \alpha^2 = (y^{-\to+}_{sup})^2)$, we observe how the means are
ordered with respect to the K others, which also implies that the corresponding cdfs
are ordered. The latter is related to the notion of stochastic dominance [30] and will
be explored in detail in ongoing research.

[Figure 12.9: panel (a) plots f_{μ,σ²}(x) against x; panel (b) plots F_{μ,σ²}(x) against x.]

Fig. 12.9 a Infimum and supremum Gaussian pdfs (in green and red respectively) from asymmetric
geodesic infimum $z^{-\to+}_{inf}$ and supremum $z^{-\to+}_{sup}$ from the set of Fig. 12.7. b Cumulative distribution functions of
Gaussian pdfs from (a)

12.5 Morphological Operators on F (Φ, H2 ) for Processing Univariate Gaussian Distribution-Valued Images

Let us consider that $\mathbb{H}^2$ has been endowed with one of the partial orderings discussed
above, generally denoted by $\leq$. Hence $(\mathbb{H}^2, \leq)$ is a poset, which has also a structure
of complete lattice since we consider that infimum $\bigwedge$ and supremum $\bigvee$ are defined
for any set of points in $\mathbb{H}^2$.

12.5.1 Adjunction on Complete Lattice (H2 , ≤)

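The operators $\varepsilon : \mathbb{H}^2 \to \mathbb{H}^2$ and $\delta : \mathbb{H}^2 \to \mathbb{H}^2$ are an erosion and a dilation
if they commute respectively with the infimum and the supremum:
$\varepsilon\left( \bigwedge_k z_k \right) = \bigwedge_k \varepsilon(z_k)$ and $\delta\left( \bigvee_k z_k \right) = \bigvee_k \delta(z_k)$, for every set $\{z_k\}_{1 \leq k \leq K}$. Erosion and dilation
are increasing operators, i.e., $\forall z, z' \in \mathbb{H}^2$, if $z \leq z'$ then $\varepsilon(z) \leq \varepsilon(z')$ and $\delta(z) \leq \delta(z')$.
Erosion and dilation are related by the notion of adjunction [21, 29], i.e.,

\[
\delta(z) \leq z' \iff z \leq \varepsilon(z'); \qquad \forall z, z' \in \mathbb{H}^2. \tag{12.39}
\]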

Adjunction law (12.39) is of fundamental importance in mathematical morphology
since it allows one to define a unique dilation $\delta$ associated to a given erosion $\varepsilon$,
i.e., $\delta(z') = \bigwedge \{z \in \mathcal{H}^2 : z' \le \varepsilon(z)\}$, $z' \in \mathcal{H}^2$. Similarly one can define a
unique erosion from a given dilation: $\varepsilon(z) = \bigvee \{z' \in \mathcal{H}^2 : \delta(z') \le z\}$, $z \in \mathcal{H}^2$.
Given an adjunction $(\varepsilon, \delta)$, their composition product operators, $\omega(z) = \delta(\varepsilon(z))$ and
$\varphi(z) = \varepsilon(\delta(z))$, are respectively an opening and a closing, which are the basic
morphological filters having very useful properties [21, 29]: idempotency $\omega\omega(z) = \omega(z)$,
anti-extensivity $\omega(z) \le z$ and extensivity $z \le \varphi(z)$, and increasingness. Another
relevant result is the fact that, given an erosion $\varepsilon$, the opening and closing by adjunction are
exclusively defined in terms of erosion [21] as
$\omega(z) = \bigwedge \{z' \in \mathcal{H}^2 : \varepsilon(z) \le \varepsilon(z')\}$,
$\varphi(z) = \bigwedge \{\varepsilon(z') : z' \in \mathcal{H}^2,\ z \le \varepsilon(z')\}$, $\forall z \in \mathcal{H}^2$.
In the case of a complete inf-semilattice $(\mathcal{H}^2, \le)$, where the infimum $\bigwedge$ is defined
but the supremum is not necessarily, we have the following particular results
[22, 23]: (a) it is always possible to associate an opening $\omega$ to a given erosion $\varepsilon$
by means of $\omega(z) = \bigwedge \{z' \in \mathcal{H}^2 : \varepsilon(z) \le \varepsilon(z')\}$, (b) even though the adjoint dilation
$\delta$ is not well-defined in $\mathcal{H}^2$, it is always well-defined for elements on the image of $\mathcal{H}^2$
by $\varepsilon$, and (c) $\omega = \delta\varepsilon$. The closing defined by $\varphi = \varepsilon\delta$ is only partially defined.
Obviously, in the case of an inf-semilattice, it is still possible to define $\delta$ such that
$\delta\left(\bigvee z_k\right) = \bigvee \delta(z_k)$ for families for which the supremum exists.

12.5.2 Erosion and Dilation in $\mathcal{F}(\Delta, \mathcal{H}^2)$

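If $(\mathcal{H}^2, \le)$ is a complete lattice, the set of images $\mathcal{F}(\Delta, \mathcal{H}^2)$ is also a complete lattice
defined as follows: for all $f, g \in \mathcal{F}(\Delta, \mathcal{H}^2)$, (i) $f \le g \iff f(p) \le g(p)$, $\forall p \in \Delta$; (ii)
$(f \wedge g)(p) = f(p) \wedge g(p)$, $\forall p \in \Delta$; (iii) $(f \vee g)(p) = f(p) \vee g(p)$, $\forall p \in \Delta$, where
$\wedge$ and $\vee$ are the infimum and supremum in $\mathcal{H}^2$. One can now define the following
adjoint pair of flat erosion $\varepsilon_B(f)$ and flat dilation $\delta_B(f)$ of each pixel $p$ of image $f$ [21,
29]:

$$\varepsilon_B(f)(p) = \bigwedge_{q \in B(p)} f(p + q), \tag{12.40}$$

$$\delta_B(f)(p) = \bigvee_{q \in B(p)} f(p - q), \tag{12.41}$$

such that

$$\delta_B(f)(p) \le g(p) \iff f(p) \le \varepsilon_B(g)(p); \quad \forall f, g \in \mathcal{F}(\Delta, \mathcal{H}^2). \tag{12.42}$$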

where set B is called the structuring element, which defines the set of points in Δ when
it is centered at point p, denoted B(p) [31]. These operators, which are translation
invariant, can be seen as constant-weight (this is the reason why they are called flat)
inf/sup-convolutions, where the structuring element B works as a moving window.
The above erosion (resp. dilation) moves object edges within the image in such a
way that it expands image structures with values in $\mathcal{H}^2$ close to the bottom element
(resp. close to the top) of the lattice $\mathcal{F}(\Delta, \mathcal{H}^2)$ and shrinks the objects with values
close to the top element (resp. close to the bottom).

Fig. 12.10 Supremum and infimum of a set of 25 patches parameterized by their mean and standard
deviation: a in red the region where the overlapped patches are taken; b embedding into the space $\mathcal{H}^2$
according to the coordinates $\mu/\sqrt{2}$ and $\sigma$ and corresponding sup and inf for the different ordering
strategies
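
The flat erosion (12.40) and flat dilation (12.41) can be sketched in a few lines for the simplest case, the product ordering, where the infimum and supremum act on the real and imaginary components separately. The sketch below is only an illustration under stated assumptions: the function name `flat_erosion_dilation`, the square window of half-width `radius`, and the edge-replicating border handling are choices made here, not part of the chapter.

```python
import numpy as np


def flat_erosion_dilation(fx, fy, radius=2):
    """Flat erosion and dilation of an H2-valued image, product ordering only.

    fx, fy : 2-D arrays with the real part (mu / sqrt(2)) and the imaginary
             part (sigma > 0) of f(p) = fx(p) + i fy(p).
    radius : half-width of the square structuring element B, i.e. B is a
             (2 * radius + 1) x (2 * radius + 1) window (5 x 5 for radius=2).
    Returns ((ero_x, ero_y), (dil_x, dil_y)).
    """
    k = 2 * radius + 1
    eroded, dilated = [], []
    for comp in (fx, fy):
        comp = np.asarray(comp, dtype=float)
        # Assumption: borders are handled by replicating edge values.
        padded = np.pad(comp, radius, mode="edge")
        # windows[i, j] is the k x k neighbourhood centred at pixel (i, j).
        windows = np.lib.stride_tricks.sliding_window_view(padded, (k, k))
        eroded.append(windows.min(axis=(-2, -1)))   # componentwise infimum
        dilated.append(windows.max(axis=(-2, -1)))  # componentwise supremum
    return tuple(eroded), tuple(dilated)
```

Because the window is symmetric, the sets {p + q} and {p - q} coincide, so the same neighbourhood serves both operators. The other orderings discussed in the chapter would replace the componentwise `min`/`max` by their own infimum and supremum.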
Let us consider now the various cases of supremum and infimum that we have
introduced above. In order to support the discussion, we have included an example
in Fig. 12.10. In fact, we have taken all the patches of size 5 × 5 pixels surrounding
one of the pixels from the image of Fig. 12.1. The 25 patches are then embedded into
the space $\mathcal{H}^2$ according to the coordinates $\mu/\sqrt{2}$ and $\sigma$. Finally, the supremum and
infimum of this set of points are computed for the different cases. This is precisely
how the dilation and erosion, respectively, are obtained for the current pixel, the center
of the red region in Fig. 12.10a.
Everything works perfectly for the supremum and infimum in the upper half-plane
product ordering, $\bigvee_{\mathcal{H}^2}$ and $\bigwedge_{\mathcal{H}^2}$, which consequently can be used to construct
dilation and erosion operators in $\mathcal{F}(\Delta, \mathcal{H}^2)$. In fact, this is exactly equivalent to the
classical operators applied on the real and imaginary parts separately.
Similarly, the ones for the upper half-plane polar ordering, $\bigvee^{\mathrm{pol}}_{\mathcal{H}^2}$ and $\bigwedge^{\mathrm{pol}}_{\mathcal{H}^2}$, based
on a total ordering, also lead respectively to dilation and erosion operators.
The erosion produces a point which corresponds here to the patch closest to the origin.
That means a patch of intermediate mean and standard deviation intensity since the
image intensity is normalized, see Sect. 12.5.4. On the contrary, the dilation gives a point
associated to the farthest patch from the origin; in this example, a homogeneous bright
patch. Note that patches at a great distance correspond to the most “contrasted ones”
on the image: either homogeneous patches of dark or bright intensity or patches with
a strong variation in intensity (edge patches).
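
To make the "closest to / farthest from the origin" reading concrete, the following sketch ranks a set of points of the upper half-plane by their hyperbolic distance to the origin $z_0 = (0, 1)$ and returns the first and last one. It is a hedged illustration, not the chapter's exact polar ordering: the distance uses the standard upper half-plane formula, and the tie-break by the angle `atan2(y - 1, x)` is an assumption introduced here, as are the function names.

```python
import numpy as np


def hyperbolic_radius(x, y):
    """Hyperbolic distance from z = x + i y (y > 0) to the origin z0 = i."""
    return np.arccosh(1.0 + (x ** 2 + (y - 1.0) ** 2) / (2.0 * y))


def polar_inf_sup(points):
    """Return (infimum, supremum) of [(x, y), ...] for a polar-type total order.

    Points are ranked by hyperbolic radius first. Ties are broken by the
    angle atan2(y - 1, x), which is an assumption made for this sketch.
    """
    pts = np.asarray(points, dtype=float)
    radius = hyperbolic_radius(pts[:, 0], pts[:, 1])
    angle = np.arctan2(pts[:, 1] - 1.0, pts[:, 0])
    # np.lexsort sorts by the LAST key first, so radius is the primary key.
    order = np.lexsort((angle, radius))
    return tuple(pts[order[0]]), tuple(pts[order[-1]])
```

For a 5 × 5 neighbourhood, `points` would hold the 25 embedded patches $(\mu/\sqrt{2}, \sigma)$; the first returned point plays the role of the erosion value and the second that of the dilation value at the current pixel.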
We note that for the symmetric ordering $\preceq_{\mathcal{H}^2}$ one only has an inf-semilattice
structure, associated to the corresponding infimum. However, in the case of the upper half-plane geodesic
ordering, the pair of operators (12.40) and (12.41) associated to our supremum $\bigvee^{\mathrm{geo}}_{\mathcal{H}^2}$
and infimum $\bigwedge^{\mathrm{geo}}_{\mathcal{H}^2}$ will not verify the adjunction (12.42). The same limitation also holds
for the upper half-plane asymmetric geodesic supremum and infimum. Hence, the
geodesic supremum and infimum do not strictly involve a pair of dilation and erosion
in the mathematical morphology sense. Nevertheless, we can compute both operators
and use them to filter images in $\mathcal{F}(\Delta, \mathcal{H}^2)$ without problem. From the example
of Fig. 12.10 we observe that the geodesic infimum gives a point with a standard
deviation equal to or larger than any of the patches and a mean intermediate between
the patches of high standard deviation. The supremum involves a point of standard
deviation smaller than (or equal to) the others, and the mean is obtained by
averaging around the mean of the ones with a small standard deviation. Consequently
the erosion involves a nonlinear filtering which enhances the image zones of high
standard deviation, typically the contours. The dilation enhances the homogeneous
zones. The asymmetrization produces operators where the dilation and erosion have
the same interpretation for the mean as the classical ones but the filtering effects are
driven by the zones of low or high standard deviation.

12.5.3 Opening and Closing in $\mathcal{F}(\Delta, \mathcal{H}^2)$

Given the adjoint image operators (εB , δB ), the opening and closing by adjunction
of image f , according to structuring element B, are defined as the composition oper-
ators [21, 29]:

ωB (f ) = δB (εB (f )) , (12.43)
ϕB (f ) = εB (δB (f )) . (12.44)

Openings and closings are referred to as morphological filters, which remove objects
of image f that do not comply with a criterion related, on the one hand, to the
invariance of the object support to the structuring element B and, on the other hand,
to the values of the object on $\mathcal{H}^2$ which are far from (in the case of the opening) or
near to (in the case of the closing) the bottom element of $\mathcal{H}^2$ according to the given
partial ordering $\le$.
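
The compositions (12.43) and (12.44) can be illustrated for the product ordering, where the operators act on each component separately. This is a minimal sketch under assumptions: the helper names are hypothetical, the structuring element is a square of half-width `radius`, and the border is padded with +inf for the erosion and -inf for the dilation so that the two operators remain adjoint on a finite image.

```python
import numpy as np


def _window_reduce(a, radius, reducer, fill):
    """Apply `reducer` (np.min or np.max) over a square moving window."""
    a = np.asarray(a, dtype=float)
    k = 2 * radius + 1
    padded = np.pad(a, radius, mode="constant", constant_values=fill)
    windows = np.lib.stride_tricks.sliding_window_view(padded, (k, k))
    return reducer(windows, axis=(-2, -1))


def erode(a, radius):
    # +inf padding never wins a minimum, so only true pixels contribute.
    return _window_reduce(a, radius, np.min, np.inf)


def dilate(a, radius):
    # -inf padding never wins a maximum, so only true pixels contribute.
    return _window_reduce(a, radius, np.max, -np.inf)


def opening(fx, fy, radius=2):
    """omega_B(f) = delta_B(eps_B(f)), componentwise (product ordering)."""
    return tuple(dilate(erode(c, radius), radius) for c in (fx, fy))


def closing(fx, fy, radius=2):
    """phi_B(f) = eps_B(delta_B(f)), componentwise (product ordering)."""
    return tuple(erode(dilate(c, radius), radius) for c in (fx, fy))
```

With this border convention the opening is anti-extensive and idempotent and the closing is extensive, which can be checked numerically on a random image.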
Once the pairs of dual operators (εB , δB ) and (ωB , ϕB ) are defined, the other
morphological filters and transformation can be naturally defined [31] for images in
F(Δ, H2 ). We limit here the illustrative examples to the basic ones.
Following our analysis on the particular cases of ordering and supremum/infimum
in $\mathcal{H}^2$, we can conclude that opening and closing in $\mathcal{F}(\Delta, \mathcal{H}^2)$ are well formulated for
the upper half-plane product ordering and the upper half-plane polar ordering. In the
case of the upper half-plane symmetric ordering, the opening is always defined and the
closing cannot be computed. Again, we should insist on the fact that for the upper
half-plane geodesic ordering, the composition operators obtained by supremum $\bigvee^{\mathrm{geo}}_{\mathcal{H}^2}$ and
infimum $\bigwedge^{\mathrm{geo}}_{\mathcal{H}^2}$ will not produce opening and closing stricto sensu. Notwithstanding,
the corresponding composition operators yield a regularization effect of $\mathcal{F}(\Delta, \mathcal{H}^2)$-images
which can be of interest for practical applications.

12.5.4 Application to Morphological Processing of Univariate Gaussian Distribution-Valued Images

Example 1 A first example of morphological processing for images in $\mathcal{F}(\Delta, \mathcal{H}^2)$ is
given in Fig. 12.11. The starting point is a standard gray-level image $g \in \mathcal{F}(\Delta, \mathbb{R})$,
which is mapped to the image $f(p) = f_x(p) + i f_y(p)$ by the following transformations:
(1) the image is normalized to have zero mean and unit variance; (2) the real and
imaginary components of $f(p)$ are obtained by computing respectively the mean and
standard deviation over a patch centered at p of radius W pixels (in the example
W = 4); i.e.,

$$g(p) \mapsto \hat{g}(p) = \frac{g(p) - \mathrm{Mean}(g)}{\sqrt{\mathrm{Var}(g)}} \mapsto f(p) = \mathrm{Mean}_W(\hat{g})(p) + i\,\sqrt{\mathrm{Var}_W(\hat{g})(p)}.$$

We note that, by definition of our representation space, $f_y(p) > 0$. It means that the
variance of each patch should always be bigger than zero, and obviously this is not
the case in constant patches. In order to cope with this problem, we propose to add
a small positive constant ν to the value of the standard deviation.
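
The two-step mapping of Example 1 can be sketched as follows. This is an illustration under assumptions rather than the authors' implementation: a patch "of radius W" is taken to be the square window of side 2W + 1, borders are handled by reflection, the regularizing constant is called `eps` with an arbitrary default, and the input image is assumed not to be constant (otherwise the global normalization divides by zero).

```python
import numpy as np


def embed_gray_image(g, radius=4, eps=1e-6):
    """Map a grey-level image g to (fx, fy), with f(p) = fx(p) + i fy(p).

    Step 1: normalize g to zero mean and unit variance.
    Step 2: fx is the local mean and fy the local standard deviation of the
            normalized image over a (2 * radius + 1)^2 window, plus eps so
            that fy > 0 even on constant patches.
    """
    g = np.asarray(g, dtype=float)
    g_hat = (g - g.mean()) / g.std()  # assumes g is not constant
    k = 2 * radius + 1
    # Assumption: reflect the image at its borders.
    padded = np.pad(g_hat, radius, mode="reflect")
    windows = np.lib.stride_tricks.sliding_window_view(padded, (k, k))
    fx = windows.mean(axis=(-2, -1))
    fy = windows.std(axis=(-2, -1)) + eps
    return fx, fy
```

The returned pair can be fed directly to any of the erosion or dilation operators, since every pixel then lies strictly inside the upper half-plane.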
Figure 12.11 gives a comparison of morphological erosions $\varepsilon_B(f)(p)$ and openings
$\omega_B(f)(p)$ on this image f using the five complete (inf-semi)lattices of $\mathcal{H}^2$ considered
in the paper. We have also included the pseudo-erosions and pseudo-openings associated
to the geodesic supremum and infimum and the asymmetric geodesic ones.
The same structuring element B, a square of 5 × 5 pixels, has been used for all
the examples. First of all, we recall that working on the product complete
lattice $(\mathcal{H}^2, \le_{\mathcal{H}^2})$ is equivalent to a marginal processing of real and imaginary components.
As expected, the symmetric ordering-based inf-semilattice $(\mathcal{H}^2, \preceq_{\mathcal{H}^2})$ and
polar ordering-based lattice $(\mathcal{H}^2, \le^{\mathrm{pol}}_{\mathcal{H}^2})$ produce rather similar results for openings.
We observe that in both cases the opening produces a symmetric filtering effect
between bright/dark intensity in the mean and standard deviation component. But it
is important to remark that the processing effects depend on how image components
are valued with respect to the origin $z_0 = (0, 1)$. This is the reason why it is proposed
to always normalize the image by mean/variance.
The results of the pseudo-openings produced by working on the geodesic lattice
$(\mathcal{H}^2, \bigvee^{\mathrm{geo}}_{\mathcal{H}^2}, \bigwedge^{\mathrm{geo}}_{\mathcal{H}^2})$ and the asymmetric geodesic lattice $(\mathcal{H}^2, \bigvee^{-\to+}_{\mathcal{H}^2}, \bigwedge^{-\to+}_{\mathcal{H}^2})$ involve
a processing which is mainly driven by the values of the standard deviation. Hence,
the filtering effects are potentially more interesting for applications that need to deal
with pixel uncertainty, either in a symmetric processing of both bright/dark mean
values with $(\mathcal{H}^2, \bigvee^{\mathrm{geo}}_{\mathcal{H}^2}, \bigwedge^{\mathrm{geo}}_{\mathcal{H}^2})$ or in a more classical morphological asymmetrization
with $(\mathcal{H}^2, \bigvee^{-\to+}_{\mathcal{H}^2}, \bigwedge^{-\to+}_{\mathcal{H}^2})$.
Fig. 12.11 Comparison of morphological erosions and openings of an image $f \in \mathcal{F}(\Delta, \mathcal{H}^2)$:
a Original real-valued image $g(p) \in \mathcal{F}(\Delta, \mathbb{R})$ used to simulate (see the text) the image
$f(p) = f_x(p) + i f_y(p)$, where b and c give respectively the real and imaginary components.
d- and e- depict respectively the erosion $\varepsilon_B(f)(p)$ and opening $\omega_B(f)(p)$ of image $f(p)$ for five
orderings on the upper half-plane. The structuring element B is a window of 5 × 5 pixels

Example 2 Figure 12.12 illustrates a comparative example of erosions $\varepsilon_B(f)(p)$ on a
very noisy image $g(p)$. We note that $g(p)$ is mean centered. The “noise” is related to
an acquisition at the limit of exposure time/spatial resolution. We consider an image
model $f(p) = f_x(p) + i f_y(p)$, where $f_x(p) = g(p)$ and $f_y(p)$ is the standard deviation of
intensities in a patch of radius equal to 4 pixels. The results of erosion obtained
by the product and symmetric partial orderings are compared to the ones obtained by
polar ordering and more generally by the τ-polar ordering with four values of τ.
We observe, on the one hand, that polar orderings are more relevant than the product or
symmetric ones. As expected, the τ-polar erosion with τ = 1 is almost equivalent
to the hyperbolic polar ordering. We note, on the other hand, the interest of the limit
cases of τ-polar erosion. The erosion for small τ produces a strongly regularized
image where the bright/dark objects with respect to the background have been nicely
enhanced. In the case of large τ, the background (i.e., pixel values close to the origin
in $\mathcal{H}^2$) is enhanced, which involves removing all the image structures smaller than
the structuring element B.
Example 3 In Fig. 12.13 a limited comparison for the case of dilation $\delta_B(f)(p)$ is
depicted. The image $f(p) = f_x(p) + i f_y(p)$ is obtained similarly to the case of Example
1. We can compare the supremum by product ordering with those obtained by the
polar supremum and the τ-polar supremum, with τ = 0.01. The analysis is similar
to the previous case.
Fig. 12.12 Comparison of erosion of Gaussian distribution-valued noisy image $\varepsilon_B(f)(p)$: a Original
image $f \in \mathcal{F}(\Delta, \mathcal{H}^2)$, showing both the real and the imaginary components; b upper half-plane
product ordering (equivalent to standard processing); c upper half-plane symmetric ordering;
d upper half-plane polar ordering; e–h upper half-plane τ-polar ordering, with four values of τ. In
all the cases the structuring element B is also a square of 5 × 5 pixels

Fig. 12.13 Comparison of dilation of Gaussian distribution-valued image $\delta_B(f)(p)$: a Original
image $f \in \mathcal{F}(\Delta, \mathcal{H}^2)$, showing both the real and the imaginary components; b upper half-plane
product ordering (equivalent to standard processing); c upper half-plane polar ordering; e upper half-plane
τ-polar ordering, with τ = 0.01. In all the cases the structuring element B is also a square of 5 × 5
pixels

Example 4 Figure 12.14 involves again the noisy retinal image, and it shows a comparison
of results from (pseudo-)opening $\omega_B(f)(p)$ and (pseudo-)closing $\varphi_B(f)(p)$
obtained for the product ordering, the geodesic lattice $(\mathcal{H}^2, \bigvee^{\mathrm{geo}}_{\mathcal{H}^2}, \bigwedge^{\mathrm{geo}}_{\mathcal{H}^2})$ and the
asymmetric geodesic lattice $(\mathcal{H}^2, \bigvee^{-\to+}_{\mathcal{H}^2}, \bigwedge^{-\to+}_{\mathcal{H}^2})$. The structuring element B is a
square of 5 × 5 pixels. In order to be able to compare their enhancement effects
with an averaging operator, we also give the result of filtering by computing the
minimax center in a square of 5 × 5 pixels [2, 8], see the Remark in Sect. 12.2.5. We note
that operators associated to the asymmetric geodesic supremum and infimum yield
mean images relatively similar to the standard ones underlying the supremum and
infimum in the product lattice. However, including the information given by the local
standard deviation, the contrast of the structures is better in the asymmetric geodesic
supremum and infimum. Nevertheless, we observe that the operators by geodesic
supremum and infimum also produce in this example a significant regularization of
the image. By the way, we note that the corresponding geodesic pseudo-opening
and pseudo-closing give rather similar mean images but different standard deviation
images, as expected by the formulation of the geodesic supremum and infimum.
Example 5 The example given in Fig. 12.15 corresponds to an image $f(p) = f_x(p) + i f_y(p)$
obtained by multiple acquisition of a sequence of 100 frames, where $f_x(p)$
represents the mean intensity at each pixel and $f_y(p)$ the standard deviation of intensity
along the sequence. The 100 frames have been taken from a stationary camera.
The goal of the example is to show how to extract image objects of large intensity
and support size smaller than the structuring element (here a square of 7 × 7 pixels)
using the residue between the original image $f(p)$ and its filtered image by opening
$\omega_B(f)$. In the case of images on $\mathcal{F}(\Delta, \mathcal{H}^2)$, the residue is defined as the pixelwise
hyperbolic distance between them. In this case study, results on processing on polar
ordering-based lattice versus asymmetric geodesic lattice are compared.
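
The image model and the residue of Example 5 can be sketched as below. Both functions are hedged illustrations: `embed_sequence` assumes the frames are stacked along the first axis and borrows the small constant `eps` from Example 1 to keep the imaginary part positive, and `hyperbolic_residue` uses the standard distance formula of the upper half-plane with metric $(dx^2 + dy^2)/y^2$, which is assumed here to be the distance the chapter refers to.

```python
import numpy as np


def embed_sequence(frames, eps=1e-6):
    """Build f = fx + i fy from a stack of frames of shape (T, H, W).

    fx is the per-pixel mean and fy the per-pixel standard deviation along
    the sequence; eps (an assumption) keeps fy strictly positive.
    """
    frames = np.asarray(frames, dtype=float)
    return frames.mean(axis=0), frames.std(axis=0) + eps


def hyperbolic_residue(fx, fy, gx, gy):
    """Pixelwise hyperbolic distance between f = fx + i fy and g = gx + i gy."""
    num = (fx - gx) ** 2 + (fy - gy) ** 2
    return np.arccosh(1.0 + num / (2.0 * fy * gy))
```

Here `(gx, gy)` would be the opened image $\omega_B(f)$ computed with whichever ordering is under study; pixels where the residue is large are the ones removed by the opening.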

Fig. 12.14 Morphological processing of Gaussian distribution-valued noisy image: a Original
image $f \in \mathcal{F}(\Delta, \mathcal{H}^2)$, showing both the real and the imaginary components; b filtered image by
computing the minimax center in a square of 5 × 5 pixels; c morphological opening working on the
product lattice; d morphological closing working on the product lattice; e morphological
pseudo-opening working on the geodesic lattice; f morphological pseudo-opening on the asymmetric
geodesic lattice; g morphological pseudo-closing working on the geodesic lattice; h morphological
pseudo-closing on the asymmetric geodesic lattice. In all the cases the structuring element B is also
a square of 5 × 5 pixels

12.5.5 Conclusions on Morphological Operators for $\mathcal{F}(\Delta, \mathcal{H}^2)$ images

Based on the discussion given in Sect. 12.5.2 as well as on the examples from Sect. 12.5.4,
we can draw some conclusions on the experimental part of this chapter.

• First of all, we note that the examples considered here are only a preliminary
  exploration of the potential applications of morphological processing of univariate
  Gaussian distribution-valued images.
• We have two main case studies. First, standard images which are embedded into the
  Poincaré upper-half plane representation by parameterization of each local patch
  by its mean and standard deviation. Second, images which naturally involve a
  distribution of values at each pixel. Note that in the first case, the information
  of standard deviation mainly serves to discriminate between homogeneous
  zones and inhomogeneous ones (textures or contours). In the second case, the
  standard deviation involves relevant information on the nature of the noise during
  the acquisition.
• For any of these two cases, we should remark that the different alternatives of ordering
  and derived operators considered in the paper produce nonlinear processing whose
  main property is that filtering effects are strongly driven by the standard deviation.
• Upper half-plane product ordering is nothing more than standard processing of
  mean and standard deviation separately. The symmetric ordering leading to an
  inf-semilattice has a limited interest since similar effects are obtained by the polar
  ordering.
• Upper half-plane polar ordering using standard hyperbolic polar coordinates or
  the τ-order Hellinger distance produces morphological operators appropriate for
  image regularization and enhancement. We recall that points close to the origin
  (selected by the erosion) correspond in the case of the patches to those of
  intermediate mean and standard deviation intensity after normalization. On the
  contrary, patches far from the origin correspond to the “contrasted” ones: either
  homogeneous patches of dark or bright intensity or patches with a strong variation
  in intensity (edge patches).
  We note that with respect to filters based on averaging, the half-plane polar
  dilation/erosion, as well as their product operators, produce strongly simplified
  images where the edges and the main objects are enhanced without any blurring
  effect.
  From our viewpoint this is useful for both cases of images. Then, the choice of a
  high or low value for τ will depend on the particular nature of the features to be
  enhanced. In any case, this parameter can be optimized.
• Upper half-plane geodesic ordering involves a nonlinear filtering framework which
  takes into account the intrinsic geometry of $\mathcal{H}^2$. It is mainly based on the notion
  of minimal enclosing geodesic which covers the set of points.
  In practice, the geodesic infimum gives a point with a standard deviation equal to or
  larger than any of the points and a mean which can be seen as intermediate between the
  mean values of the points of high standard deviation. The supremum produces a point of standard
  deviation equal to or smaller than the others, and the mean is obtained by averaging
  around the mean of the ones having a small standard deviation.
  Consequently the erosion involves a nonlinear filtering which enhances the image
  zones of high standard deviation, typically the contours. The dilation enhances the
  homogeneous zones. We should note that the processed mean images by the
  composition of these two operators (i.e., openings and closings) are strongly enhanced
  by increasing their bright/dark contrast. Therefore, it should be considered as an
  appropriate tool for contrast structure enhancement on irregular backgrounds.
  The asymmetric version of the geodesic ordering involves that dilation and erosion
  have the same interpretation for the mean as the classical ones but the filtering
  effects are driven by the zones of low or high standard deviation. These operators are
  potentially useful for object extraction by residue between the original image and
  the opening/closing. In comparison with classical residues, the new ones produce
  sharper extracted objects.

Fig. 12.15 Morphological detail extraction of multiple acquisition image modeled as a Gaussian
distribution-valued: a Original image $f \in \mathcal{F}(\Delta, \mathcal{H}^2)$, showing both the real and the imaginary
components; b morphological opening $\omega_B(f)$ working on polar ordering-based lattice; c corresponding
residue (pixelwise hyperbolic difference) between the original and the opened image; d morphological
pseudo-opening $\omega_B(f)$ working on the asymmetric geodesic lattice; e corresponding residue.
In both cases the structuring element B is also a square of 7 × 7 pixels

12.6 Perspectives

Levelings are a powerful family of self-dual morphological operators which have
also been formulated in vector spaces [24], using geometric notions such as minimum enclosing
balls and half-plane intersections. We intend to explore the formulation of levelings
in the upper half-plane in future work.
The complete lattice structures for the Poincaré upper-half plane introduced in this
work, and corresponding morphological operators, can be applied to process other
hyperbolic-valued images. For instance, on the one hand, it was proven in [13] that the
structure tensors for 2D images, i.e., where each pixel is given a 2 × 2 symmetric positive
definite matrix whose determinant is equal to 1, are isomorphic to the Poincaré unit
disk model. On the other hand, polarimetric images [17], where each pixel is given
a partially polarized state, can be embedded in the Poincaré unit disk model. In both
cases, we only need the mapping from the Poincaré disk model to the Poincaré
half-plane, i.e.,

$$z \mapsto -i\,\frac{z+1}{z-1}.$$
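
As a quick sanity check of this mapping, the following sketch evaluates it with Python's built-in complex numbers. The function name is hypothetical; the formula is the one given above, and the input is assumed to lie in the open unit disk so that the denominator never vanishes.

```python
def disk_to_half_plane(z):
    """Map a point z of the open unit disk (|z| < 1) to the upper half-plane.

    Implements z -> -i (z + 1) / (z - 1). The imaginary part of the result
    equals (1 - |z|^2) / |z - 1|^2, hence it is positive inside the disk.
    """
    z = complex(z)
    return -1j * (z + 1) / (z - 1)
```

For example, the disk centre `0` is sent to `i`, the origin $z_0 = (0, 1)$ used by the polar ordering, so disk-valued images can be processed with the half-plane operators after this change of model.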

We have considered here the case of Gaussian distribution-valued images. It
should be potentially interesting for practical applications to consider that the
distribution of intensity at a given pixel belongs to a more general family of distributions
than the Gaussian one. In particular, the case of the Gamma distribution seems an
appropriate framework. The information geometry of the gamma manifold has been
studied in the past [16] and some of the ideas developed in this work can be revisited
for the case of Gamma distribution-valued images by endowing the gamma manifold
with a complete lattice structure.
The previous extension only concerns the generalization of the ordering structure for
univariate distributions. In the case of multivariate Gaussian distributions, we can
consider replacing the Poincaré upper-half plane by the Siegel upper-half space [7].

References

1. Angulo, J., Velasco-Forero, S.: Complete lattice structure of Poincaré upper-half plane and
mathematical morphology for hyperbolic-valued images. In: Nielsen, F., Barbaresco, F. (eds.)
Proceedings of First International Conference Geometric Science of Information (GSI’2013),
vol. 8085, pp. 535–542. Springer LNCS (2013)
2. Arnaudon, M., Nielsen, F.: On approximating the riemannian 1-center. Comput. Geom. 46(1),
93–104 (2013)
3. Amari, S.-I., Barndorff-Nielsen, O.E., Kass, R.E., Lauritzen, S.L., Rao, C.R.: Differential
geometry in statistical inference. Lecture Notes-Monograph Series, vol. 10, pp. 19–94, Institute
of Mathematical Statistics, Hayward (1987)
4. Amari, S.-I., Nagaoka, H.: Methods of information geometry, translations of mathematical
monographs. Am. Math. Soc. 191, (2000)
5. Barbaresco, F.: Interactions between symmetric cone and information geometries: Bruhat-Tits
and siegel spaces models for high resolution autoregressive doppler imagery. In: Nielsen, F.
(eds.) Emerging Trends in Visual Computing (ETCV’08), Springer LNCS, Heidelberg vol.
5416, pp. 124–163, (2009)
6. Barbaresco, F.: Geometric radar processing based on Fréchet distance: information geome-
try versus optimal transport theory. In: Proceedings of IEEE International Radar Symposium
(IRS’2011), pp. 663–668 (2011)
7. Barbaresco, F.: Information geometry of covariance matrix: cartan-siegel homogeneous
bounded domains, Mostow/Berger fibration and fréchet median. In: Nielsen, F., Bhatia, R.
(eds.) Matrix Information Geometry, pp. 199–255, Springer, Heidelberg (2013)
8. Bădoiu, M., Clarkson, K.L.: Smaller core-sets for balls. In: Proceedings of the Fourteenth
annual ACM-SIAM Symposium on Discrete Algorithms (SIAM), pp. 801–802, ACM, New
York(2003)
9. Burbea, J., Rao, C.R.: Entropy differential metric, distance and divergence measures in prob-
ability spaces: a unified approach. J. Multivar. Anal. 12(4), 575–96 (1982)
10. Cǎliman, A., Ivanovici, M., Richard, N.: Probabilistic pseudo-morphology for grayscale and
color images. Pattern Recogn. 47, 721–35 (2004)
11. Cammarota, V., Orsingher, E.: Travelling randomly on the poincaré half-plane with a
pythagorean compass. J. Stat. Phys. 130(3), 455–82 (2008)

12. Cannon, J.W., Floyd, W.J., Kenyon, R., Parry, W.R.: Hyperbolic geometry. Flavors of Geometry,
vol. 31, MSRI Publications, Cambridge (1997)
13. Chossat, P., Faugeras, O.: Hyperbolic planforms in relation to visual edges and textures per-
ception. PLoS Comput. Biol. 5(12), p1 (2009)
14. Costa, S.I.R., Santos, S.A., Strapasson, J.E.: Fisher information matrix and hyperbolic geom-
etry. In: Proc. of IEEE ISOC ITW2005 on Coding and Complexity, pp. 34–36, (2005)
15. Costa, S.I.R., Santos, S.A., Strapasson, J.E.: Fisher information distance: a geometrical reading,
arXiv:1210:2354v1, p. 15 (2012)
16. Dodson, C.T.J., Matsuzoe, H.: An affine embedding of the gamma manifold. Appl. Sci. 5(1),
7–12 (2003)
17. Frontera-Pons, J., Angulo, J.: Morphological operators for images valued on the sphere. In:
Proceedings of IEEE ICIP’12 ( IEEE International Conference on Image Processing), pp.
113–116, Orlando (Florida), USA, October (2012)
18. Fuchs, L.: Partially Ordered Algebraic Systems. Pergamon, Oxford (1963)
19. Guts, A.K.: Mappings of families of oricycles in lobachevsky space. Math. USSR-Sb. 19, 131–8
(1973)
20. Guts, A.K.: Mappings of an ordered lobachevsky space. Siberian Math. J. 27(3), 347–61 (1986)
21. Heijmans, H.J.A.M.: Morphological Image Operators. Academic Press, Boston (1994)
22. Heijmans, H.J.A.M., Keshet, R.: Inf-semilattice approach to self-dual morphology. J. Math.
Imaging Vis. 17(1), 55–80 (2002)
23. Keshet, R.: Mathematical morphology on complete semilattices and its applications to image
processing. Fundamenta Informaticæ 41, 33–56 (2000)
24. Meyer, F.: Vectorial Levelings and Flattenings. In: Mathematical Morphology and its Appli-
cations to Image and Signal Processing (Proc. of ISMM’02), pp. 51–60, Kluwer Academic
Publishers, Dordrecht (2000)
25. Nielsen, F., Nock, R.: On the smallest enclosing information disk. Inform. Process. Lett. 105,
93–7 (2008)
26. Nielsen, F., Nock. R.: Hyperbolic voronoi diagrams made easy. In: Proceedings of the 2010
IEEE International Conference on Computational Science and Its Applications, pp. 74–80,
IEEE Computer Society, Washington (2010)
27. Sachs, Z.: Classification of the isometries of the upper half-plane, p. 14. University of Chicago,
VIGRE REU (2011)
28. Sbaiz, L., Yang, F., Charbon, E., Süsstrunk, S., Vetterli, M.: The gigavision camera. In: Pro-
ceedings of IEEE ICASSP’09, pp. 1093–1096 (2009)
29. Serra, J.: Image Analysis and Mathematical Morphology. Vol II: theoretical advances,
Academic Press, London (1988)
30. Shaked, M., Shanthikumar, G.: Stochastic Orders and Their Applications. Academic Press,
New York (1994)
31. Soille, P.: Morphological Image Analysis. Springer-Verlag, Berlin (1999)
32. Treibergs, A.: The hyperbolic plane and its immersions into R3 , Lecture Notes in Department
of Mathematics, p. 13. University of Utah (2003)
Chapter 13
Dimensionality Reduction for Classification
of Stochastic Texture Images

C. T. J. Dodson and W. W. Sampson

Abstract Stochastic textures yield images representing density variations of differing
degrees of spatial disorder, ranging from mixtures of Poisson point processes
to macrostructures of distributed finite objects. They arise in areas such as signal
processing, molecular biology, cosmology, agricultural spatial distributions,
oceanography, meteorology, tomography, radiography and medicine. The new
contribution here is to couple information geometry with multidimensional scaling, also
called dimensionality reduction, to identify small numbers of prominent features
concerning density fluctuation and clustering in stochastic texture images, for
classification of groupings in large datasets. Familiar examples of materials with such
textures in one dimension are cotton yarns, audio noise and genomes, and in two
dimensions paper and nonwoven fibre networks for which radiographic images are
used to assess local variability and intensity of fibre clustering. Information geometry
of trivariate Gaussian spatial distributions of mean pixel density with the mean
densities of its first and second neighbours illustrates features related to sizes and
density of clusters in stochastic texture images. We derive also analytic results for the
case of stochastic textures arising from Poisson processes of line segments on a line
and rectangles in a plane. Comparing human and yeast genomes, we use 12-variate
spatial covariances to capture possible differences relating to secondary structure.
For each of our types of stochastic textures: analytic, simulated, and experimental,
we obtain dimensionality reduction and hence 3D embeddings of sets of samples to
illustrate the various features that are revealed, such as mean density, size and shape
of distributed objects, and clustering effects.

C. T. J. Dodson (B)
School of Mathematics, University of Manchester,
Manchester, M13 9PL, UK
e-mail: [email protected]
W. W. Sampson
School of Materials, University of Manchester, Manchester, M13 9PL, UK
e-mail: [email protected]

F. Nielsen (ed.), Geometric Theory of Information,
Signals and Communication Technology, DOI: 10.1007/978-3-319-05317-2_13,
© Springer International Publishing Switzerland 2014

Keywords Dimensionality reduction · Stochastic texture · Density array · Clustering ·
Spatial covariance · Trivariate Gaussian · Radiographic images · Genome ·
Simulations · Poisson process

13.1 Introduction

The new contribution in this paper is to couple information geometry with dimension-
ality reduction, to identify small numbers of prominent features concerning density
fluctuation and clustering in stochastic texture images, for classification of group-
ings in large datasets. Our methodology applies to any stochastic texture images,
in one, two or three dimensions, but to gain an impression of the nature of exam-
ples we analyse some familiar materials for which we have areal density arrays, and
derive analytic expressions of spatial covariance matrices for Poisson processes of
finite objects in one and two dimensions. Information geometry provides a natural
distance structure on the textures via their spatial covariances, which allows us to
obtain multidimensional scaling or dimensionality reduction and hence 3D embed-
dings of sets of samples. See Mardia et al. [14] for an account of the original work
on multidimensional scaling.
The simplest one-dimensional stochastic texture arises as the density variation
along a cotton yarn, consisting of a near-Poisson process of finite length cotton
fibres on a line; another is an audio noise drone consisting of a Poisson process of
superposed finite length notes or chords. A fundamental microscopic 1-dimensional
stochastic process is the distribution of the 20 amino acids along protein chains in a
genome [1, 3]. Figure 13.1 shows a sample of such a sequence of the 20 amino acids
A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y mapped onto the 20 grey
level values 0.025, 0.075, . . . , 0.975 from the database [19], so yielding a grey-level
barcode as a 1-dimensional texture. We analyse such textures in Sect. 13.6.5.
The largest 3-dimensional stochastic structure is the cosmological void distribu-
tion, which is observable via radio astronomy [1]. More familiar three-dimensional
stochastic porous materials include metallic (Fig. 13.2) and plastic solid foams,
geological strata and dispersions in gels, observable via computer tomography [1].
Near-planar, non-woven stochastic fibre networks are manufactured for a variety of
applications such as, at the macroscale for printing, textiles, reinforcing, and fil-
tration and at the nanoscale in medicine. Figure 13.3 shows a selection of electron
micrographs for networks at different scales. Radiography or optical densitometry
yields areal density images of the kinds shown in Fig. 13.4.
Much analytic work has been done on modelling of the statistical geometry of sto-
chastic fibrous networks [1, 6, 7, 17]. Using complete sampling by square cells, their
areal density distribution is typically well represented by a log-gamma or a (trun-
cated) Gaussian distribution of variance that decreases monotonically with increasing
cell size; the rate of decay is dependent on fibre and fibre cluster dimensions. They
have gamma void size distributions with a long tail. Clustering of fibres is well-
approximated by Poisson processes of Poisson clusters of differing density and size.
13 Dimensionality Reduction for Classification of Stochastic Texture Images 369

Saccharomyces Cerevisiae Amino Acids SC1

Fig. 13.1 Example of a 1-dimensional stochastic texture, a grey level barcode for the amino acid
sequence in a sample of the Saccharomyces cerevisiae yeast genome from the database [19]

Fig. 13.2 Aluminium foam with a narrow Gaussian-like distribution of void sizes of around 1 cm
diameter partially wrapped in fragmented metallic shells, used as crushable buffers inside vehicle
bodies. The cosmological void distribution is by contrast gamma-like with a long tail [8], inter-
spersed with 60 % of galaxies in large-scale sheets, 20 % in rich filaments and 20 % in sparse
filaments [12]. Such 3D stochastic porous materials can both be studied by tomographic meth-
ods, albeit at different scales by different technologies, yielding sequences of 2D stochastic texture
images

An unclustered Poisson process of single fibres is the standard reference structure
for any given size distribution of fibres; its statistical geometry is well-understood
for finite and infinite fibres. Note that any skewness associated with the underlying
point process of fibre centres becomes negligible through the process of sampling
by square cells [18].
Many stochastic textures arise from spatial processes that may be approximated by
mixtures of Poisson or other distributions of finite objects or clusters of objects, in an
analogous way to that which has been used for the past 50 years for the study of fibre
networks. The Central Limit Theorem suggests that often such spatial processes may
be represented by Gaussian pixel density distributions, with variance decreasing as
pixel size increases, the gradient of this decrease reflecting the size distributions and
abundances of the distributed objects and clusters, hence indicating the appropriate
pixel size to choose for feature extraction. Once a pixel size has been chosen then
we are interested in the statistics of three random variables: the mean density in
such pixels, and the mean densities of its first and second neighbouring pixels. The
correlations among these three random variables reflect the size and distribution of

Fig. 13.3 Electron micrographs of four stochastic fibrous materials. Top left Nonwoven carbon
fibre mat; top right glass fibre filter; bottom left electrospun nylon nanofibrous network (Courtesy
S. J. Eichhorn and D. J. Scurr); bottom right paper using wood cellulose fibres—typically flat
ribbonlike, of length 1–2 mm and width 0.02–0.03 mm

Fig. 13.4 Areal density radiographs of three paper networks made from natural wood cellulose
fibres, with constant mean coverage, c̄ ≈ 20 fibres, but different distributions of fibres. Each image
represents a square region of side length 5 cm; darker regions correspond to higher coverage. The
left image is similar to that expected for a Poisson process of the same fibres, so typical real samples
exhibit clustering of fibres

Fig. 13.5 Trivariate distribution of pixel density values for radiograph of a 5 cm square newsprint
sample. Left source density map; centre histogram of β̃i , β̃1,i and β̃2,i ; right 3D scatter plot of β̃i ,
β̃1,i and β̃2,i

density clusters; this may be extended to more random variables by using also third,
fourth, etc., neighbours. In some cases, of course, other pixel density distributions
may be more appropriate, such as mixtures of Gaussians.

13.2 Spatial Covariance

The mean of a random variable p is its average value, p̄, over the population. The
covariance Cov(p, q) of a pair of random variables, p and q, is a measure of the
degree of association between them, the difference between their mean product and
the product of their means:

$$\mathrm{Cov}(p, q) = \overline{p\,q} - \bar{p}\,\bar{q}. \tag{13.1}$$

In particular, the covariance of a variable with itself is its variance. From the array of
local average pixel density values β̃i , we generate two numbers associated with each:
the average density of the six first-neighbour pixels, β̃1,i and the average density of
the 16 second-neighbour pixels, β̃2,i . Thus, we have a trivariate distribution of the
random variables (β̃i , β̃1,i , β̃2,i ) with β̄2 = β̄1 = β̄.
Figure 13.5 provides an example of a typical data set obtained from a radiograph
of a 5 cm square commercial newsprint sample; the histogram and three-dimensional
scatter plot show data obtained for pixels of side 1 mm.
From the Central Limit Theorem, we expect the marginal distributions of β̃i ,
β̃1,i and β̃2,i to be well approximated by Gaussian distributions. For the example in
Fig. 13.5, these Gaussians are represented by the solid lines on the histogram; this
Gaussian approximation holds for all samples investigated in this study.
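
The three variables described above can be extracted from a density array along the following lines. This is a minimal sketch rather than the authors' code: it assumes square pixels, takes the first neighbours to be the ring of 8 pixels around the centre and the second neighbours to be the next ring of 16, and skips a two-pixel border. The name `neighbour_statistics` is hypothetical, and a different first-neighbour set (such as the six quoted in the text) would need a different mask.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def neighbour_statistics(density):
    """Mean vector and 3x3 covariance of (centre, first ring, second ring)."""
    density = np.asarray(density, dtype=float)
    # Sums over 5x5 and 3x3 boxes centred on each interior pixel.
    box5 = sliding_window_view(density, (5, 5)).sum(axis=(2, 3))
    box3 = sliding_window_view(density, (3, 3)).sum(axis=(2, 3))[1:-1, 1:-1]
    centre = density[2:-2, 2:-2]
    first = (box3 - centre) / 8.0    # assumed ring of 8 first neighbours
    second = (box5 - box3) / 16.0    # ring of 16 second neighbours
    samples = np.stack([centre.ravel(), first.ravel(), second.ravel()])
    return samples.mean(axis=1), np.cov(samples)
```

Each row of `samples` is one random variable, so `np.cov` returns the 3 × 3 spatial covariance used in the later sections.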
We have a simulator for creating stochastic fibre networks [10]. The code works by
dropping clusters of fibres within a circular region where the centre of each cluster is
distributed as a point Poisson process in the plane and the number of fibres per cluster,

Fig. 13.6 Simulated areal density maps each representing a 4 cm × 4 cm region formed from fibres
with length λ = 1 mm, to a mean coverage of 6 fibres

n_c, is a Poisson distributed random variable. The size of each cluster is determined
by an intensity parameter, 0 < I ≤ 1, such that the mean mass per unit area of the
cluster is constant and less than the areal density of a fibre. Denoting the length and
width of a fibre by λ and ω respectively, the radius of a cluster containing n_c fibre
centres is

$$r = \sqrt{\frac{n_c\,\lambda\,\omega}{\pi I}}. \tag{13.2}$$

Figure 13.6 shows examples of density maps generated by the simulator. We
observe textures that increase in ‘cloudiness’ with n_c and increase in ‘graininess’
with I.
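
The cluster geometry of Eq. (13.2) can be sketched as follows. This is an illustration under stated assumptions, not the simulator of [10]: the square sampling window, the rate parameter `clusters_per_area` and the function names are all hypothetical, and only cluster centres, fibre counts and radii are generated (no fibres are laid down).

```python
import numpy as np


def cluster_radius(n_c, fibre_length, fibre_width, intensity):
    """Radius of a cluster holding n_c fibre centres, Eq. (13.2)."""
    if not 0.0 < intensity <= 1.0:
        raise ValueError("intensity must satisfy 0 < I <= 1")
    return np.sqrt(n_c * fibre_length * fibre_width / (np.pi * intensity))


def drop_clusters(rng, side, clusters_per_area, mean_fibres,
                  fibre_length, fibre_width, intensity):
    """Poisson cluster centres in a square of the given side (assumed window)."""
    n_clusters = rng.poisson(clusters_per_area * side ** 2)
    centres = rng.uniform(0.0, side, size=(n_clusters, 2))
    n_c = rng.poisson(mean_fibres, size=n_clusters)   # fibres per cluster
    radii = cluster_radius(n_c, fibre_length, fibre_width, intensity)
    return centres, n_c, radii
```

By construction I·πr² = n_c λ ω, so a smaller intensity I spreads the same fibres over a larger disc.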

13.3 Analytic Covariance for Spatial Poisson Processes of Finite Objects

Consider a Poisson process in the plane for finite rectangles of length λ and width
ω ≤ λ, with uniform orientation of rectangle axes to a fixed direction. The covariance
or autocorrelation function for such objects is known and given by [7]:

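For 0 < r ≤ ω

$$\alpha_1(r) = 1 - \frac{2}{\pi}\left(\frac{r}{\lambda} + \frac{r}{\omega} - \frac{r^2}{2\omega\lambda}\right). \tag{13.3}$$

For ω < r ≤ λ

$$\alpha_2(r) = \frac{2}{\pi}\left(\arcsin\frac{\omega}{r} - \frac{\omega}{2\lambda} - \frac{r}{\omega} + \sqrt{\frac{r^2}{\omega^2} - 1}\right). \tag{13.4}$$

For λ < r ≤ √(λ² + ω²)

$$\alpha_3(r) = \frac{2}{\pi}\left(\arcsin\frac{\omega}{r} - \arccos\frac{\lambda}{r} - \frac{\omega}{2\lambda} - \frac{\lambda}{2\omega} - \frac{r^2}{2\lambda\omega} + \sqrt{\frac{r^2}{\lambda^2} - 1} + \sqrt{\frac{r^2}{\omega^2} - 1}\right). \tag{13.5}$$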
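
A direct transcription of the three branches (13.3) to (13.5) is sketched below. The function name is hypothetical, and the value 0 returned beyond the rectangle diagonal is an assumption added here (two points further apart than the diagonal cannot lie in the same rectangle).

```python
import math


def point_autocorrelation(r, omega, lam):
    """Point autocorrelation for random rectangles, width omega <= length lam."""
    if not 0.0 < omega <= lam:
        raise ValueError("require 0 < omega <= lam")
    if r <= 0.0:
        return 1.0
    c = 2.0 / math.pi
    if r <= omega:                                    # Eq. (13.3)
        return 1.0 - c * (r / lam + r / omega - r * r / (2.0 * omega * lam))
    root_w = math.sqrt(max(0.0, r * r / omega ** 2 - 1.0))
    if r <= lam:                                      # Eq. (13.4)
        return c * (math.asin(omega / r) - omega / (2.0 * lam)
                    - r / omega + root_w)
    if r <= math.hypot(lam, omega):                   # Eq. (13.5)
        root_l = math.sqrt(max(0.0, r * r / lam ** 2 - 1.0))
        return c * (math.asin(omega / r) - math.acos(lam / r)
                    - omega / (2.0 * lam) - lam / (2.0 * omega)
                    - r * r / (2.0 * lam * omega) + root_l + root_w)
    return 0.0                                        # assumed: beyond the diagonal
```

The branches agree at r = ω and r = λ and the last one vanishes at the diagonal, which is a convenient check on any transcription of these formulae.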

Then, the coverage c at a point is the number of rectangles overlapping that point,
a Poisson variable with grand mean value c̄, and the average coverage or density
in finite pixels c̃ tends to a Gaussian random variable. For sampling of the process
using, say square inspection pixels of side length x, the variance of their density c̃(x)
is

$$\mathrm{Var}(\tilde{c}(x)) = \mathrm{Var}(c(0)) \int_0^{\sqrt{2}\,x} \alpha(r, \omega, \lambda)\, b(r)\, dr \tag{13.6}$$

where b is the probability density function for the distance r between two points
chosen independently and at random in the given type of pixel; it was derived by
Ghosh [13].

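Using square pixels of side length x, for 0 ≤ r ≤ x

$$b(r, x) = \frac{4r}{x^4}\left(\frac{\pi x^2}{2} - 2rx + \frac{r^2}{2}\right). \tag{13.7}$$

For x ≤ r ≤ √2 x

$$b(r, x) = \frac{4r}{x^4}\, x^2\left(\arcsin\frac{x}{r} - \arccos\frac{x}{r}\right) + \frac{4r}{x^4}\left(2x\sqrt{r^2 - x^2} - \frac{1}{2}\left(r^2 + 2x^2\right)\right). \tag{13.8}$$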

Fig. 13.7 Probability density function b(r, 1) from Eqs. (13.7), (13.8) for the distance r between
two points chosen independently and at random in a unit square

A plot of this function is given in Fig. 13.7. Observe that, for vanishingly small
pixels, that is points, b degenerates into a delta function on r = 0. Ghosh [13] gave
also the form of b for other types of pixels; for arbitrary rectangular pixels those
expressions can be found in [7]. For small values of r, so r ≪ D, the formulae for
convex pixels of area A and perimeter P all reduce to

$$b(r, A, P) = \frac{2\pi r}{A} - \frac{2 P r^2}{A^2}$$

which would be appropriate to use when the rectangle dimensions ω, λ are small
compared with the dimensions of the pixel.
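
The density of Eqs. (13.7) and (13.8) and its small-r form can be checked numerically with a sketch like the one below. The function names are hypothetical; the code only transcribes the formulae above.

```python
import math


def square_pair_distance_pdf(r, x=1.0):
    """Density of the distance r between two random points in a square of side x."""
    if r < 0.0 or r > math.sqrt(2.0) * x:
        return 0.0
    pre = 4.0 * r / x ** 4
    if r <= x:                                        # Eq. (13.7)
        return pre * (math.pi * x * x / 2.0 - 2.0 * r * x + r * r / 2.0)
    return pre * (x * x * (math.asin(x / r) - math.acos(x / r))   # Eq. (13.8)
                  + 2.0 * x * math.sqrt(r * r - x * x)
                  - 0.5 * (r * r + 2.0 * x * x))


def small_r_pdf(r, area, perimeter):
    """Small-r approximation for a convex pixel of given area and perimeter."""
    return 2.0 * math.pi * r / area - 2.0 * perimeter * r * r / area ** 2
```

For a square of side x (A = x², P = 4x) the exact and approximate forms differ only by the cubic term 2r³/x⁴ on 0 ≤ r ≤ x.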
It helps to visualize practical variance computations by considering the case of
sampling using large square pixels of side mx say, which themselves consist of
exactly m² small square pixels of side x. The variance Var(c̃(mx)) is related to
Var(c̃(x)) through the covariance Cov(x, mx) of x-pixels in mx-pixels [7]:

$$\mathrm{Var}(\tilde{c}(mx)) = \frac{1}{m^2}\,\mathrm{Var}(\tilde{c}(x)) + \frac{m^2 - 1}{m^2}\,\mathrm{Cov}(x, mx).$$

As m → ∞, the small pixels tend towards points, (1/m²) Var(c̃(x)) → 0 so Var(c̃(mx))
admits interpretation as Cov(0, mx), the covariance among points inside mx-pixels,
the intra-pixel covariance, precisely Var(c̃(mx)) from Eq. (13.6).
The fractional between pixel variance for x-pixels is

$$\tilde{\rho}(x) = \frac{\mathrm{Cov}(0, x)}{\mathrm{Var}(c(0))} = \frac{\mathrm{Var}(\tilde{c}(x))}{\mathrm{Var}(c(0))}$$

which increases monotonically with λ and with ω but decreases monotonically with
mx, see Deng and Dodson [6] for more details. In fact, for a Poisson process
of rectangles the variance of coverage at points is precisely the mean coverage,
V ar (c(0)) = c̄, so if we agree to measure coverage as a fraction of the mean cover-
age then Eq. (13.6) reduces to the integral

$$\mathrm{Var}(\tilde{c}(x)) = \int_0^{\sqrt{2}\,x} \alpha(r, \omega, \lambda)\, b(r)\, dr = \tilde{\rho}(x). \tag{13.9}$$

Now, the covariance among points inside mx-pixels, Cov(0, mx), is the expec-
tation of the covariance between pairs of points separated by distance r, taken over
the possible values for r in an mx-pixel; that amounts to the integral in Eq. (13.6).
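By this means we have continuous families of 2 × 2 covariance matrices for x ∈ ℝ⁺
and 2 < m ∈ ℤ⁺ given by

$$\Delta^{x,m} = \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{pmatrix} = \begin{pmatrix} \mathrm{Var}(\tilde{c}(x)) & \mathrm{Cov}(x, mx) \\ \mathrm{Cov}(x, mx) & \mathrm{Var}(\tilde{c}(x)) \end{pmatrix} = \begin{pmatrix} \tilde{\rho}(x) & \tilde{\rho}(mx) \\ \tilde{\rho}(mx) & \tilde{\rho}(x) \end{pmatrix} \tag{13.10}$$

which encodes information about the spatial structure formed from the Poisson
process of rectangles, for each choice of rectangle dimensions ω ≤ λ ∈ ℝ⁺. This can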
be extended to include mixtures of different rectangles with given relative abundances
and processes of more complex objects such as Poisson clusters of rectangles.
There is a one dimensional version of the above that is discussed in [6, 7], with
point autocorrelation calculated easily as

$$\alpha(r) = \begin{cases} 1 - \dfrac{r}{\lambda} & 0 \le r \le \lambda \\ 0 & \lambda < r. \end{cases} \tag{13.11}$$

Also, the probability density function for points chosen independently and at
random with separation r in a pixel, which is here an interval of length x, is

$$b(r) = \frac{2}{x}\left(1 - \frac{r}{x}\right) \quad (0 \le r \le x). \tag{13.12}$$
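Then the integral (13.6) gives the fractional between pixel variance as

$$\tilde{\rho}(x, \lambda) = \begin{cases} 1 - \dfrac{x}{3\lambda} & 0 \le x \le \lambda \\[2mm] \dfrac{\lambda}{x}\left(1 - \dfrac{\lambda}{3x}\right) & \lambda < x. \end{cases} \tag{13.13}$$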

So in the case of a one dimensional stochastic texture from a Poisson process of
segments of length λ we have the explicit expression for the covariance matrices in
Eq. (13.10):

$$\Delta^{x,m}(\lambda) = \begin{pmatrix} \tilde{\rho}(x, \lambda) & \tilde{\rho}(mx, \lambda) \\ \tilde{\rho}(mx, \lambda) & \tilde{\rho}(x, \lambda) \end{pmatrix}. \tag{13.14}$$

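In particular, if we take unit length intervals as the base pixels, for the Poisson
process of unit length line segments, x = λ = 1 we obtain

$$\Delta^{1,m}(1) = \begin{pmatrix} 1 - \frac{1}{3} & \frac{1}{m}\left(1 - \frac{1}{3m}\right) \\[1mm] \frac{1}{m}\left(1 - \frac{1}{3m}\right) & 1 - \frac{1}{3} \end{pmatrix} \quad \text{for } m = 2, 3, \ldots. \tag{13.15}$$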
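
The one dimensional case is simple enough to verify end to end. The sketch below evaluates the closed form (13.13), compares it with a midpoint-rule evaluation of the integral of (13.11) against (13.12), and assembles the matrices of (13.14). Function names are hypothetical.

```python
import numpy as np


def rho_1d(x, lam):
    """Fractional between pixel variance, closed form of Eq. (13.13)."""
    if x <= 0.0 or lam <= 0.0:
        raise ValueError("x and lam must be positive")
    if x <= lam:
        return 1.0 - x / (3.0 * lam)
    return (lam / x) * (1.0 - lam / (3.0 * x))


def rho_1d_numeric(x, lam, n=30000):
    """Midpoint-rule value of the integral of alpha(r) b(r) over [0, x]."""
    h = x / n
    r = (np.arange(n) + 0.5) * h
    alpha = np.where(r <= lam, 1.0 - r / lam, 0.0)    # Eq. (13.11)
    b = (2.0 / x) * (1.0 - r / x)                     # Eq. (13.12)
    return float(np.sum(alpha * b) * h)


def segment_covariance(x, m, lam):
    """2x2 covariance matrix of Eq. (13.14) for base pixel x and multiple m."""
    d, o = rho_1d(x, lam), rho_1d(m * x, lam)
    return np.array([[d, o], [o, d]])
```

For x = λ = 1 and m = 2 this gives diagonal entries 2/3 and off-diagonal entries 5/12, in agreement with (13.15).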

13.4 Information Distance

Given the family of pixel density distributions, with associated spatial covariance
structure among neighbours, we can use the Fisher metric [1] to yield an arc length
function on the curved space of parameters which represent mean and covariance
matrices. Then the information distance between any two such distributions is given
by the length of the shortest curve between them, a geodesic, in this space. The
computational difficulty is in finding the length of this shortest curve since it is the
infimum over all curves between the given two points. Fortunately, in the cases we
need, multivariate Gaussians, this problem has been largely solved analytically by
Atkinson and Mitchell [2].
Accordingly, some of our illustrative examples use information geometry of
trivariate Gaussian spatial distributions of pixel density with covariances among
first and second neighbours to reveal features related to sizes and density of clusters,
which could arise in one, two or three dimensions. For isotropic spatial processes,
which we consider here, the variables are means over shells of first and second neigh-
bours, respectively. For anisotropic networks the neighbour sets would be split into
more new variables to pick up the spatial anisotropy in the available spatial directions.
Other illustrations will use the analytic bivariate covariances given in Sect. 13.3
by Eq. (13.10).
What we know analytically is the geodesic distance between two multivariate
Gaussians, A, B, of the same number n of variables in two particular cases [2]:
1. μ^A ≠ μ^B, Δ^A = Δ^B = Δ: f^A = (n, μ^A, Δ), f^B = (n, μ^B, Δ)

$$D_\mu(f^A, f^B) = \left(\mu^A - \mu^B\right)^T \cdot \Delta^{-1} \cdot \left(\mu^A - \mu^B\right). \tag{13.16}$$

2. μ^A = μ^B = μ, Δ^A ≠ Δ^B: f^A = (n, μ, Δ^A), f^B = (n, μ, Δ^B)

$$D_\Delta(f^A, f^B) = \sqrt{\frac{1}{2} \sum_{j=1}^{n} \log^2(\lambda_j)}, \tag{13.17}$$

Fig. 13.8 Plot of D_Δ(f^A, f^B) from (13.17) against Φ_Δ(f^A, f^B) from (13.18) for 185 different
trivariate Gaussian covariance matrices

with {λ_j} = Eig((Δ^A)^{−1/2} · Δ^B · (Δ^A)^{−1/2}).

In the present paper we use Eqs. (13.16) and (13.17) and take the simplest choice
of a linear combination of both when both mean and covariance are different.
However, from the form of D_Δ(f^A, f^B) in (13.17) we deduce that an approximate
monotonic relationship arises with a more easily computed symmetrized log-trace
function given by

$$\Phi_\Delta(f^A, f^B) = \sqrt{\log\left(\frac{1}{2n}\left(\mathrm{Tr}\left((\Delta^A)^{-1/2} \cdot \Delta^B \cdot (\Delta^A)^{-1/2}\right) + \mathrm{Tr}\left((\Delta^B)^{-1/2} \cdot \Delta^A \cdot (\Delta^B)^{-1/2}\right)\right)\right)}. \tag{13.18}$$

This is illustrated by the plot in Fig. 13.8 of D_Δ(f^A, f^B) from Eq. (13.17) on
Φ_Δ(f^A, f^B) from Eq. (13.18) for 185 trivariate Gaussian covariance matrices, where
we see that

$$D_\Delta(f^A, f^B) \approx 1.7\, \Phi_\Delta(f^A, f^B).$$

A commonly used approximation for information distance is obtained from the
Kullback–Leibler divergence, or relative entropy. Between two multivariate Gaussians
f^A = (n, μ^A, Δ^A), f^B = (n, μ^B, Δ^B) with the same number n of variables,
its square root gives a separation measurement [16]:

$$KL(f^A, f^B) = \frac{1}{2} \log\left(\frac{\det \Delta^B}{\det \Delta^A}\right) + \frac{1}{2}\, \mathrm{Tr}\left[(\Delta^B)^{-1} \cdot \Delta^A\right] + \frac{1}{2} \left(\mu^A - \mu^B\right)^T \cdot (\Delta^B)^{-1} \cdot \left(\mu^A - \mu^B\right) - \frac{n}{2}. \tag{13.19}$$

This is not symmetric, so to obtain a distance we take the average KL-distance in
both directions:

$$D_{KL}(f^A, f^B) = \frac{\sqrt{|KL(f^A, f^B)|} + \sqrt{|KL(f^B, f^A)|}}{2} \tag{13.20}$$

The Kullback–Leibler distance tends to the information distance as two distributions
become closer together; conversely it becomes less accurate as they move apart.
For comparing relative proximity, Φ_Δ(f^A, f^B) is a better measure near zero
than the symmetrized Kullback–Leibler D_KL(f^A, f^B) distance in those multivariate
Gaussian cases so far tested and may be computationally quicker for handling large
batch processes.

13.5 Dimensionality Reduction of Spatial Density Arrays

We shall illustrate the differences of spatial features in given data sets obtained
from the distribution of local density for real and simulated planar stochastic fibre
networks. In such cases there is benefit in mutual information difference comparisons
of samples in the set but the difficulty is often the large number of samples in a set
of interest, perhaps a hundred or more. Human brains can do this very well:
enormous numbers of optical sensors stream information from the eyes into the
brain, with the result that we have a 3-dimensional reduction which serves to help us
‘see’ the external environment. We want to see a large data set organised in such a
way that natural groupings are revealed and quantitative dispositions among groups
are preserved. The problem is how to present the information contained in the whole
data set, each sample yielding a 3 × 3 covariance matrix Δ and mean μ. The optimum
presentation is to use a 3-dimensional plot, but the question is what to put on the
axes.
To solve this problem we use multi-dimensional scaling, or dimensionality reduc-
tion, to extract the three most significant features from the set of samples so that all
samples can be displayed graphically in a 3-dimensional plot. The aim is to reveal
groupings of data points that correspond to the prominent characteristics; in our
context we have different former types, grades and differing scales and intensities
of fibre clustering. Such a methodology has particular value in the quality control
for processes with applications that frequently have to study large data sets of sam-
ples from a trial or through a change in conditions of manufacture or constituents.
Moreover, it can reveal anomalous behaviour of a process or unusual deviation in a

product. The raw data of one sample from a study of spatial variability might typi-
cally consist of a spatial array of 250 × 250 pixel density values, so what we solve
is a problem in classification for stochastic image textures.
The method, which we introduced in a preliminary report [11], depends on extract-
ing the three largest eigenvalues and their eigenvectors from a matrix of mutual infor-
mation distances among distributions representing the samples in the data set. The
number in the data set is unimportant, except for the computation time in finding
eigenvalues. This follows the methods described by Carter et al. [4, 5]. Our study
is for datasets of pixel density arrays from complete sampling of density maps of
stochastic textures which incorporate spatial covariances. We report the results of
such work on a large collection of radiographs from commercial papers made from
continuous filtration of cellulose and other fibres [9].
The series of computational stages is as follows:
1. Obtain mutual ‘information distances’ D(i, j) among the members of the data set
of N textures X 1 , X 2 , . . . , X N using the fitted trivariate Gaussian pixel density
distributions.
2. The array of N × N differences D(i, j) is a real symmetric matrix with zero
diagonal. This is centralized by subtracting row and column means and then
adding back the grand mean to give CD(i, j).
3. The centralized matrix CD(i, j) is again a real symmetric matrix, now with zero
row and column sums. We compute its N eigenvalues ECD(i), which are necessarily
real, and the N corresponding N-dimensional eigenvectors VCD(i).
4. Make a 3 × 3 diagonal matrix A of the first three eigenvalues of largest absolute
magnitude and a 3 × N matrix B of the corresponding eigenvectors. The matrix
product A · B yields a 3 × N matrix and its transpose is an N × 3 matrix T, which
gives us N coordinate values (xi , yi , z i ) to embed the N samples in 3-space.
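
Stages 2 to 4 above can be sketched in a few lines of NumPy. The function name `embed_3d` is hypothetical, and the code follows the listed stages literally (centring the distance matrix itself and scaling eigenvectors by their eigenvalues); it is not presented as the authors' implementation. It assumes the input matrix is symmetric with at least three rows.

```python
import numpy as np


def embed_3d(distances):
    """Embed N samples in 3-space from an N x N symmetric distance matrix."""
    D = np.asarray(distances, dtype=float)
    if D.ndim != 2 or D.shape[0] != D.shape[1] or D.shape[0] < 3:
        raise ValueError("need a square matrix with at least 3 rows")
    # Stage 2: subtract row and column means, add back the grand mean.
    CD = D - D.mean(axis=1, keepdims=True) - D.mean(axis=0, keepdims=True) + D.mean()
    # Stage 3: real eigenvalues and orthonormal eigenvectors of a symmetric matrix.
    eigenvalues, eigenvectors = np.linalg.eigh(CD)
    # Stage 4: keep the three eigenvalues of largest absolute magnitude.
    order = np.argsort(np.abs(eigenvalues))[::-1][:3]
    A = np.diag(eigenvalues[order])          # 3 x 3
    B = eigenvectors[:, order].T             # 3 x N
    return (A @ B).T                         # N x 3 coordinates (x_i, y_i, z_i)
```

Each column of the result is an eigenvector scaled by its eigenvalue, so the three axes are ordered by decreasing prominence.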
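Example: Bivariate Gaussians

$$\begin{aligned}
f(x, y) &= \frac{1}{2\pi\sqrt{\Phi}} \exp\left(-\frac{1}{2\Phi}\left[(y - \mu_2)^2 \sigma_{11} + (x - \mu_1)\left((x - \mu_1)\sigma_{22} + 2(-y + \mu_2)\sigma_{12}\right)\right]\right), \\
\mu &= (\mu_1, \mu_2), \\
\Phi &= \mathrm{Det}[\Delta] = \sigma_{11}\sigma_{22} - \sigma_{12}^2, \\
\Delta &= \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{pmatrix} = \sigma_{11}\begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix} + \sigma_{12}\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} + \sigma_{22}\begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix}, \\
\Delta^{-1} &= \begin{pmatrix} \frac{\sigma_{22}}{\Phi} & -\frac{\sigma_{12}}{\Phi} \\ -\frac{\sigma_{12}}{\Phi} & \frac{\sigma_{11}}{\Phi} \end{pmatrix}.
\end{aligned}$$

Put δμ_i = (μ_i^A − μ_i^B).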


Then we have
380 C. T. J. Dodson and W. W. Sampson

Dμ ( f A , f B ) = δμT · Δ −1 · δμ

δμ2 (σ11 δμ2 − σ12 δμ1 ) δμ1 (σ22 δμ1 − σ12 δμ2 )
= + .
Φ Φ

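Numerical example:

$$\Delta^A = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad \Delta^B = \begin{pmatrix} 3 & 2 \\ 2 & 6 \end{pmatrix}, \quad (\Delta^B)^{-1} = \begin{pmatrix} 3/7 & -1/7 \\ -1/7 & 3/14 \end{pmatrix}$$

$$(\Delta^A)^{-1/2} \cdot \Delta^B \cdot (\Delta^A)^{-1/2} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 3 & 2 \\ 2 & 6 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} 3 & 2 \\ 2 & 6 \end{pmatrix},$$

with eigenvalues: λ1 = 7, λ2 = 2.

$$D_\Delta(\Delta^A, \Delta^B) = \sqrt{\frac{1}{2} \sum_{j=1}^{n} \log^2(\lambda_j)} \approx 1.46065$$

$$\Phi_\Delta(\Delta^A, \Delta^B) = \sqrt{\log\frac{7 + 2}{4}} \approx 0.9005.$$

For comparison, the symmetrized Kullback–Leibler distance [16] is given by

$$D_{KL}(\Delta^A, \Delta^B) = \frac{1}{2}\left(\sqrt{\frac{1}{2}\log 14 - \frac{19}{28}} + \sqrt{\frac{1}{2}\log\frac{1}{14} + \frac{7}{2}}\right) \approx 1.1386.$$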
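
The covariance-only distances of this section can be evaluated with a short NumPy sketch such as the following. The function names are hypothetical. Two assumptions are made loudly here: the log-trace helper takes the square root of the logarithm, and the symmetrized Kullback–Leibler helper averages the square roots of the two directed divergences, because those are the forms that reproduce the scale of the numerical values quoted in the example.

```python
import numpy as np


def _inv_sqrt(S):
    """Inverse symmetric square root of a positive definite matrix."""
    w, V = np.linalg.eigh(S)
    return (V / np.sqrt(w)) @ V.T


def covariance_geodesic_distance(SA, SB):
    """D_Delta of Eq. (13.17) for equal means."""
    R = _inv_sqrt(SA)
    lam = np.linalg.eigvalsh(R @ SB @ R)
    return float(np.sqrt(0.5 * np.sum(np.log(lam) ** 2)))


def log_trace_distance(SA, SB):
    """Symmetrized log-trace function, using both trace terms of (13.18)."""
    n = SA.shape[0]
    RA, RB = _inv_sqrt(SA), _inv_sqrt(SB)
    t = np.trace(RA @ SB @ RA) + np.trace(RB @ SA @ RB)
    return float(np.sqrt(max(np.log(t / (2 * n)), 0.0)))   # assumed square root


def kl_gaussian(muA, SA, muB, SB):
    """Directed Kullback-Leibler divergence of Eq. (13.19)."""
    n = SA.shape[0]
    d = np.asarray(muA, dtype=float) - np.asarray(muB, dtype=float)
    SB_inv = np.linalg.inv(SB)
    return float(0.5 * np.log(np.linalg.det(SB) / np.linalg.det(SA))
                 + 0.5 * np.trace(SB_inv @ SA) + 0.5 * d @ SB_inv @ d - 0.5 * n)


def symmetrised_kl_distance(muA, SA, muB, SB):
    """Average of the square-rooted divergences in both directions (assumed form)."""
    return 0.5 * (np.sqrt(abs(kl_gaussian(muA, SA, muB, SB)))
                  + np.sqrt(abs(kl_gaussian(muB, SB, muA, SA))))
```

For the identity matrix and the matrix with rows (3, 2), (2, 6) this gives about 1.46065 for the geodesic distance and about 1.1386 for the symmetrized Kullback–Leibler distance. With both trace terms the log-trace helper returns about 0.938; the quoted 0.9005 corresponds to keeping only (7 + 2)/4 inside the logarithm.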

13.6 Analysis of Samples

13.6.1 Analytic Results for Poisson Processes of Line Segments and Rectangles

We provide here some graphics showing three dimensional embeddings of Poisson
processes that yield stochastic textures of pixel density, using the analysis in
Sect. 13.3.
Figure 13.9 shows an embedding of 20 samples calculated for a Poisson line
process of line segments, (13.15), with x = λ = 1 and m = 2, 3, . . . 21. The starting
green point in the lower right is for m = 2 and the red end point is for m = 21.
Figure 13.10 shows an embedding of 18 samples calculated for a planar Poisson
process of unit squares, from (13.10), with ω = λ = 1. It shows the separation into
two groups of samples: analysed with small base pixels, x = 0.1 right, and with large
base pixels, x = 1 left. Figure 13.11 shows an embedding of 18 samples calculated

Fig. 13.9 Embedding of 20 evaluations of information distance for the bivariate covariances arising
from a Poisson line process of line segments, (13.15), with x = λ = 1 and m = 2, 3, . . . 21. The
starting green point in the lower right is for m = 2 and the red end point is for m = 21

Fig. 13.10 Embedding of 18 evaluations of information distance for the bivariate covariances
arising from a planar Poisson process of squares, (13.10), with ω = λ = 1. The two groups arise
from different schemes of inspection pixels. Right group used small base pixels with x = 0.1, from
blue to pink m = 2, 3, . . . , 10; left group used large base pixels with x = 1, from green to red
m = 2, 3, . . . , 10

Fig. 13.11 Embedding of 22 evaluations of information distance for the bivariate covariances
arising from a planar Poisson process of rectangles, (13.10), with ω = 0.2, λ = 1. The two groups
arise from different schemes of inspection pixels. Left group used large base pixels x = 1, from
green to red m = 2, 3, . . . , 10; right group used small base pixels x = 0.1, from blue to pink
m = 2, 3, . . . , 10

for a planar Poisson process of rectangles with aspect ratio 5:1, from (13.10), with
ω = 0.2, λ = 1. Again it shows the separation into two groups of samples analysed
with small pixels, right, and with large pixels, left.

13.6.2 Deviations from Poisson Arising from Clustering

Our three spatial variables for each spatial array of data are the mean density in a cen-
tral pixel, mean of its first neighbours, and mean of its second neighbours. We begin
with analysis of a set of 16 samples of areal density maps for simulated stochastic
fibre networks made from the same number of 1 mm fibres but with differing scales
(clump sizes) and intensities (clump densities) of fibre clustering. Among these is the
standard unclustered Poisson fibre network; all samples have the same mean density.
Figure 13.12 gives analyses for spatial arrays of pixel density differences from
Poisson networks. It shows a plot of D_Δ(f^A, f^B) as a cubic-smoothed surface (left),
and the same data grouped by numbers of fibres in clusters and cluster densities
(right), for geodesic information distances among 16 datasets of 1 mm pixel density
differences between a Poisson network and simulated networks made from 1 mm
fibres. Each network has the same mean density but with different scales and densities


Fig. 13.12 Pixel density differences from Poisson networks. Left plot of D_Δ(f^A, f^B) as a cubic-
smoothed surface, for trivariate Gaussian information distances among 16 datasets of 1 mm pixel
density differences between a Poisson network and simulated networks made from 1 mm fibres,
each network has the same mean density but with different clustering. Right embedding of the same
data grouped by numbers of fibres in clusters and cluster densities

of clustering; thus the mean difference is zero in this case. Using pixels of the order of
fibre length is appropriate for extracting information on the sizes of typical clusters.
The embedding reveals the clustering features as orthogonal subgroups.
Next, Fig. 13.13 gives analyses for pixel density arrays of the clustered networks.
It shows on the left the plot of D_Δ(f^A, f^B) as a cubic-smoothed surface
for trivariate Gaussian information distances among the 16 datasets of 1 mm pixel
densities for simulated networks made from 1 mm fibres, each network with the same
mean density but with different clustering. In this case the trivariate Gaussians all
have the same mean vectors. Shown on the right is the dimensionality reduction
embedding of the same data grouped by numbers of fibres in clusters and cluster
densities; the solitary point is a Poisson network of the same fibres.

13.6.3 Effect of Mean Density in Poisson Structures

Figure 13.14 gives analyses for pixel density arrays for Poisson networks of different
mean density. It shows the plot of D_Δ(f^A, f^B) as a cubic-smoothed surface (left),
for trivariate Gaussian information distances among 16 simulated Poisson networks
made from 1 mm fibres, with different mean density, using pixels at 1 mm scale. Also
shown (right) is the dimensionality reduction embedding of the same Poisson network
data, showing the effect of mean network density.


Fig. 13.13 Pixel density arrays for clustered networks: Left plot of D_Δ(f^A, f^B) as a cubic-
smoothed surface, for trivariate Gaussian information distances among 16 datasets of 1 mm pixel
density arrays for simulated networks made from 1 mm fibres, each network with the same mean
density but with different clustering. Right embedding of the same data grouped by numbers of
fibres in clusters and cluster densities; the solitary point is an unclustered Poisson network


Fig. 13.14 Pixel density arrays for Poisson networks of different mean density. Left plot of
D_Δ(f^A, f^B) as a cubic-smoothed surface, for trivariate Gaussian information distances
among 16 simulated Poisson networks made from 1 mm fibres, with different mean density, using
pixels at 1 mm scale. Right embedding of the same Poisson network data, showing the effect of
mean network density


Fig. 13.15 Embedding using 182 trivariate Gaussian distributions for samples from the data set [9].
Blue points are from gap formers; orange are various handsheets, purple are from pilot paper
machines and green are from hybrid formers. The embedding separates these different forming
methods into subgroups

13.6.4 Analysis of Commercial Samples

Figure 13.15 shows a 3-dimensional embedding for a data set from [9] including
182 paper samples from gap formers, handsheets, pilot machine samples and hybrid
formers. We see that to differing degrees the embedding separates these different and
very disparate forming methods by assembling them into subgroups. This kind of
discrimination could be valuable in evaluating trials, comparing different installations
of similar formers and for identifying anomalous behaviour.
The benefit from these analyses is the representation of the important structural
features of number of fibres per cluster and cluster density, by almost orthogonal
subgroups in the embedding.

13.6.5 Analysis of Saccharomyces Cerevisiae Yeast and Human Genomes

This yeast is the genome studied in [3] for which we showed that all 20 amino
acids along the protein chains exhibited mutual clustering, and separations of 3–12
are generally favoured between repeated amino acids, perhaps because this is the

Fig. 13.16 Determinants of 12-variate spatial covariances for 20 samples of yeast amino acid
sequences, black Y, together with three Poisson sequences of 100,000 amino acids with the yeast
relative abundances, blue RY. Also shown are 20 samples of human sequences, red H, and three
Poisson sequences of 100,000 amino acids with the human relative abundances, green RH.

Fig. 13.17 Twelve-variate spatial covariance embeddings for 20 samples of yeast amino acid
sequences, small black points, together with three Poisson sequences of 100,000 amino acids with
the yeast relative abundances, large blue points. Also shown are 20 human DNA sequences, medium
red points, and three Poisson sequences of 100,000 amino acids with the human relative abundances,
large green points

usual length of secondary structure, cf. also [1]. The database of sample sequences
is available on the Saccharomyces Genome Database [19]. Here we mapped the
sequences of the 20 amino acids A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V,
W, Y onto the 20 grey-level values 0.025, 0.075, . . . , 0.975 so yielding a grey-level
barcode for each sequence, Fig. 13.1. Given the usual length of secondary structure
to range from 3 to 12 places along a sequence, we used spatial covariances between
each pixel and its successive 12 neighbours. Figure 13.16 plots the determinants of
the 12-variate spatial covariances of 20 samples for yeast, black Y, together with three Poisson
random sequences of 100,000 amino acids with the yeast relative abundances, blue
RY. Also shown are 20 samples of human sequences, red H, and three Poisson
sequences of 100,000 amino acids with the human relative abundances, green RH.
Figure 13.17 shows an embedding of these 20 12-variate spatial covariances for yeast,
small black points, together with three Poisson sequences of 100,000 amino acids
with the yeast relative abundances, large blue points, and 20 human DNA sequences,
medium red points using data from the NCBI Genbank Release 197.0 [15], and three
Poisson sequences of 100,000 amino acids with the human relative abundances,
large green points. The sequences ranged in length from 340 to 1,900 amino acids.
As with the original analysis of recurrence spacings [3] which revealed clustering,
the difference of the yeast and human sequence structures from Poisson is evident.
However, it is not particularly easy to distinguish yeast from human sequences by
this technique: both lie in a convex region, with the Poisson sequences just outside,
but there is much scatter. Further analyses of genome structures will be reported
elsewhere.
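
As a rough illustration of the procedure described above, the following Python sketch maps a sequence to its grey-level barcode, forms a 12-variate covariance from the 12 successive neighbours of each position, and evaluates its determinant. The function names, the reading of the 12 variables as the neighbours at lags 1 to 12, the simulation of the random comparison sequences by independent draws, and the uniform abundances in the usage lines are all assumptions made for illustration; they are not taken from the computations behind Figs. 13.16 and 13.17.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Grey levels 0.025, 0.075, ..., 0.975, one per amino acid in the order above.
GREY = {aa: (2 * k + 1) / 40 for k, aa in enumerate(AMINO_ACIDS)}


def barcode(sequence):
    """Map an amino acid string to its list of grey levels."""
    return [GREY[aa] for aa in sequence]


def neighbour_covariance(levels, n=12):
    """Sample covariance of the n successive neighbours of each position.

    Assumption: variable j is the grey level j places further along the
    sequence (j = 1..n); positions without n neighbours are skipped.
    """
    rows = [levels[i + 1:i + 1 + n] for i in range(len(levels) - n)]
    m = len(rows)
    means = [sum(r[j] for r in rows) / m for j in range(n)]
    return [[sum((r[j] - means[j]) * (r[k] - means[k]) for r in rows) / (m - 1)
             for k in range(n)] for j in range(n)]


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]  # work on a copy
    size = len(a)
    det = 1.0
    for c in range(size):
        p = max(range(c, size), key=lambda r: abs(a[r][c]))
        if a[p][c] == 0.0:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            det = -det  # a row swap changes the sign
        det *= a[c][c]
        for r in range(c + 1, size):
            f = a[r][c] / a[c][c]
            for k in range(c, size):
                a[r][k] -= f * a[c][k]
    return det


def random_sequence(abundances, length, rng):
    """Independent draws of amino acids with the given relative abundances."""
    letters = list(abundances)
    weights = [abundances[aa] for aa in letters]
    return "".join(rng.choices(letters, weights=weights, k=length))


if __name__ == "__main__":
    rng = random.Random(0)
    # Placeholder abundances: uniform, NOT the yeast or human values.
    uniform = {aa: 1 / 20 for aa in AMINO_ACIDS}
    seq = random_sequence(uniform, 2000, rng)
    cov = neighbour_covariance(barcode(seq))
    print(len(cov), determinant(cov))
```

Replacing the uniform placeholder with measured relative abundances, and the simulated string with a real protein sequence, would give one determinant per sequence of the kind plotted in Fig. 13.16, under the assumptions stated above.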

References

1. Arwini, K., Dodson, C.T.J.: Information geometry near randomness and near independence.
In: Sampson, W.W. (eds.) Stochastic Fibre Networks (Chapter 9), pp. 161–194. Lecture Notes
in Mathematics, Springer-Verlag, New York, Berlin (2008)
2. Atkinson, C., Mitchell, A.F.S.: Rao’s distance measure. Sankhya: Indian J. Stat. Ser. A 48(3),
345–365 (1981)
3. Cai, Y., Dodson, C.T.J., Wolkenhauer, O., Doig, A.J.: Gamma distribution analysis of protein
sequences shows that amino acids self cluster. J. Theor. Biol. 218(4), 409–418 (2002)
4. Carter, K.M., Raich, R., Hero, A.O.: Learning on statistical manifolds for clustering and
visualization. In: 45th Allerton Conference on Communication, Control, and Computing,
Monticello, Illinois (2007). https://wiki.eecs.umich.edu/global/data/hero/images/c/c6/Kmcarter-learnstatman.pdf
5. Carter, K.M.: Dimensionality reduction on statistical manifolds. Ph.D. thesis, University of
Michigan (2009). http://tbayes.eecs.umich.edu/kmcarter/thesis
6. Deng, M., Dodson, C.T.J.: Paper: An Engineered Stochastic Structure. Tappi Press, Atlanta
(1994)
7. Dodson, C.T.J.: Spatial variability and the theory of sampling in random fibrous networks. J.
Roy. Statist. Soc. B 33(1), 88–94 (1971)
8. Dodson, C.T.J.: A geometrical representation for departures from randomness of the
intergalactic void probability function. In: Workshop on Statistics of Cosmological Data Sets,
NATO-ASI, Isaac Newton Institute, 8–13 August 1999. http://arxiv.org/abs/0811.4390

9. Dodson, C.T.J., Ng, W.K., Singh, R.R.: Paper: stochastic structure analysis archive. Pulp and
Paper Centre, University of Toronto (1995) (3 CDs)
10. Dodson, C.T.J., Sampson, W.W.: In Advances in Pulp and Paper Research, Oxford 2009. In:
I’Anson, S.J., (ed.) Transactions of the XIVth Fundamental Research Symposium, pp. 665–691.
FRC, Manchester (2009)
11. Dodson, C.T.J., Sampson, W.W.: Dimensionality reduction for classification of stochastic fibre
radiographs. In Proceedings of GSI2013—Geometric Science of Information, Paris, 28–30:
Lecture Notes in Computer Science 8085. Springer-Verlag, Berlin (August 2013)
12. Doroshkevich, A.G., Tucker, D.L., Oemler, A., Kirshner, R.P., Lin, H., Shectman, S.A., Landy,
S.D., Fong, R.: Large- and superlarge-scale structure in the Las Campanas Redshift Survey. Mon.
Not. R. Astr. Soc. 283(4), 1281–1310 (1996)
13. Ghosh, B.: Random distances within a rectangle and between two rectangles. Calcutta Math.
Soc. 43(1), 17–24 (1951)
14. Mardia, K.V., Kent, J.T., Bibby, J.M.: Multivariate Analysis. Academic Press, London (1980)
15. NCBI Genbank of The National Center for Biotechnology Information. Samples from
CCDS_protein. 20130430.faa.gz. ftp://ftp.ncbi.nlm.nih.gov/genbank/README.genbank
16. Nielsen, F., Garcia, V., Nock, R.: Simplifying Gaussian mixture models via entropic quanti-
zation. In: Proceedings of 17th European Signal Processing Conference, Glasgow, Scotland
24–28 August 2009, pp. 2012–2016
17. Sampson, W.W.: Modelling Stochastic Fibre Materials with Mathematica. Springer-Verlag,
Berlin, New York (2009)
18. Sampson, W.W.: Spatial variability of void structure in thin stochastic fibrous materials. Mod.
Sim. Mater. Sci. Eng. 20:015008 pp13 (2012). doi:10.1088/0965-0393/20/1/015008
19. Saccharomyces Cerevisiae Yeast Genome Database. http://downloads.yeastgenome.org/sequence/S288C_reference/orf_protein/
Index

Symbols
α-Conformally equivalent, 34, 62, 89
α-Conformally flat, 34, 62
α-Hessian, 1, 2, 9, 10, 22–25, 28, 29
α-connection, 63, 88
α-divergence, 75
χ-Fisher metric, 73
χ-cross entropy, 74
χ-cross entropy of Bregman type, 72
χ-cubic form, 73
χ-divergence, 72, 73
χ-exponential connection, 73
χ-exponential function, 66
χ-logarithm function, 65
χ-mixture connection, 73
χ-score function, 69, 74
η-potential, 60
θ-potential, 60

A
Affine connection, 246
Affine coordinate system, 59
Affine harmonic, 83
Affine hypersurface theory, 42
Algebraic estimator, 124
Algebraic statistics, 119
Alignment distance, 237
Aluminium foam, 369
Alzheimer’s disease, 266
Amino acids, 368
Anomalous behaviour, 378
Areal density arrays, 368
Autocorrelation, 375

B
Balian quantum metric, 179
Beta-divergence, 45
Bias corrected χ-score function, 72
Bias corrected q-score function, 73
Bivariate Gaussian, 379
Buchberger’s algorithm, 137

C
1-conformally equivalent, 75
1-conformally flat, 75
Canonical divergence, 35, 60
Carl Ludwig Siegel, 147, 162, 196
Cartan–Killing form, 206, 207
Cartan–Schouten connections, 248
Cartan–Siegel domains, 193
CEM algorithm, 306
Central limit theorem, 371
Centroid, 283
Clustered networks, 383
Clustering, 367
Clusters of fibres, 371
Complete lattice, 331, 332, 334, 346, 348, 349, 353, 354, 364, 365
Computational anatomy, 274
Constant curvature, 34
Contact geometry, 141, 148, 165, 167, 168
Contrast function, 61
Controllable realization, 232
Convex cones, 94, 141, 146, 147, 151, 161
Convex function, 1, 13, 26, 29
Cosmological voids, 369
Cotton yarn, 368
Cross entropy, 65
Crouzeix relation, 157
Cubic form, 59, 60, 64

F. Nielsen (ed.), Geometric Theory of Information, Signals and Communication Technology,
DOI: 10.1007/978-3-319-05317-2, © Springer International Publishing Switzerland 2014

Currents, 278, 286
Curvature, 246
Curvature tensor, 58
Curved q-exponential family, 78

D
Deformed exponential family, 66, 71, 74
Deformed exponential function, 66
Deformed logarithm function, 65
Density map, 372
Deviation from Poisson, 382
Diffeomorphism, 276
Dimensionality reduction, 367
Distance structure, 368
Distribution-valued, 363, 365
Distribution-valued images, 331, 334, 353, 357, 361, 362, 365
Divergence function, 1, 2, 16–19, 22, 25
Divergences
    Bregman divergence, 304
    Cauchy–Schwartz divergence, 321
    Kullback-Leibler divergence, 304
    LogDet divergence, 318
Dual connection, 58, 87
Dual coordinate system, 60
Dually flat space, 60

E
e-(exponential) representation, 65
Efficient estimator
    first order, 123
    second order, 123
Eigenvalue, 379
Eigenvector, 379
EM algorithm, 305
e-(mixture) representation, 65
Equiaffine geometry, 28
Escort distribution, 67
Euler-Lagrange equation, 84
Exponential connection, 63
Exponential family, 64, 302
    algebraic curved, 124
    curved, 121
    full, 121
    MLE for an, 303
Extrinsic distance, 222

F
Fibre network simulator, 371
First and second neighbours, 367
Fisher information matrix, 63
Fisher metric, 63, 376
Flat, 59
Foliated structure, 39
Fréchet distance, 144, 202–204
Fréchet mean, 238, 283
François Massieu, 146, 169

G
Gaussian distribution, 331
Gaussian distribution-valued images, 334, 353, 362
Generalized dynamic factor models, 228
Generalized entropy functional, 71
Generalized Massieu potential, 71
Generalized relative entropy, 74
Generating functions, 142, 146, 164, 209
Genome, human, 387
Genome, yeast, 385
Genomes, 367
Geodesic, 247
Geodesic shooting, 278, 290
Geometric temperature, 141, 174, 187
Grey-level barcode, 368
Gröbner basis, 137
Gromov inner product, 142, 147, 205–207
Group action induced distances, 234

H
Harmonic function, 83
Harmonic map, 84
Harmonic map relative to (g, ∇̂), 85
Harmonic map relative to (h, ∇, ∇̄), 90
Hellinger distance, 225
Henri Poincaré, 146, 182
Hessian manifold, 59
Hessian structure, 59
Hessian domain, 34
High-dimensional stochastic processes, 220
High-dimensional time series, 220
Hippocampus, 288
Homotopy continuation method, 138
Hyperbolic, 332, 334, 335, 337, 338, 340, 341, 343, 347, 348, 358, 361, 363, 364
Hyperbolic partial ordering, 332, 345–347, 349, 353, 355, 356, 358

I
Induced Hessian manifold, 61
Induced statistical manifold, 61

Information geometry, 141–143, 145, 146, 148–151, 187, 193, 194, 204, 205
Information geometry image filtering, 332, 334, 341, 342, 365
Innovations process, 226
Intrinsic distance, 222
Invariant statistical manifold, 64, 75
Itakura–Saito divergence, 228
Iterative centroid, 284

J
Jean-Louis Koszul, 145, 147, 151, 159, 162, 205
Jean-Marie Souriau, 146, 147, 174, 176

K
Kähler geometry, 1, 15
Kähler affine manifold, 83
Kähler affine structure, 83
k-MLE algorithm, 306
    Hartigan’s method for, 308
    Initialization of, 310, 312
    Lloyd’s method for, 307
Koszul characteristic function, 145, 146, 151, 157
Koszul Entropy, 142, 143, 146, 148, 151, 153, 156, 157
Koszul forms, 160
Kullback–Leibler, 377
Kullback-Leibler divergence, 64

L
Laplace Principle, 146, 163, 164, 180, 181
Laplacian, 83
Laplacian of the gradient mapping, 88
LDDMM, 256, 275
Legendre transform, 86
Levi-Civita connection, 249
Lie groups, 245
Linear dynamical systems, 220
Linear predictive coding, 220
Liouville measure, 176
Lookeng Hua, 197, 198

M
Marginal distributions, 371
Mathematical morphology, 331, 332
Maurice Fréchet, 144, 147, 148, 202, 204
Maximum likelihood estimator (MLE), 120, 303
Maximum q-likelihood estimator, 77
MIMO systems, 234
Minimal realization, 232
Minimum phase filter, 227
Misha Gromov, 148, 206
Mixture connection, 63
Mixture model, 301
Moment map, 174, 209
Multidimensional scaling, 367

N
Nanofibrous network, 370
Normal form, 129
Normalized Tsallis relative entropy, 75

O
Observable realization, 232
One-parameter subgroups, 246
Optimal mass transport, 225
Ordered Poincaré half-plane, 334

P
Parallel transport, 249
Pattern recognition, 220
Pierre Duhem, 146
Pixel density arrays, 379
Poisson clusters, 368
Poisson line process, 380
Poisson process, 367, 380
Poisson process of rectangles, 375
Poisson rectangle process, 382
Pole ladder, 252
Positive definite matrix, 141, 147, 187
Power spectral density matrix, 224
Power potential, 39
Principal fiber bundle, 233
Pythagorean relation, 43

Q
Q-Covariant derivative, 114
Q-cubic form, 73
Q-divergence functional, 112
Q-exponential, 66
Q-exponential family, 75
Q-exponential function, 98
Q-exponential manifold, 108, 110
Q-exponential model, 105, 108, 110
Q-Fisher metric, 73
Q-independent, 76
q-likelihood function, 77

Q-log-likelihood function, 77
Q-logarithm, 66
Q-normal distributions, 68
Q-product, 76
Quality control, 378
Quotient space, 233

R
Radiographs of paper, 370
Realization balancing, 235
Realization bundle, 233
Reduction of the structure group, 235
Relative entropy, 64
Riemannian metric, 113
Riemannian symmetric space, 41

S
Schild’s ladder, 251
Simulated fibre networks, 383
Smooth canonical forms, 234
Spatial covariance, 367, 371
Spatial distributions, 367
Spectral factorization, 226
Stationary velocity field, 256
Statistical manifold, 59, 87
Statistical model, 62
Statistical structure, 1, 5, 7, 16, 28
Statistical (Codazzi) manifold, 34
Stochastic networks, 368
Stochastic textures, 367
Symmetric cone, 95
Symmetric matrix, 379
Symplectic geometry, 1, 2, 15

T
Tall, full rank linear dynamical system, 232
Template estimation, 274, 281
Thermodynamics, 141, 145, 146, 165, 168–173, 176, 178, 184
Toeplitz matrix, 142, 147, 193
Torsion, 246
Torsion tensor, 58

U
U-divergence, 46, 72
U-model, 46
Universal barrier, 162

V
Vectorial ARMA models, 221
V-Potential, 36

W
Wishart distribution, 313
    MLE for the, 315

Y
Yeast genome, 369
Yoke, 65
