
International Statistical Review (2021), 89, 2, 250–273 doi: 10.1111/insr.12464

Information Geometry
Shun-ichi Amari1,2
1ACRO, Teikyo University, Tokyo, Japan
2RIKEN Center for Brain Science, Saitama, Japan
E-mail: [email protected]

Summary
Statistical inference is constructed upon a statistical model consisting of a parameterised family
of probability distributions, which forms a manifold. It is important to study the geometry of the
manifold. It was Professor C. R. Rao who initiated information geometry in his monumental paper
published in 1945. It not only included fundamentals of statistical inference such as the
Cramér–Rao theorem and Rao–Blackwell theorem but also proposed differential geometry of a
manifold of probability distributions. It is a Riemannian manifold where Fisher–Rao information
plays the role of the metric tensor. It took decades for the importance of the geometrical structure
to be recognised. The present article reviews the structure of the manifold of probability distributions and its applications, and shows how Professor Rao's original idea has been developed and popularised across the statistical sciences, including AI, signal processing and the physical sciences.

Key words: information geometry; Fisher–Rao information; dual affine connections; generalised
Pythagorean theorem.

1 Introduction
Information geometry studies the structure of a regular statistical model, which forms a
manifold M. It consists of probability distributions parameterised by an m-dimensional vector,
where the parameters constitute a coordinate system. It is natural to ask how different two distributions in M are, that is, to seek a distance or divergence between two distributions.
It was Professor Rao’s monumental paper (Rao, 1945) that answered the question. He
proposed a fundamental theory to show that M is a Riemannian manifold, where the Fisher
information matrix plays the role of the Riemannian metric. At the same time, Rao presented
a fundamental theorem in statistics, which was later called the Cramér–Rao theorem, because
it was also independently proved by Cramér (1946). Rao calculated the Riemannian distance
between two Gaussian distributions with different means and variances. The present review
summarises Information Geometry initiated by C. R. Rao to show how Rao’s idea has been
developed.
The geometrical approach was so fundamental that it took decades for researchers to develop
his idea. Chentsov (1982) followed the idea and answered the question of why the Fisher
information should be used, by proposing an invariance criterion. The Fisher information matrix
which Rao used is unique from the point of view of invariance under Markovian morphisms of
statistical models. He further showed that a third-order symmetric tensor exists and is also
unique. These two tensors together define invariant affine connections to be introduced in M.
© 2021 International Statistical Institute.
17515823, 2021, 2, Downloaded from https://onlinelibrary.wiley.com/doi/10.1111/insr.12464 by Fujita Health University, Wiley Online Library on [18/03/2024]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License

A higher-order asymptotic theory of statistical inference was proposed by Rao (1962), where he showed that the maximum-likelihood estimator (MLE) retains the maximum amount of information up to higher orders. When the number of observations n is large, the MLE was known to be an optimal estimator, established by expanding the estimation error in a series in 1/n, where 'higher order' refers to the terms of O(1/n^2). By the early 1980s, higher-order
asymptotic theories had been developed widely. Efron (1975) proposed a geometrical theory
that the higher-order quantities are related to statistical curvatures, by developing the idea of
Fisher’s unpublished research memorandum. Dawid (1975) supplemented the idea of Efron
by proposing the mixture affine connection, whereas Efron used the exponential affine
connection. Both belong to the class of invariant affine connections proposed by
Chentsov (1982).
Following the ideas of Efron (1975) and Dawid (1975), Amari (1982) developed a
geometrical theory of higher-order asymptotics, where two types of statistical curvatures play
a fundamental role. The geometrical structure is the same as that of Chentsov (1982). Nagaoka
and Amari (1982) further developed a dual theory of information geometry, where the two
affine connections are shown to be dually coupled. In particular, the geometry of an exponential
family of probability distributions is shown to be a dually flat Riemannian manifold. Such a
manifold has nice geometrical properties; for instance, the generalised Pythagorean theorem
and dual projection theorems hold. These properties are very useful for applications to various
problems, not only in statistics but in machine learning, signal processing, game theory, physics
and so on (see Amari, 2016).
The present paper celebrating Professor Rao’s 100th anniversary reviews the geometry of
manifolds of probability distributions initiated by C. R. Rao and suggests some useful
applications. Section 2 defines a statistical manifold and dual geometry composed on it.
Section 3 focusses on a dually flat manifold based on the exponential family of distributions.
Section 4 touches upon estimation functions in a semiparametric statistical model. A solution
is given to the Neyman–Scott problem. Section 5 deals with hierarchical statistical models.
Finally, Section 6 provides conclusions.

2 Statistical Manifold and Dual Geometry


Let

$M = \{p(x;\xi)\}$  (1)

be a regular statistical model parameterised by an m-dimensional parameter vector $\xi = (\xi^1, \ldots, \xi^m)$, where $p(x;\xi)$ is the probability density function of a random variable x. M is an m-dimensional manifold, where $\xi$ is a (local) coordinate system.
The score function $s(x;\xi)$ is an m-dimensional vector whose components are defined by the derivatives of the log probability,

$s_i(x;\xi) = \partial_i \log p(x;\xi), \quad i = 1, 2, \ldots, m,$  (2)

where $\partial_i = \partial/\partial\xi^i$.
This is a fundamental quantity in statistics. For n independent observations $x_1, \ldots, x_n$, the MLE $\hat\xi$ is the maximiser of the log likelihood and is obtained as the solution of the likelihood equations

Figure 1. Coordinate curves and tangent vectors.

$\sum_{j=1}^{n} s_i(x_j;\xi) = 0, \quad i = 1, \ldots, m.$  (3)
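A minimal numerical sketch (my example, not the paper's): for the unit-variance Gaussian model with log density $-\frac12(x-\xi)^2$ plus a constant, the score is $s(x;\xi) = x - \xi$, and the likelihood equation (3) is solved exactly by the sample mean.

```python
import random

def score(x, xi):
    # Score of the unit-variance Gaussian N(xi, 1): d/dxi log p(x; xi) = x - xi.
    return x - xi

random.seed(0)
xs = [random.gauss(1.5, 1.0) for _ in range(10_000)]

# Likelihood equation (3): sum_j score(x_j, xi) = 0  =>  xi_hat = sample mean.
xi_hat = sum(xs) / len(xs)
assert abs(sum(score(x, xi_hat) for x in xs)) < 1e-6
```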

The score satisfies

$E[s_i(x;\xi)] = 0,$  (4)

where E denotes expectation with respect to $p(x;\xi)$. The score is a vector representing how the log likelihood changes as $\xi$ changes from $\xi$ to $\xi + d\xi$,

$d\log p(x;\xi) = \sum_i s_i(x;\xi)\, d\xi^i.$  (5)

A component $s_i(x;\xi)$ of the score is regarded as the tangent vector $e_i$ of M in the direction of the coordinate curve $\xi^i$; see Figure 1.
This implies that the geometrical tangent vector

$e_i = \frac{\partial}{\partial \xi^i}$  (6)

is represented by the score $s_i(x;\xi)$, which is a random variable. The tangent vector $d\xi = (d\xi^i)$ is the score

$d\xi = \sum_i d\xi^i e_i = \sum_i d\xi^i s_i(x;\xi).$  (7)

When we identify a small change $d\xi$ with the corresponding small increment of the log likelihood,

$d\log p(x;\xi) = \sum_i d\xi^i e_i,$  (8)

we have the random variable representation of tangent vectors,



$e_i = s_i(x;\xi).$  (9)

We compose a tensor $g = (g_{ij})$ by

$g_{ij}(\xi) = \langle e_i, e_j \rangle = E[s_i(x;\xi)\, s_j(x;\xi)],$  (10)

where $\langle e_i, e_j \rangle$ is the inner product of two tangent vectors and $g = (g_{ij})$ is the Fisher information matrix. Rao (1945) used it as the Riemannian metric tensor of M, so that the squared length of a small change $d\xi$ in the parameters is defined by

$ds^2 = \sum g_{ij}\, d\xi^i d\xi^j,$  (11)

showing that $ds^2$ represents the expected increment in the square of the log likelihood. We further compose a third-order symmetric tensor $T = (T_{ijk})$ in a similar way,

$T_{ijk} = E[s_i(x;\xi)\, s_j(x;\xi)\, s_k(x;\xi)].$  (12)

Thus, a statistical manifold M is equipped with two tensors, g and T.
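Definition (10) can be checked by hand on a one-parameter example (the Bernoulli model, my choice for illustration), where the expectation is a two-term sum and recovers the familiar Fisher information $1/(p(1-p))$.

```python
import math

def score(x, p):
    # d/dp log p(x; p) for the Bernoulli model p^x (1-p)^(1-x), x in {0, 1}.
    return x / p - (1 - x) / (1 - p)

def fisher_info(p):
    # g(p) = E[score(x, p)^2], eq. (10), expectation taken over x in {0, 1}.
    return sum(px * score(x, p) ** 2 for x, px in [(1, p), (0, 1 - p)])

p = 0.3
assert math.isclose(fisher_info(p), 1.0 / (p * (1.0 - p)))
```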


A statistic t(x) is said to be sufficient for $\xi$ when the probability density of x given $\xi$ can be written in the form

$p(x;\xi) = p_1(t;\xi)\, p_2(x),$  (13)

where $p_2$ does not depend on $\xi$. Intuitively, t(x) is sufficient for estimating $\xi$, because the latter factor does not include $\xi$. We consider another statistical model $M'$,

$M' = \{p(t;\xi)\},$  (14)

where, by an abuse of notation, we let $p(t;\xi)$ denote the probability density of t given $\xi$. Since the score vector in $M'$ is equal to that in M, M and $M'$ are geometrically the same. In particular, a reversible transformation $y = f(x)$ of x is a sufficient statistic. This implies that the geometrical structures g and T do not depend on the representation of the random variable, x or f(x).
We further require that the geometrical structure be constructed so that it is invariant under the transformation of x to its sufficient statistics. Chentsov (1982) imposed this requirement using Markov morphisms in the discrete case and proved a fundamental theorem: the invariance defines the second-order and third-order symmetric tensors uniquely, and they are g and T given in (10) and (12), up to a common scale.
The manifold equipped with the Fisher information Riemannian metric is not Euclidean but curved in general. This is shown by calculating the Riemann–Christoffel curvature, using the Levi-Civita affine connection

$\Gamma^0_{ijk} = [ij,k] = \frac{1}{2}\left(\partial_i g_{jk} + \partial_j g_{ik} - \partial_k g_{ij}\right)$  (15)

calculated from $(g_{ij})$, as is usual in Riemannian geometry. However, by using T, we modify it to give two new invariant affine connections, $\Gamma$ and $\Gamma^*$,

$\Gamma_{ijk} = [ij,k] - \frac{1}{2} T_{ijk},$  (16)

$\Gamma^*_{ijk} = [ij,k] + \frac{1}{2} T_{ijk}.$  (17)

A geodesic $\xi(t)$ is a trajectory satisfying

$\frac{d^2 \xi^i(t)}{dt^2} + \sum_{j,k} \Gamma^i_{jk} \frac{d\xi^j}{dt} \frac{d\xi^k}{dt} = 0$  (18)

when an affine connection $\Gamma$ is defined, where $\Gamma^i_{jk} = \sum_m g^{im}\Gamma_{mjk}$. Since there are two affine connections, we have another type of geodesic satisfying

$\frac{d^2 \xi^i(t)}{dt^2} + \sum_{j,k} \Gamma^{*i}_{jk} \frac{d\xi^j}{dt} \frac{d\xi^k}{dt} = 0.$  (19)

Let

$X = \sum_i X^i e_i$  (20)

be a tangent vector defined at $\xi$. The parallel transport of a tangent vector $X = \sum X^i e_i$ along a curve

$c: \xi^i(t)$  (21)

is defined by the vector field $X(t)$ on the curve satisfying

$\frac{d}{dt} X^i(t) + \sum_{m,j,k} \Gamma_{mjk}\, g^{mi} X^j \frac{d\xi^k}{dt} = 0,$  (22)

where $(g^{mi})$ is the inverse of $(g_{mi})$; see Figure 2.


We have two parallel transports, since we have two affine connections, $\Gamma$ and $\Gamma^*$; denote them by $X(t)$ and $X^*(t)$, respectively. The inner product of vectors X and Y is given by

$\langle X, Y \rangle = E\!\left[\sum_i X^i e_i \sum_j Y^j e_j\right] = \sum_{i,j} X^i Y^j g_{ij}.$  (23)

When the inner product of $X(t)$ and $X^*(t)$ does not change under the two parallel transports of X,

$\langle X(t), X^*(t) \rangle = \mathrm{const.},$  (24)

the two affine connections are said to be dually coupled. It is straightforward to prove the following theorem (Amari & Nagaoka, 2000).

Theorem 1. The two affine connections $\Gamma$ and $\Gamma^*$ are dually coupled.

A statistical manifold is not flat in general. When it is flat, we have a local coordinate system $\xi$ in which the affine connection vanishes,

$\Gamma_{ijk}(\xi) = 0.$  (25)


A geodesic line is linear in these coordinates,

$\xi^i(t) = c^i t + d^i$  (26)

for constants $c^i$ and $d^i$.

Figure 2. Parallel transport X(t).

Theorem 2. For a statistical manifold M, when it is flat with respect to one affine connection, it is
automatically flat with respect to the dual affine connection.

We now define a divergence and the related dual structure. A function $D[p(x;\xi) : p(x;\xi')]$ (or $D[\xi : \xi']$) is called a divergence when the following criteria are satisfied:

(1) $D[\xi : \xi'] \ge 0$;
(2) $D[\xi : \xi'] = 0$ when and only when $\xi = \xi'$;
(3) in the Taylor expansion $D[\xi : \xi + d\xi] = \frac{1}{2}\sum g_{ij}\, d\xi^i d\xi^j$ for infinitesimally small $d\xi$, neglecting higher-order terms, $(g_{ij})$ is a positive-definite matrix.

A divergence is said to be invariant when it is unchanged if we use a sufficient statistic t(x) instead of x. It is in general asymmetric, $D[\xi : \xi'] \ne D[\xi' : \xi]$, and is regarded as the square of an (asymmetric) distance between $\xi$ and $\xi'$. Rao proposed a divergence (Burbea & Rao, 1982) and studied the related Riemannian geometry. He also used a class of divergences to elucidate the performance of M-estimators (Bai et al., 1982).
A typical example is the Kullback–Leibler (KL-) divergence,

$D_{KL}[\xi : \xi'] = \int p(x;\xi) \log \frac{p(x;\xi)}{p(x;\xi')}\, dx,$  (27)

whose $(g_{ij})$ is the Fisher information matrix (hence the Riemannian metric tensor). Another example is the f-divergence (Csiszár, 1974; Morimoto, 1963), defined by using a convex function f(u) satisfying

$f(1) = 0, \quad f''(1) = 1$  (28)

as

$D[\xi : \xi'] = \int p(x;\xi)\, f\!\left(\frac{p(x;\xi')}{p(x;\xi)}\right) dx.$  (29)

It includes the $\alpha$-divergence defined by

$f_\alpha(u) = \frac{4}{1-\alpha^2}\left(1 - u^{\frac{1+\alpha}{2}}\right).$  (30)

The KL-divergence is the special case of the limit $\alpha \to -1$, $f(u) = -\log u$.
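A small numerical check (my own sketch) of the discrete analogue of (29) and (30): as $\alpha \to -1$, the $\alpha$-divergence between two probability vectors approaches the KL-divergence.

```python
import math

def f_alpha(u, a):
    # f_alpha(u) = 4/(1 - a^2) * (1 - u**((1 + a)/2)), eq. (30).
    return 4.0 / (1.0 - a * a) * (1.0 - u ** ((1.0 + a) / 2.0))

def f_divergence(p, q, f):
    # Discrete version of eq. (29): D[p : q] = sum_x p(x) f(q(x)/p(x)).
    return sum(pi * f(qi / pi) for pi, qi in zip(p, q))

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

p = [0.2, 0.5, 0.3]
q = [0.4, 0.4, 0.2]
# As alpha -> -1, the alpha-divergence tends to the KL-divergence.
d = f_divergence(p, q, lambda u: f_alpha(u, -1.0 + 1e-6))
assert abs(d - kl(p, q)) < 1e-4
```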
A divergence induces a Riemannian metric together with dually coupled affine connections (Eguchi, 1983). The Riemannian metric is given by

$g_{ij}(\xi) = \partial_i \partial_j D[\xi : \xi']\big|_{\xi'=\xi}.$  (31)

The dual pair of affine connections are

$\Gamma_{ijk}(\xi) = -\partial_i \partial_j \partial'_k D[\xi : \xi']\big|_{\xi'=\xi},$  (32)

$\Gamma^*_{ijk}(\xi) = -\partial_k \partial'_i \partial'_j D[\xi : \xi']\big|_{\xi'=\xi},$  (33)

where

$\partial'_k = \frac{\partial}{\partial \xi'^k}.$  (34)

By exchanging $\xi$ and $\xi'$,

$D^*[\xi : \xi'] = D[\xi' : \xi]$  (35)

is called the dual divergence. It gives the same Riemannian metric, with the two affine connections interchanged. When a divergence is symmetric, it is self-dual, giving a Riemannian metric with the self-dual Levi-Civita connection.

3 Exponential Family and Dually Flat Manifold


3.1 Exponential family
An exponential family of probability distributions has the canonical form of density

$p(x;\theta) = \exp\{\theta \cdot x - \psi(\theta)\}$  (36)

with respect to a dominating measure $d\mu(x)$, where the parameter $\theta = (\theta^1, \ldots, \theta^m)$ is called the natural or canonical parameter, and $x = (x_1, \ldots, x_m) \in R^m$ is a random variable. We use $\theta$, having upper indices $\theta^i$, instead of $\xi$. A typical example is the family of Gaussian distributions,


$p(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{-\frac{1}{2\sigma^2}(x-\mu)^2\right\},$  (37)

where the mean $\mu$ and variance $\sigma^2$ are the parameters. It can be rewritten in the canonical form (36) by using

$x = (x_1, x_2) = (x, x^2),$  (38)

$\theta^1 = \frac{\mu}{\sigma^2}, \quad \theta^2 = -\frac{1}{2\sigma^2},$  (39)

and the dominating measure is defined on $x_2 - x_1^2 = 0$.

The function $\psi(\theta)$,

$\psi(\theta) = \log \int \exp(\theta \cdot x)\, d\mu(x),$  (40)

corresponds to the normalisation constant. It is the cumulant generating function and is called the free energy in physics. It is a convex function, since $\partial_i \partial_j \psi(\theta)$ is positive-definite.

3.2 Geometry of an exponential family

The metric and cubic tensors are defined by (10) and (12). Calculations show that they are given by

$g_{ij}(\theta) = \partial_i \partial_j \psi(\theta),$  (41)

$T_{ijk}(\theta) = \partial_i \partial_j \partial_k \psi(\theta).$  (42)

Moreover, calculations show that the affine connection vanishes, that is,

$\Gamma_{ijk}(\theta) = 0$  (43)

in the $\theta$ coordinate system. We call $\theta$ an e-coordinate system, since it originates from the exponential family. This implies that M is flat and the affine connection vanishes in terms of $\theta$. M is automatically dually flat. Hence, we have another coordinate system $\eta = (\eta_1, \ldots, \eta_m)$ in which the dual affine connection $\Gamma^*$ vanishes. It is given by the well-known expectation parameters,

$\eta_i = E[x_i].$  (44)

We call it an m-coordinate system, because a mixture family of probability distributions is flat with respect to the mixture parameters. From (4), we have

$\eta_i = \partial_i \psi(\theta),$  (45)

showing that $\eta$ is obtained by the Legendre transformation (Barndorff-Nielsen, 1978). The Legendre dual of $\psi(\theta)$ is given by

$\varphi(\eta) = \max_\theta \{\theta \cdot \eta - \psi(\theta)\}$  (46)

and we have the identity connecting them,

$\psi(\theta) + \varphi(\eta) - \theta \cdot \eta = 0,$  (47)

when $\eta$ is given by (45).

We use lower indices for the dual affine coordinates $\eta = (\eta_i)$. The geometrical quantities are given in terms of the dual coordinate system $\eta$ as

$g^{ij}(\eta) = \partial^i \partial^j \varphi(\eta),$  (48)

$T^{ijk}(\eta) = \partial^i \partial^j \partial^k \varphi(\eta),$  (49)

$\partial^i = \frac{\partial}{\partial \eta_i},$  (50)

where

$\partial^i = g^{ij} \partial_j,$  (51)

and we use the Einstein summation convention: the summation symbol $\sum$ is omitted when the same index appears in a term once as an upper index and once as a lower index. Therefore,

$\partial^j = \sum_i g^{ij} \partial_i = g^{ij} \partial_i.$  (52)

Here, $(g^{ij})$ is the inverse of $(g_{ij})$.
The KL-divergence between two distributions $p(x;\theta)$ and $p(x;\theta')$ of an exponential family is

$D_{KL}[\theta' : \theta] = \psi(\theta) + \varphi(\eta') - \theta \cdot \eta',$  (53)

where $\eta'$ is the m-coordinates of $\theta'$. We easily see that the dually flat geometrical structures are derived from the KL-divergence.
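Identity (47) can be verified numerically on the Bernoulli family (my illustrative choice), where $\psi(\theta) = \log(1 + e^\theta)$, $\eta = \partial\psi/\partial\theta$ is the success probability, and the dual potential $\varphi(\eta)$ is the negative entropy.

```python
import math

def psi(theta):
    # Free energy of the Bernoulli family p(x; theta) = exp(theta*x - psi(theta)), x in {0, 1}.
    return math.log(1.0 + math.exp(theta))

def eta_of(theta):
    # Expectation parameter eta = psi'(theta), eq. (45).
    return math.exp(theta) / (1.0 + math.exp(theta))

def phi(eta):
    # Legendre dual, eq. (46); for the Bernoulli family it is the negative entropy.
    return eta * math.log(eta) + (1.0 - eta) * math.log(1.0 - eta)

theta = 0.7
eta = eta_of(theta)
# Identity (47): psi(theta) + phi(eta) - theta * eta = 0.
assert abs(psi(theta) + phi(eta) - theta * eta) < 1e-12
```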

3.3 Dually flat manifold

We now review the fundamental theory of a dually flat manifold, which includes the exponential family but is not limited to it.

Theorem 3 (Fundamental theorem of dually flat manifolds). When M is dually flat, the following hold:

(1) There exist two affine coordinate systems $\theta$ and $\eta$ with respect to the two dually flat affine connections, and two convex functions $\psi(\theta)$ and $\varphi(\eta)$. The dual affine connections satisfy

$\Gamma_{ijk}(\theta) = 0$  (54)

in the e-coordinate system and

$\Gamma^*_{ijk}(\eta) = 0$  (55)

in the m-coordinate system.

(2) The Riemannian metric is given by $g_{ij}(\theta) = \partial_i \partial_j \psi(\theta)$ in the e-coordinate system and $g^{ij}(\eta) = \partial^i \partial^j \varphi(\eta)$ in the m-coordinate system.

(3) The tangent vectors along the coordinate curves, given by

$e_i = \frac{\partial}{\partial \theta^i}, \quad e^i = \frac{\partial}{\partial \eta_i},$  (56)

are bi-orthogonal, $\langle e_i, e^j \rangle = \delta_i^j$.

(4) The cubic tensor is given by

$T_{ijk}(\theta) = \partial_i \partial_j \partial_k \psi(\theta),$  (57)

$T^{ijk}(\eta) = \partial^i \partial^j \partial^k \varphi(\eta).$  (58)

(5) There exists a unique divergence, called the canonical divergence,

$D[\theta' : \theta] = \psi(\theta) + \varphi(\eta') - \theta \cdot \eta',$  (59)

where $\eta'$ denotes the m-coordinates of $\theta'$.

The canonical divergence in a dually flat manifold of probability distributions is given by (27): it is the KL-divergence. The KL-divergence has been used in various fields such as statistics, information theory and statistical physics as a convenient measure of discrepancy between two distributions, without any explicitly stated reason. The present theory gives a justification for the KL-divergence, because it is the canonical divergence of a dually flat manifold. There are many other divergences, including the one proposed by Rao (Burbea & Rao, 1982). They are not invariant, having Riemannian metrics different from the Fisher information.
We remark that there are dually flat statistical manifolds other than the exponential family. One example is a mixture family of probability distributions,

$M = \left\{ p(x;\eta) = \sum_{i=0}^{m} \eta_i p_i(x) : \sum_i \eta_i = 1 \right\},$  (60)

where $p_0(x), \ldots, p_m(x)$ are linearly independent prescribed probability distributions. A mixture family is not an exponential family in general, but it is dually flat. Banerjee et al. (2005) showed that, under a certain regularity condition, a dually flat manifold M is related to an exponential family, provided the inverse Laplace transform of $\psi(\theta)$ exists.
Given a convex function $\psi(\theta)$, we can introduce a dually flat geometrical structure in which $\theta$ is a flat coordinate system and its dual is given by the Legendre transform

$\eta_i = \partial_i \psi(\theta).$  (61)
The canonical divergence can be rewritten as

$D[\theta' : \theta] = \psi(\theta) - \psi(\theta') - \nabla\psi(\theta') \cdot (\theta - \theta'),$  (62)

and its dual is

$D^*[\theta' : \theta] = D[\theta : \theta'] = \psi(\theta') - \psi(\theta) - \nabla\psi(\theta) \cdot (\theta' - \theta).$  (63)

This is called the Bregman divergence derived from a convex function.
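A quick check (illustrative, scalar parameter only) that the Bregman divergence (62) of the Bernoulli free energy $\psi(\theta) = \log(1 + e^\theta)$ reproduces the KL-divergence between the corresponding Bernoulli distributions.

```python
import math

def bregman(psi, grad_psi, th1, th0):
    # Eq. (62) for a scalar parameter: D[theta' : theta] with th1 = theta', th0 = theta.
    return psi(th0) - psi(th1) - grad_psi(th1) * (th0 - th1)

psi = lambda th: math.log(1.0 + math.exp(th))        # Bernoulli free energy
dpsi = lambda th: math.exp(th) / (1.0 + math.exp(th))  # eta = psi'(theta)

def kl_bernoulli(p, q):
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

th0, th1 = 0.3, -0.5
p0, p1 = dpsi(th0), dpsi(th1)
# The Bregman divergence of the free energy equals KL[p(x; theta') : p(x; theta)].
assert abs(bregman(psi, dpsi, th1, th0) - kl_bernoulli(p1, p0)) < 1e-12
```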

3.4 Pythagorean theorem and projection theorem

A dually flat manifold M has nice properties. Given three points P, Q, R ∈ M, we connect P and Q by the m-geodesic and Q and R by the e-geodesic. Let $\theta_P, \theta_Q, \theta_R$ be their e-coordinates and $\eta_P, \eta_Q, \eta_R$ their m-coordinates. Then the e-geodesic connecting Q and R is

$\theta(t) = (1-t)\theta_Q + t\theta_R,$  (64)

and its tangent vector is

$\dot\theta = \theta_R - \theta_Q.$  (65)

The m-geodesic connecting P and Q is

$\eta(t) = (1-t)\eta_P + t\eta_Q$  (66)

and its tangent vector is

$\dot\eta(t) = \eta_Q - \eta_P.$  (67)

Theorem 4 (Pythagorean theorem; Figure 3). When the m-geodesic connecting P and Q is orthogonal to the e-geodesic connecting Q and R,

$D[P : R] = D[P : Q] + D[Q : R],$  (68)

where D is the canonical divergence. Dually, when the e-geodesic connecting P and Q is orthogonal to the m-geodesic connecting Q and R,

$D^*[P : R] = D^*[P : Q] + D^*[Q : R].$  (69)
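In the self-dual Euclidean case $\psi(\theta) = \frac12\|\theta\|^2$ (a degenerate illustration of mine, not a statistical model), the canonical divergence is half the squared Euclidean distance, both families of geodesics are straight lines, and Theorem 4 reduces to the classical Pythagorean theorem.

```python
# Canonical (Bregman) divergence of psi(theta) = ||theta||^2 / 2:
# half the squared Euclidean distance between the two points.
def d(a, b):
    return 0.5 * sum((x - y) ** 2 for x, y in zip(a, b))

P, Q, R = (0.0, 1.0), (0.0, 0.0), (2.0, 0.0)
# The segment PQ is orthogonal to QR at Q, so D[P:R] = D[P:Q] + D[Q:R].
assert abs(d(P, R) - (d(P, Q) + d(Q, R))) < 1e-12
```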

The projection theorem follows directly from the Pythagorean theorem. Let S be a smooth submanifold of a dually flat manifold M, and let P be a point outside S. We search for the minimiser $\hat P \in S$ of D[P : Q], Q ∈ S, or dually the minimiser of $D[Q : P] = D^*[P : Q]$.

Theorem 5 (Projection theorem; Figure 4). The minimiser of D[P : Q], Q ∈ S, is obtained by m-projecting P to S, such that the m-geodesic connecting P and $\hat P$ is orthogonal to S.

Dually, the minimiser of $D^*[P : Q]$ is obtained by e-projecting P to S, such that the e-geodesic connecting P and $\hat P^*$ is orthogonal to S. The m-projection is unique when S is e-flat, and the e-projection is unique when S is m-flat. The proof is immediate from the Pythagorean theorem. The two theorems are useful in various applications (Amari, 2016).

Figure 3. Pythagorean theorem.

Figure 4. Projection theorem.

3.5 Statistical inference in a curved exponential family

Let $D = \{x_1, \ldots, x_n\}$ be n independently observed data from an exponential family. The MLE $\hat\eta$ is simply given by the arithmetic mean of the data,

$\hat\eta = \frac{1}{n} \sum_i x_i$  (70)

in terms of the dual parameter $\eta$. We call the point $\hat\eta \in M$ the observed point, given directly from the data D. The empirical distribution of D does not belong to M, but the observed point lies in M. The observed point converges to the true point as n tends to infinity.
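A simulation sketch (mine, using the Gaussian example of Section 3.1): the observed point (70) is the sample mean of the sufficient statistics $(x, x^2)$ of (38), from which $(\mu, \sigma^2)$ is recovered.

```python
import random

random.seed(1)
mu, sigma = 2.0, 1.5
xs = [random.gauss(mu, sigma) for _ in range(50_000)]

# Observed point (70): sample mean of the sufficient statistics x1 = x, x2 = x^2.
eta1 = sum(xs) / len(xs)
eta2 = sum(x * x for x in xs) / len(xs)

# Recover the familiar parameters: mu = eta1, sigma^2 = eta2 - eta1^2.
mu_hat, var_hat = eta1, eta2 - eta1 ** 2
assert abs(mu_hat - mu) < 0.05 and abs(var_hat - sigma ** 2) < 0.1
```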
We now consider statistical inference in a curved exponential family $q(x; u)$ specified by parameters $u = (u^a) = (u^1, \ldots, u^r)$, r < m. It is a submodel embedded in M, called an (m, r)-exponential family,

$S = \{q(x; u) \mid q(x; u) = \exp[\theta(u) \cdot x - \psi(\theta(u))]\}.$  (71)

Estimation is regarded as a projection of the observed point $\hat\eta$ to $S \subset M$. The MLE $\hat u$ is the maximiser of the log likelihood or, equivalently, the minimiser of the KL-divergence from $\hat\eta$ to S,

$\hat u = \arg\min_u D_{KL}[\hat\eta : \theta(u)].$  (72)

Because of the Pythagorean theorem, it is given by the m-projection of the observed point $\hat\eta$ to S. It is Fisher efficient, and the asymptotic variance of the MLE is given by

$\mathrm{Cov}[\hat u] = \frac{1}{n}\left(B^i_a g_{ij} B^j_b\right)^{-1},$  (73)

where the matrix B is

$B^i_a = \frac{\partial \theta^i(u)}{\partial u^a}.$  (74)

The matrix

$g_{ab} = B^i_a B^j_b g_{ij}$  (75)

is the Fisher information matrix of the model S.


Higher-order theory of statistical estimation was initiated by Rao (Rao, 1962; Rao et al., 1982). It is possible to study the higher-order asymptotics of an estimator in terms of geometry, where the e- and m-curvatures play a fundamental role (see, e.g., Amari, 1985; 2016; Amari & Nagaoka, 2000). The details are involved, so we state only the results.

Theorem 6. An estimator is efficient when it is given by the orthogonal projection of $\hat\eta$ to S. It is higher-order efficient when the orthogonal projection is m-flat. The higher-order asymptotic error covariance is decomposed into a sum of the m-curvature of the projection trajectory, the e-curvature of the model S, and a connection term.

For the higher-order asymptotic theory of statistical testing, see Kumon & Amari (1983) and Amari & Nagaoka (2000).


4 Geometry of Estimation Functions in Semiparametric Statistical Models: The Neyman–Scott Problem

4.1 Semiparametric model and nuisance parameter
A semiparametric model

$S = \{p(x; u)\}$  (76)

is specified by a parameter of interest u, but the form of p is unknown and has infinite degrees of freedom. Here, we assume that u is a scalar, but it is easy to generalise to the vector case.
A simple example is a location model,

$S = \{p(x - u)\},$  (77)

where p is an arbitrary smooth function having moments and satisfying

$\int p(x)\, dx = 1, \quad \int x\, p(x)\, dx = 0.$  (78)

We study here the mixture-type semiparametric model used to describe the Neyman–Scott problem. To this end, we define a regular model

$Q = \{q(x; u, \xi)\}$  (79)

having two types of parameters, u and $\xi$. Here, u is the parameter of interest and $\xi$ is the nuisance parameter, which is not of direct interest. Let $D = \{x_1, \ldots, x_n\}$ be n independent observations, where $x_i$ is an observation from $q(x; u, \xi_i)$. The parameter of interest u is common to all observations, but $\xi_i$ may differ each time. This model was proposed by Neyman & Scott (1948), who showed that the MLE is not necessarily consistent or efficient. Then, what is the best estimator in this situation in the asymptotic sense? The Neyman–Scott problem bothered statisticians for many years, until Bickel et al. (1994) and Amari & Kawanabe (1997) presented convincing theories.
It is convenient to assume that unknown values ξ i of the nuisance parameter are randomly
generated from an unknown probability distribution k(ξ). Then, each xi is regarded as an iid
random variable generated from a distribution belonging to the model

S M ¼ fpðx; u; kÞg; (80)

pfx; u; k g ¼ qðx; u; ξÞkðξÞdξ:



(81)

SM includes the nuisance parameter k(ξ) of function degrees of freedom and is called a
semiparametric model (Begun et al., 1983).
SM has function degrees of freedom, so we need to treat a function space of probability
distributions. We do not present a rigorous geometrical theory, which has not yet been
completed (Ay et al., 2017; Pistone & Sempi, 1995). Instead, we give an intuitive arguments,
without specifying the conditions under which our theory holds. But the theory is useful for
practical applications.
Let us consider the tangent space of S_M at point (u, k). Since a small deviation in log probability

w(x) = δ log p(x; u, k)     (82)

satisfies

E[w(x)] = 0,     (83)

we consider

T_{u,k} = {w(x) | E[w(x)] = 0},     (84)

where E denotes expectation with respect to p(x; u, k), as the tangent space of S_M at (u, k), provided

E[{w(x)}²] < ∞     (85)

is satisfied. The tangent vectors w(x) then form a Hilbert space. The inner product of two tangent vectors w1(x) and w2(x) is given by

⟨w1(x), w2(x)⟩ = E[w1(x) w2(x)].     (86)

It includes a one-dimensional subspace

T_u = { (d/du) log p(x; u, k) },     (87)

which represents a change in the direction of u. It also includes

T_k = { (d/dt) log p(x; u, c(t)) },     (88)

where k = c(t) is a curve in S_M passing through (u, k) at t = 0. This is the nuisance tangent space, of infinite dimensions. The above two do not cover the entire T_{u,k}; it also includes other vectors a(x) which are orthogonal to both T_u and T_k. They form a subspace T_a, called the auxiliary tangent subspace. The tangent space is decomposed into a direct sum,

T_{u,k} = T_u ⊕ T_k ⊕ T_a,     (89)

where T_a is orthogonal to T_u ⊕ T_k.

4.2 Estimating functions

A function f(x, u) is called an estimating function (Godambe, 1991) when it satisfies

E_{u,ξ}[f(x, u)] = 0,     (90)

A = E_{u,ξ}[−(d/du) f(x, u)] > 0,     (91)

where E_{u,ξ} denotes expectation with respect to q(x; u, ξ), for any ξ. Note that (90) and (91) then hold for expectation with respect to any p(x; u, k), which is a linear combination of the q(x; u, ξ). An estimator û is obtained from

Σ_i f(x_i, u) = 0,     (92)

in which the expectation in (90) is replaced by the arithmetic sum. It is easy to prove that the estimator is consistent and asymptotically Gaussian with variance

V[û] = (1/n) E[{f(x, u)}²] / A²,     (93)

in a similar way to the proof that the MLE is asymptotically Gaussian (Lindsay, 1982).
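A quick Monte Carlo sketch (not from the paper; the estimating function f(x, u) = tanh(x − u), the Gaussian data and all constants are illustrative assumptions) can check the variance formula (93) for an estimator solving (92):

```python
# Hedged numerical sketch: solve the estimating equation sum_i f(x_i, u) = 0
# for the illustrative choice f(x, u) = tanh(x - u), then compare the Monte
# Carlo variance of the estimator with the asymptotic formula (93),
# V[u_hat] = E[f^2] / (n A^2), where A = E[-df/du].
import numpy as np

rng = np.random.default_rng(0)
u0, n, trials = 1.0, 400, 2000

def solve(x):
    """Bisection on the monotone estimating equation sum_i tanh(x_i - u) = 0."""
    lo, hi = x.min(), x.max()
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if np.sum(np.tanh(x - mid)) > 0:
            lo = mid          # the root lies above mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

est = np.array([solve(rng.normal(u0, 1.0, n)) for _ in range(trials)])

# E[f^2] and A = E[-df/du] = E[sech^2(x - u0)], estimated on a large sample
z = rng.normal(u0, 1.0, 1_000_000)
Ef2 = np.mean(np.tanh(z - u0) ** 2)
A = np.mean(np.cosh(z - u0) ** -2)

print(n * est.var(), Ef2 / A ** 2)
```

The two printed numbers, the scaled empirical variance n·V[û] and the theoretical value E[f²]/A², should agree to within Monte Carlo error.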
We need to show when an estimating function exists, and which is better when there are many. In order to characterise estimating functions, we use parallel transports of tangent vectors. The e- and m-transports of a tangent vector w(x) from (u, k1) to (u, k2) are defined, respectively, by

Π^e_{k1→k2} w(x) = w(x) − E_{u,k2}[w(x)],     (94)

Π^m_{k1→k2} w(x) = {p(x; u, k1)/p(x; u, k2)} w(x).     (95)

It is easy to see that the parallel transports keep the inner product invariant in the following sense:

⟨w1(x), w2(x)⟩_{u,k1} = ⟨Π^e_{k1→k2} w1(x), Π^m_{k1→k2} w2(x)⟩_{u,k2}.     (96)

Theorem 7. An estimating function f(x, u) is orthogonal to the nuisance tangent space,

⟨f(x, u), v(x; u, ξ)⟩ = 0,     (97)

v(x; u, ξ) = (∂/∂ξ) log q(x; u, ξ),     (98)

and is invariant under the e-parallel transport from any (u, k1) to (u, k2):

Π^e_{k1→k2} f(x, u) = f(x, u).     (99)

We omit the proof, because it is easily obtained from the definition.


Let

s(x; u, k) = (d/du) log p(x; u, k)     (100)

be the score vector, and let s_I(x; u, k) be its projection to the subspace orthogonal to T_k. We call it the information score. We then have the following theorem.

Theorem 8. An estimating function is given by the following sum:

f(x, u) = s_I(x, u) + a(x),  a(x) ∈ T_a.     (101)

This proves that an estimating function exists when the information score is non-vanishing. The optimal estimating function is s_I(x; u, k), which we do not know since k is unknown. However, this gives a good means of selecting an estimating function: even when we choose k incorrectly, it is still an estimating function. When the true nuisance function is k0, the optimal estimating function is s_I(x; u, k0). An estimating function using an incorrect k is written as

s_I(x; u, k) = s_I(x; u, k0) + a(x)     (102)

for some auxiliary a(x), and it still gives a consistent estimator. This is not true when we use the statistical model p(x; u, k) for estimating u with an incorrect k.

4.3 Estimation of the rate of proportionality in a linear model

We focus on a specific example of the Neyman–Scott problem, although other problems can be analysed similarly. Let u be the parameter of interest, representing the ratio of weight to volume of a sample material,

u = w/v,     (103)

where v is the volume and w the weight of a specimen. We assume we have observations (xi, yi), where the xi are noisy observations of the volumes and the yi are noisy observations of the weights of various specimens. They are given by

xi = ξi + εi,     (104)

yi = u ξi + εi′,     (105)

where the εi and εi′ are independently subject to a Gaussian distribution with mean 0 and variance σ².

The joint distribution of x = (x, y) is written as

q(x, y; u, ξ) = (1/(2πσ²)) exp{ −(1/(2σ²)) [(x − ξ)² + (y − uξ)²] }.     (106)

We rewrite (106) as

q(x, y; u, ξ) = exp{ ξ s(x, y; u) + r(x, y) − ψ(u, ξ) },     (107)

where we put

s(x, y; u) = x + uy,     (108)

r(x, y) = −(x² + y²)/2.     (109)

Since ξ is a random variable subject to k(ξ) in S_M, we can study the conditional distribution of ξ given s in (106). It is given by

p(ξ | s) = k(ξ) exp{ξ s + r − ψ} / p(x; u, k).     (110)

We have

E[ξ | s] = h(uy + x),     (111)

where h is a function depending on k. Using this, we obtain an explicit form of the nuisance tangent space T_k. It is spanned by s, irrespective of k:

T_k = {s(x, y; u)}.     (112)

Hence, it is invariant under the parallel transport from k1 to k2. We further calculate the information score, obtaining

s_I(x, y; u) = (y − ux) E[ξ | s].     (113)

This proves that an estimating function is of the form

f(x, y; u) = (y − ux) h(uy + x),     (114)

where h(z) is an arbitrary function. The best h depends on the unknown k.
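A small simulation (an illustrative sketch; the distributions of ξ and the functions h below are arbitrary assumptions) confirms the key property of (114): its expectation vanishes at the true u for any h and any k(ξ), which is exactly what makes it an estimating function.

```python
# Hedged sketch: check numerically that f(x, y; u) = (y - u x) h(u y + x)
# has mean zero at the true u for arbitrary h and arbitrary k(xi).
# The distributions and the h's below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
u0, sigma, n = 2.0, 0.5, 500_000

def f_mean(h, xi):
    x = xi + rng.normal(0.0, sigma, n)        # noisy volume, eq. (104)
    y = u0 * xi + rng.normal(0.0, sigma, n)   # noisy weight, eq. (105)
    return np.mean((y - u0 * x) * h(u0 * y + x))

vals = []
for xi in (rng.normal(3.0, 1.0, n), rng.exponential(2.0, n)):
    for h in (lambda z: z, lambda z: np.cos(z), lambda z: np.ones_like(z)):
        vals.append(f_mean(h, xi))
print(vals)  # every entry is close to 0
```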
There are a number of estimators for this specific problem. We show some of them in the following:

1) The least squares estimator, which minimises

L = (1/2) Σ (yi − u xi)²     (115)

and is simply given by

û = Σ yi xi / Σ xi².     (116)

2) The MLE, or total least squares estimator. This is the estimator maximising the log likelihood of all observations with respect to all the unknown parameters u and ξ1, …, ξn. It minimises the sum of squares of the lengths of the lines that project the observed (xi, yi) onto the regression line

y = ux.     (117)

The solution is given by solving

Σ (yi − u xi)(u yi + xi) = 0.     (118)

3) The ratio of the total weight to the total volume, simply given by

û = Σ yi / Σ xi.     (119)

The least squares solution is not good in this case, since it is not even consistent. Both the MLE and the ratio estimator are consistent. Which is better? It depends on the distribution k(ξ). Roughly speaking, when the average of k(ξ) is much larger than its standard deviation, the MLE is better; when it is smaller, the ratio estimator is better.

Amari & Kawanabe (1997) proposed an estimator whose estimating function includes a parameter c,

f(x, y; u) = (y − ux)(uy + x + c).     (120)

When c = 0, it gives the MLE. When c → ∞, it gives the ratio estimator. They proposed a method of determining c from the observed data.
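The comparison above can be illustrated numerically (a hedged sketch; the true value u = 2, the noise level and the distribution of the ξi are arbitrary choices): the least squares estimator (116) is inconsistent, shrinking towards zero, while the total least squares root of (118) and the ratio estimator (119) both recover u.

```python
# Hedged simulation sketch of the three estimators (116), (118), (119).
# All parameter values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(2)
u0, sigma, n = 2.0, 1.0, 200_000
xi = rng.normal(1.0, 0.3, n)               # unknown nuisance values
x = xi + rng.normal(0.0, sigma, n)
y = u0 * xi + rng.normal(0.0, sigma, n)

u_ls = np.sum(y * x) / np.sum(x * x)       # least squares, eq. (116)
u_ratio = np.sum(y) / np.sum(x)            # ratio estimator, eq. (119)

# (118) expands to a quadratic in u: -Sxy u^2 + (Syy - Sxx) u + Sxy = 0;
# take the root with the sign of Sxy.
Sxy, Sxx, Syy = np.sum(x * y), np.sum(x * x), np.sum(y * y)
B = Syy - Sxx
u_tls = (B + np.sqrt(B * B + 4.0 * Sxy ** 2)) / (2.0 * Sxy)

print(u_ls, u_tls, u_ratio)  # least squares is badly biased; the others are not
```

With these settings the least squares estimate converges to u0·E[ξ²]/(E[ξ²] + σ²) ≈ 1.04 rather than 2, while the other two estimates come out close to 2.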

5 Hierarchical Model and Higher-Order Interactions

Many statistical models have hierarchical structures, in which lower-order submodels are included in higher-order submodels. Markov chains of various orders are such models: a k-th order submodel is included in the (k + 1)-th order submodel. An autoregression (AR) model in time series is another example, in which a k-th order submodel is included in the (k + 1)-th order submodel.

We study the hierarchical structure of an exponential family S_n consisting of

p(x, θ) = exp{ Σ_{i=1}^{n} θ^(i) · x^(i) − ψ(θ^(1), …, θ^(n)) },     (121)

where x = (x^(1), …, x^(n)) and θ = (θ^(1), …, θ^(n)). The corresponding η-coordinates are η = (η^(1), …, η^(n)). Let S_k be the submodel defined by

S_k = { p(x, θ) | θ^(k+1) = … = θ^(n) = 0 }.     (122)

Then we have a hierarchical structure S_1 ⊂ S_2 ⊂ … ⊂ S_n.


   
We divide the θ-coordinates into two parts, θ = (Θ_k, Θ_{k+1}), Θ_k = (θ^(1), …, θ^(k)), Θ_{k+1} = (θ^(k+1), …, θ^(n)); this is called a k-cut. We also divide the η-coordinates as η = (H_k, H_{k+1}), H_k = (η^(1), …, η^(k)), H_{k+1} = (η^(k+1), …, η^(n)). Since θ^i is orthogonal to η^j for j ≠ i (more precisely, the tangent vector e_i along the coordinate θ^i is orthogonal to the tangent vector e^j along the coordinate η^j for j ≠ i), the parameters Θ_{k+1} are orthogonal to the parameters H_k.

It is convenient to define a new coordinate system, Ξ = (H_k, Θ_{k+1}), which consists of the lower part of the η-coordinates and the higher part of the θ-coordinates. We call it a mixed coordinate system (Amari, 2001). Let us consider the submanifold specified by

H_k = (c^(1), …, c^(k))     (123)

for fixed vectors c^(1), …, c^(k). Complementarily, we consider the submanifold specified by Θ_{k+1} = (d^(k+1), …, d^(n)) for fixed vectors d^(k+1), …, d^(n). These two submanifolds are orthogonal to each other for any c and d. Note that the two submanifolds obtained by fixing H_k and H_{k+1}, or by fixing Θ_k and Θ_{k+1}, are not orthogonal.
We consider a simple binary model for x = (x1, …, xn), each xi taking the values 0 or 1. Given an observed contingency table T(x1, x2, …, xn), we are interested in how the xi are mutually interrelated in the probability distribution p(x1, …, xn). There are pairwise interactions, three-way interactions and further higher-order interactions among the variables x1, …, xn. How can we quantify these interactions? The mixed coordinates and the k-cuts are useful for defining degrees of interaction among random variables. We show that the interaction terms can be decomposed orthogonally into pairwise, triplewise and higher-order ones.

We use a neural network of binary neurons as a typical example for an intuitive explanation. The network consists of n connected neurons interacting with one another. Each neuron takes two states, excited or quiescent. Let x1, …, xn be n binary random variables, where xi = 1 represents that the i-th neuron is excited and xi = 0 that it is quiescent. The current state x = (x1, …, xn) is regarded as a vector random variable, and its joint probability distribution p(x1, …, xn) is written in the form of an exponential family,

p(x, θ) = exp{ Σ θ_i x_i + Σ θ_ij x_i x_j + ⋯ + θ_{12⋯n} x_1 ⋯ x_n − ψ(θ) }.     (124)

Its e-coordinates are organised as θ = (Θ_1, Θ_2, …, Θ_n), where the subvector Θ_k = (θ_{i1…ik}; i1 < … < ik) summarises the k-th order terms.

The corresponding m-coordinates are H_k = (η_{i1…ik}), where η_{i1…ik} = E[x_{i1} … x_{ik}]. η is decomposed as

η = (H_1, …, H_n).     (125)

The subvector H_k represents the probabilities of k neurons jointly firing, that is, the probabilities that x_{i1} x_{i2} … x_{ik} = 1. It is important to see that the coordinate axes of H_k are orthogonal to those of Θ_{k′} when k′ > k. The submodel

S_k = { θ | Θ_{k+1} = … = Θ_n = 0 }     (126)

is called the k-th order model. We have a hierarchical structure S_0 ⊂ S_1 ⊂ … ⊂ S_n, where S_0 is a single point specified by θ = 0.


We begin with a simple model of two neurons, n = 2. The probability distribution of x is

p(x, θ) = exp{ θ1 x1 + θ2 x2 + θ12 x1 x2 − ψ(θ) }.     (127)

When θ12 = 0, the two neurons fire independently, p(x) = p1(x1) p2(x2), and there is no mutual interaction between the two. When the two neurons are not independent, there is mutual interaction, and we want to know its degree. The covariance σ² = E[x1 x2] − E[x1] E[x2] is 0 when the two are independent, so it shows a degree of mutual interaction. However, there are many such quantities that vanish when the two neurons fire independently. The increase in log likelihood due to an increase in the firing rates is correlated with the increase in log likelihood due to the covariance, so the firing rates and the covariance are not separated. Geometrically speaking, we want a quantity such that the direction due to an increase in interaction is orthogonal to the directions due to increases in the firing rates. Such a quantity represents a degree of interaction that does not change even when the firing rates change. The covariance is not such a quantity.

We know that the e-coordinate θ12 is orthogonal to the firing rates η1 and η2, so it represents the interaction orthogonally to the firing rates. As is seen in Figure 5, S1, defined by θ12 = 0, is an e-flat submanifold consisting of all the independent distributions.

Given a general distribution p, we m-project it to S1. We then obtain a distribution in S1 which has the same firing rates but no interaction. The submanifold M1 defined by η1 = c1, η2 = c2, for constants c1 and c2, is m-flat, consisting of the distributions having the same firing rates c1 and c2 but different degrees of interaction θ12. The two submanifolds intersect orthogonally at ηi = ci, θ12 = 0.

Let p0 be the distribution specified by θ = 0, that is, η1 = η2 = 1/2 and θ12 = 0. The KL-divergence from p to p0 is decomposed as

D_KL[p : p0] = D_KL[p : p̂] + D_KL[p̂ : p0],  p̂ ∈ S1 ∩ M1.     (128)

The first term represents the effect of interaction, shown by the KL-divergence from p to the independent model p̂. The second shows the effect of the deviations of the firing rates from 1/2, that is, the KL-divergence from p̂ to p0.
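For a concrete two-neuron example (the joint probabilities below are made up), the decomposition (128) can be checked exactly: p̂ is the product of the marginals of p, that is, the m-projection of p to S1, and p0 is the uniform distribution.

```python
# Hedged check of the Pythagorean decomposition (128) for two binary
# variables; the joint distribution p below is a made-up example.
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

# joint over (x1, x2); rows index x1, columns index x2
p = np.array([[0.30, 0.20],
              [0.10, 0.40]])
theta12 = np.log(p[1, 1] * p[0, 0] / (p[1, 0] * p[0, 1]))  # interaction coordinate

p_hat = np.outer(p.sum(axis=1), p.sum(axis=0))  # m-projection to S1 (independent)
p0 = np.full((2, 2), 0.25)                      # the distribution with theta = 0

lhs = kl(p, p0)
rhs = kl(p, p_hat) + kl(p_hat, p0)
print(theta12, lhs, rhs)  # lhs and rhs coincide
```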

Figure 5. Orthogonal decomposition.


We next consider the case of n = 3 neurons. It has a hierarchical structure S_1 ⊂ S_2 ⊂ S_3. Here S_1 is the independent model,

S_1 = { p(x) | p(x) = p_1(x_1) p_2(x_2) p_3(x_3) },     (129)

and the η_i show the firing rates of the neurons. Since the directions of θ_ij and θ_123 are orthogonal to the directions of the η_i, they together represent degrees of neural interaction orthogonal to the firing rates. Further, consider the submanifold M_2 = { η | η_i = c_i, η_ij = c_ij } for fixed c_i and c_ij. The pair (η_i, η_ij) represents both the firing rates and the pairwise joint firing rates, but does not specify the triple firing rate η_123 = E[x_1 x_2 x_3]. The quantity that is orthogonal to both the firing rates and the pairwise firing rates is θ_123. So

θ_123 = log { (p_111 p_100 p_010 p_001) / (p_110 p_101 p_011 p_000) }     (130)

represents the triplewise interaction, orthogonal to the firing rates and pairwise firing rates.

A distribution p is m-projected to S_2 and S_1, and the KL-divergence is decomposed as

D_KL[p : p_0] = D_KL[p : p̂_2] + D_KL[p̂_2 : p̂_1] + D_KL[p̂_1 : p̂_0]     (131)

by the Pythagorean theorem.
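Formula (130) is an alternating sum of log-probabilities over the vertices of the unit cube, and it extracts exactly the coefficient of x1 x2 x3 in log p. A short check (the lower-order coefficients below are arbitrary illustrative values): for a distribution built with second-order terms only, (130) gives 0, and adding a triple term with coefficient 0.7 to log p returns exactly 0.7.

```python
# Hedged sketch verifying eq. (130): the alternating sum of log-probabilities
# over the cube vertices recovers the triple-interaction coefficient.
# The coefficients used to build logp are arbitrary illustrative choices.
import numpy as np

def theta123(p):
    num = p[1, 1, 1] * p[1, 0, 0] * p[0, 1, 0] * p[0, 0, 1]
    den = p[1, 1, 0] * p[1, 0, 1] * p[0, 1, 1] * p[0, 0, 0]
    return np.log(num / den)

def make_p(t123):
    states = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
    logp = np.array([0.3*a - 0.1*b + 0.2*c + 0.5*a*b - 0.2*b*c + 0.4*a*c
                     + t123*a*b*c for (a, b, c) in states])
    p = np.exp(logp)
    return (p / p.sum()).reshape(2, 2, 2)   # p[x1, x2, x3]

print(theta123(make_p(0.0)), theta123(make_p(0.7)))
```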


In a general model of n neurons, we consider the mixed coordinates of the k-cut,

ξ_k = (η_i, …, η_{i1…ik}; θ_{i1…i(k+1)}, …, θ_{1…n}).     (132)

H_k represents the joint firing rates up to order k, and Θ_{k+1} = (θ_{i1…i(k+1)}, …, θ_{1…n}) represents the degrees of mutual interaction of orders higher than k, orthogonal to H_k. Therefore, the θ_{i1…i(k+1)} represent the degrees of (k + 1)-th order interactions. The KL-divergence from p to the origin is decomposed as

D_KL[p : p_0] = Σ_{k=0}^{n−1} D_KL[p̂_{k+1} : p̂_k],     (133)

where p̂_n = p, p̂_k is the projection of p to S_k and p̂_0 = p_0. This shows that D_KL[p̂_{k+1} : p̂_k] represents the effect of the (k + 1)-th order interactions. We have shown only binary models, but the theory generalises to any hierarchical dually flat model.
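The chain (133) can be verified numerically for n = 3 binary variables. In the following sketch (the distribution p is an arbitrary choice), p̂2 is computed by iterative proportional fitting, which matches all pairwise marginals of p while introducing no triple interaction, and p̂1 is the product of the single-variable marginals.

```python
# Hedged sketch of the Pythagorean chain (133) for n = 3 binary variables.
# p_hat2 (m-projection to S2) is obtained by iterative proportional fitting;
# p_hat1 is the independent model; p0 is uniform. The input p is arbitrary.
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def project_S2(p, iters=2000):
    """Match all pairwise marginals of p, starting from the uniform
    distribution so that no triple interaction is ever introduced."""
    q = np.full((2, 2, 2), 1.0 / 8.0)
    for _ in range(iters):
        for ax in (2, 1, 0):          # sum out x3, x2, x1 in turn
            ratio = p.sum(axis=ax) / q.sum(axis=ax)
            q = q * np.expand_dims(ratio, ax)
    return q

p = np.array([.02, .05, .10, .08, .15, .10, .20, .30]).reshape(2, 2, 2)
p0 = np.full((2, 2, 2), 1.0 / 8.0)

p_hat2 = project_S2(p)
m1, m2, m3 = p.sum((1, 2)), p.sum((0, 2)), p.sum((0, 1))
p_hat1 = np.einsum('i,j,k->ijk', m1, m2, m3)

lhs = kl(p, p0)
rhs = kl(p, p_hat2) + kl(p_hat2, p_hat1) + kl(p_hat1, p0)
print(lhs, rhs)   # equal up to the IPF convergence error
```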

6 Conclusions

Since Rao's proposal three quarters of a century ago, information geometry, the geometrical theory of a manifold of probability distributions, has been developed widely, giving useful tools for various fields related to probability. This paper has reviewed only some of the interesting structures to which information geometry makes remarkable contributions, such as dually flat structures, and has discussed some applications.

There are new developments in this area. Wong (2018) gives a new mathematical idea, which generalises the Legendre transformation, and presents a new theory applicable to projectively flat manifolds. The Pythagorean and projection theorems hold, where the Rényi divergence plays the role of the canonical divergence in a dually projectively flat manifold.

The Wasserstein distance gives another type of divergence in a manifold of probability distributions and has broad applicability. The distance is responsible for the metric structure of the base space on which the distributions are defined. However, it is not invariant. It is interesting to find a theory connecting the two (Amari et al., 2019). Li & Zhou (2019) proposed a fundamental theory unifying the two geometries. See Amari & Matsuda (2021) for more detail on Wasserstein statistics. Amari (2016) discusses various applications.

References
Amari, S. 1982. Differential geometry of curved exponential families—curvature and information loss. Ann. Stat., 10,
357–385.
Amari, S. 1985. Differential-Geometrical Methods in Statistics, Lecture Notes in Statistics, Vol. 28. Springer.
Amari, S. 2001. Information geometry on hierarchy of probability distributions. IEEE Trans. Inf. Theory, 47,
1701–1711.
Amari, S. 2016. Information Geometry and Its Applications. Springer.
Amari, S., Karakida, R., Oizumi, M. & Cuturi, M. 2019. Information geometry for regularized optimal transport and
barycenters of patterns. Neural Comput., 31, 827–848.
Amari, S. & Kawanabe, M. 1997. Information geometry of estimating functions in semi-parametric statistical models.
Bernoulli, 3, 29–54.
Amari, S. & Matsuda, T. 2021. Wasserstein statistics in 1D location-scale model. Annals of Institute of Statistical
Mathematics.
Amari, S. & Nagaoka, H. 2000. Methods of information geometry. American Mathematical Society and Oxford
University Press.
Ay, N., Jost, J., Lé, H.V. & Schwanchhöfer, L. 2017. Information Geometry. Springer.
Bai, Z.D., Rao, C.R. & Wu, Y. 1992. M-estimation of multivariate linear regression parameters under a convex discrepancy function. Stat. Sin., 2, 237–264.
Banerjee, A., Merugu, S., Dhillon, I. & Ghosh, J. 2005. Clustering with Bregman divergences. J. Mach. Learn. Res.,
6, 1705–1749.
Barndorff-Nielsen, O.E. 1978. Information and Exponential Families in Statistical Theory. Wiley.
Begun, J.M., Hall, W.J., Huang, W.M. & Wellner, J.A. 1983. Information and asymptotic efficiency in
parametric-nonparametric models. Ann. Stat., 11, 432–452.
Bickel, P.J., Ritov, C.A.J. & Wellner, J.A. 1994. Efficient and Adaptive Estimation for Semiparametric Models. Johns
Hopkins University Press.
Burbea, J. & Rao, C.R. 1982. Entropy differential metric, distance and divergence measures in probability spaces: a
unified approach. J. Multivar. Anal., 12, 575–596.
Chentsov, N.N. 1982. Statistical Decision Rules and Optimal Inference. American Mathematical Society (Russian original: Nauka, 1972).
Cramér, H. 1946. Mathematical Methods of Statistics. Princeton University Press.
Csiszár, I. 1974. Information measures: a critical survey. Proc. 7th Conf. Inf. Theory, Prague, Czech Republic, pp. 83–86.
Dawid, A.P. 1975. Discussion of Efron (1975).
Efron, B. 1975. Defining the curvature of a statistical problem (with application to second order efficiency). Ann. Stat.,
3, 1189–1242.
Eguchi, S. 1983. Second order efficiency of minimum contrast estimators in a curved exponential family. Ann. Stat.,
11, 793–803.
Godambe, V.P. 1991. Estimating Functions. Oxford University Press.
Kumon, M. & Amari, S. 1983. Geometrical theory of higher-order asymptotics of test, interval estimator and conditional inference. Proc. R. Soc. Lond. A, 387, 429–458.
Li, W. & Zhou, W. 2019. Wasserstein information matrix. arXiv.
Lindsay, B. 1982. Conditional score functions: some optimality results. Biometrika, 69, 503–512.
Morimoto, T. 1963. Markov processes and the H-theorem. J. Phys. Soc Jap., 12, 328–331.
Nagaoka, H. & Amari, S. 1982. Differential geometry of smooth families of probability distributions. METR 82-7, University of Tokyo.

Neyman, J. & Scott, E.L. 1948. Consistent estimates based on partially consistent observations. Econometrica, 16, 1–32.
Pistone, G. & Sempi, C. 1995. An infinite-dimensional geometric structure on the space of all the probability
measures equivalent to a given one. Ann. Stat., 23, 1543–1561.
Rao, C.R. 1945. Information and accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math.
Soc., 37, 81–91.
Rao, C.R. 1962. Efficient estimators and optimum inference in large samples. J. R. Stat. Soc. B., 24, 46–72.
Rao, C.R., Sinha, B.K. & Subramanyam, K. 1982. Third order efficiency of the maximum likelihood estimator in the
multinomial distributions. Stat. Decisions, 1, 1–16.
Wong, T.-K.L. 2018. Logarithmic divergence from optimal transport and Renyi geometry. Inf. Geom., 1, 39–78.

[Received October 2020; accepted June 2021]
