Information Geometry
Shun-ichi Amari1,2
1 ACRO, Teikyo University, Tokyo, Japan
2 RIKEN Center for Brain Science, Saitama, Japan
E-mail: [email protected]
Summary
Statistical inference is constructed upon a statistical model consisting of a parameterised family
of probability distributions, which forms a manifold. It is important to study the geometry of the
manifold. It was Professor C. R. Rao who initiated information geometry in his monumental paper
published in 1945. It not only included fundamentals of statistical inference such as the
Cramér–Rao theorem and Rao–Blackwell theorem but also proposed differential geometry of a
manifold of probability distributions. It is a Riemannian manifold where Fisher–Rao information
plays the role of the metric tensor. It took decades for the importance of the geometrical structure
to be recognised. The present article reviews the structure of the manifold of probability distributions and its applications, and shows how Professor Rao's original idea has been developed and popularised across the statistical sciences in a wide sense, including AI, signal processing, the physical sciences and others.
Key words: information geometry; Fisher–Rao information; dual affine connections; generalised
Pythagorean theorem.
1 Introduction
Information geometry studies the structure of a regular statistical model, which forms a
manifold M. It consists of probability distributions parameterised by an m-dimensional vector,
where the parameters constitute a coordinate system. It is natural to ask how different two distributions in M are, that is, to seek a distance or divergence between two distributions.
It was Professor Rao’s monumental paper (Rao, 1945) that answered the question. He
proposed a fundamental theory to show that M is a Riemannian manifold, where the Fisher
information matrix plays the role of the Riemannian metric. At the same time, Rao presented
a fundamental theorem in statistics, which was later called the Cramér–Rao theorem, because
it was also independently proved by Cramér (1946). Rao calculated the Riemannian distance
between two Gaussian distributions with different means and variances. The present review summarises information geometry, initiated by C. R. Rao, and shows how his idea has been developed.
The geometrical approach was so fundamental that it took decades for researchers to develop
his idea. Chentsov (1982) followed the idea and answered the question of why the Fisher
information should be used, by proposing an invariance criterion. The Fisher information matrix
which Rao used is unique from the point of view of invariance under Markovian morphisms of
statistical models. He further showed that a third-order symmetric tensor exists and is also
unique. These two tensors together define invariant affine connections to be introduced in M.
The score functions of the model are
$s_i(x; \xi) = \partial_i \log p(x; \xi),$ (2)
where $\partial_i = \frac{\partial}{\partial \xi^i}$.
This is a fundamental quantity in statistics. For n independent observations $x_1, \ldots, x_n$, the MLE $\hat{\xi}$ is the maximiser of the log likelihood and is obtained as the solution of the likelihood equations
$\sum_{j=1}^{n} s_i(x_j; \xi) = 0, \quad i = 1, \ldots, m.$ (3)
The Fisher information matrix is
$g_{ij}(\xi) = E\left[ s_i(x; \xi)\, s_j(x; \xi) \right],$ (4)
where E is the expectation with respect to $p(x; \xi)$. The score is a vector representing how the log likelihood changes as ξ changes from ξ to ξ + dξ,
$d \log p(x; \xi) = \sum s_i(x; \xi)\, d\xi^i.$ (5)
A component $s_i(x; \xi)$ of the score is regarded as the tangent vector $e_i$ of M in the direction of the coordinate curve $\xi^i$; see Figure 1.
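As a concrete illustration, here is a minimal numerical sketch of my own (not taken from the paper) for the Gaussian model with coordinates ξ = (μ, σ): the score has zero expectation, and its second moments recover the Fisher information matrix, which for this model is diag(1/σ², 2/σ²).

```python
# Minimal sketch (not from the paper): scores of the Gaussian model and
# their second moments, which estimate the Fisher information matrix (4).
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 2.0

def score(x, mu, sigma):
    """s_i(x; xi) = d log p(x; xi) / d xi_i for xi = (mu, sigma)."""
    s_mu = (x - mu) / sigma**2
    s_sigma = ((x - mu)**2 - sigma**2) / sigma**3
    return np.stack([s_mu, s_sigma])

x = rng.normal(mu, sigma, size=1_000_000)
s = score(x, mu, sigma)
print("E[s]     ~", s.mean(axis=1))      # ~ [0, 0]: the score property
print("E[s s^T] ~", (s @ s.T) / x.size)  # ~ diag(1/sigma^2, 2/sigma^2)
```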
This implies that the geometrical tangent vector
$e_i = \frac{\partial}{\partial \xi^i}$ (6)
is represented by the score $s_i(x; \xi)$, which is a random variable. The tangent vector $d\xi = (d\xi^i)$ is the score
$d\xi = \sum d\xi^i e_i = \sum d\xi^i s_i(x; \xi).$ (7)
When we identify a small change dξ with a small increment of the corresponding log likelihood,
$d \log p(x; \xi) = \sum d\xi^i e_i,$ (8)
the tangent space of M is represented by a space of random variables, the scores.
Consider a statistic t(x) for which the density factorises as
$p(x; \xi) = p_1(t(x); \xi)\, p_2(x),$ (9)
where $p_2$ does not depend on ξ. Intuitively, t(x) is sufficient for estimating ξ, because the latter term does not include ξ. We consider another statistical model M′ consisting of the distributions of t, where, by an abuse of notation, we let $p(t; \xi)$ be the probability density of t given ξ. Since the score vector in M′ is equal to that in M, M and M′ are geometrically the same. In particular, a reversible transformation of x to y, $y = f(x)$, is a sufficient statistic. This implies that the geometrical structures g and T do not depend on the representation of the random variable, x or f(x).
We further require that the geometrical structure should be constructed in such a way that it is
invariant under the transformation of x to its sufficient statistics. Chentsov (1982) required this
by using Markov morphisms in the discrete case and proved a fundamental theorem that the
invariance defines second-order and third-order symmetric tensors uniquely, which are g and
T given in (10) and (12) up to a common scale.
The manifold equipped with the Fisher information Riemannian metric is not Euclidean, but
curved in general. This is shown by calculating the Riemann–Christoffel curvature, where the
Levi-Civita affine connection
$\Gamma^0_{ijk} = [ij, k] = \frac{1}{2}\left( \partial_i g_{jk} + \partial_j g_{ik} - \partial_k g_{ij} \right)$ (15)
calculated from $g_{ij}$ is used, as is the usual way in Riemannian geometry. However, by using T,
we modify it to give two new invariant affine connections Γ and Γ∗,
$\Gamma_{ijk} = [ij, k] - \frac{1}{2} T_{ijk},$ (16)
$\Gamma^*_{ijk} = [ij, k] + \frac{1}{2} T_{ijk}.$ (17)
A geodesic with respect to Γ*, for example, is a solution of
$\frac{d^2 \xi^i(t)}{dt^2} + \sum \Gamma^{*i}_{jk} \frac{d\xi^j}{dt} \frac{d\xi^k}{dt} = 0.$ (19)
Let
$X = \sum X^i e_i$ (20)
be a tangent vector defined at ξ. The parallel transport of a tangent vector $X = \sum X^i e_i$ along a curve
$c: \xi^i(t)$ (21)
is defined by the connection Γ, and the dual parallel transport X*(t) by Γ*.
When the inner product of X ðtÞ and X ∗ ðtÞ does not change by the two parallel transports of X,
the two affine connections are said to be dually coupled. It is straightforward to prove the
following theorem (Amari & Nagaoka, 2000).
Theorem 1. The two affine connections Γ and Γ ∗ are dually coupled.
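The duality of Theorem 1 can be checked symbolically. The following sketch is my own (the trinomial model is an assumed example, not one used in the paper): it computes g, the cubic tensor T, the two connections (16)-(17), and verifies one standard form of dual coupling, $\partial_k g_{ij} = \Gamma_{ki,j} + \Gamma^*_{kj,i}$.

```python
# Symbolic sketch (assumption: trinomial model p = (xi1, xi2, 1-xi1-xi2)):
# compute g_ij, T_ijk = E[d_i l d_j l d_k l], the dual connections
# (16)-(17), and check d_k g_ij = Gamma_{ki,j} + Gamma*_{kj,i}.
import sympy as sp

xi = sp.symbols('xi1 xi2', positive=True)
p = [xi[0], xi[1], 1 - xi[0] - xi[1]]                     # 3 outcome probabilities
dl = [[sp.diff(sp.log(px), v) for v in xi] for px in p]   # scores per outcome

E = lambda f: sum(px * f(k) for k, px in enumerate(p))    # expectation
g = [[E(lambda k: dl[k][i]*dl[k][j]) for j in range(2)] for i in range(2)]
T = [[[E(lambda k: dl[k][i]*dl[k][j]*dl[k][m]) for m in range(2)]
      for j in range(2)] for i in range(2)]

def LC(i, j, k):                                          # [ij, k] of eq. (15)
    return sp.Rational(1, 2)*(sp.diff(g[j][k], xi[i])
           + sp.diff(g[i][k], xi[j]) - sp.diff(g[i][j], xi[k]))

Gam  = lambda i, j, k: LC(i, j, k) - T[i][j][k]/2         # eq. (16)
GamD = lambda i, j, k: LC(i, j, k) + T[i][j][k]/2         # eq. (17)

for i in range(2):
    for j in range(2):
        for k in range(2):
            lhs = sp.diff(g[i][j], xi[k])
            rhs = Gam(k, i, j) + GamD(k, j, i)
            assert sp.simplify(lhs - rhs) == 0            # dual coupling holds
print("duality relation verified")
```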
A statistical manifold is not flat in general. When it is flat, we have a local coordinate system ξ in which the affine connection vanishes, so that geodesics are linear,
$\xi^i(t) = c^i t + d^i.$ (26)
Theorem 2. For a statistical manifold M, when it is flat with respect to one affine connection, it is
automatically flat with respect to the dual affine connection.
We now define a divergence and the related dual structure. A function $D[p(x; \xi) : p(x; \xi')]$ (or $D[\xi : \xi']$) is called a divergence when the following criteria are satisfied:
(1) $D[\xi : \xi'] \geq 0$;
(2) $D[\xi : \xi'] = 0$ when and only when $\xi = \xi'$;
(3) in the Taylor expansion $D[\xi : \xi + d\xi] = \sum g_{ij}\, d\xi^i d\xi^j$ for infinitesimally small dξ, neglecting higher-order terms, $(g_{ij})$ is a positive-definite matrix.
A divergence is said to be invariant when it is unchanged if we use a sufficient statistic t(x) instead of x. It is in general asymmetric, $D[\xi : \xi'] \neq D[\xi' : \xi]$, and is regarded as the square of an (asymmetric) distance between ξ and ξ′. Rao proposed a divergence (Burbea & Rao, 1982) and studied the related Riemannian geometry. He also used a class of divergences to elucidate the performance of M-estimators (Bai et al., 1982).
A typical example is the Kullback–Leibler (KL-) divergence,
$D_{KL}[\xi : \xi'] = \int p(x; \xi) \log \frac{p(x; \xi)}{p(x; \xi')}\, dx,$ (27)
and $g_{ij}$ is the Fisher information matrix (hence the Riemannian metric tensor). Another example is the f-divergence (Csiszár, 1974; Morimoto, 1963), defined by using a convex function f(u) satisfying f(1) = 0.
A typical such function is
$f_\alpha(u) = \frac{4}{1 - \alpha^2}\left( 1 - u^{\frac{1+\alpha}{2}} \right).$ (30)
The KL-divergence is its special case in the limit α = −1, where f(u) = −log u.
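As a numerical sketch (my own; it assumes the common convention $D_f[p : q] = \sum_x p(x) f(q(x)/p(x))$, whose statement is omitted above), the following code evaluates the α-divergence generated by (30) for discrete distributions and shows that it approaches the KL-divergence (27) as α → −1.

```python
# Sketch (not from the paper): f-divergence with the alpha-generator (30),
# assuming D_f[p:q] = sum_x p f(q/p). As alpha -> -1 it approaches the
# KL-divergence (27), matching f(u) = -log(u).
import numpy as np

def f_divergence(p, q, alpha):
    u = q / p
    f = 4.0 / (1.0 - alpha**2) * (1.0 - u**((1.0 + alpha) / 2.0))
    return np.sum(p * f)

def kl(p, q):
    return np.sum(p * np.log(p / q))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.5, 0.3])
for alpha in (-0.99, -0.999, -0.9999):
    print(alpha, f_divergence(p, q, alpha))   # -> kl(p, q)
print("KL:", kl(p, q))
```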
A divergence induces a Riemannian metric together with dually coupled affine connections
(Eguchi, 1983). The Riemannian metric is given by
where $\partial'_k = \frac{\partial}{\partial \xi'^k}$. (34)
By exchanging ξ and ξ′, we obtain
$D^*[\xi : \xi'] = D[\xi' : \xi],$ (35)
which is called the dual divergence. It gives the same Riemannian metric, and the two affine connections are interchanged. When a divergence is symmetric, it is self-dual, giving a Riemannian metric with the self-dual Levi-Civita connection.
A typical example is the Gaussian distribution
$p(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{(x - \mu)^2}{2\sigma^2} \right\},$ (37)
where the mean μ and variance σ² are parameters. It can be rewritten in the canonical form (36) by using
$\mathbf{x} = (x_1, x_2) = (x, x^2),$ (38)
$\theta^1 = \frac{\mu}{\sigma^2}, \quad \theta^2 = -\frac{1}{2\sigma^2},$ (39)
and ψ(θ) corresponds to the normalisation constant. This is the cumulant generating function, called the free energy in physics. It is a convex function, since $\partial_i \partial_j \psi(\theta)$ is positive-definite.
Moreover, calculations show that the affine connection vanishes, that is, $\Gamma_{ijk}(\theta) = 0$, in the θ coordinate system. We call θ an e-coordinate system, since it originates from the exponential family. This implies that M is flat and the affine connection vanishes in terms of θ. M is automatically dually flat. Hence, we have another coordinate system $\eta = (\eta_1, \ldots, \eta_m)$ in which the dual affine connection Γ* vanishes. It is given by the well-known expectation parameters,
$\eta_i = E[x_i],$ (44)
which are obtained from ψ as
$\eta_i = \partial_i \psi(\theta).$ (45)
The dual potential φ(η) is given by the Legendre transformation
$\varphi(\eta) = \max_{\theta}\left\{ \theta \cdot \eta - \psi(\theta) \right\},$ (46)
and the corresponding pair (θ, η) satisfies
$\psi(\theta) + \varphi(\eta) - \theta \cdot \eta = 0.$ (47)
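A small numerical sketch of my own (the closed form of ψ(θ) for the Gaussian is assumed, not quoted from the paper) illustrates (38)-(47): starting from (μ, σ²) we form θ, differentiate ψ numerically to obtain η = (E[x], E[x²]) as in (44)-(45), and check (46)-(47); here φ(η) equals the negative differential entropy.

```python
# Numerical sketch (not from the paper): Gaussian in canonical coordinates.
import numpy as np

def psi(t1, t2):
    # log-normaliser of exp{t1*x + t2*x^2}, t2 < 0 (assumed closed form):
    # psi = -t1^2/(4 t2) + (1/2) log(-pi/t2)
    return -t1**2 / (4*t2) + 0.5*np.log(-np.pi / t2)

mu, sigma2 = 1.0, 2.0
t1, t2 = mu / sigma2, -1.0 / (2*sigma2)     # eq. (39)

eps = 1e-6                                  # numerical gradient, eq. (45)
eta1 = (psi(t1+eps, t2) - psi(t1-eps, t2)) / (2*eps)
eta2 = (psi(t1, t2+eps) - psi(t1, t2-eps)) / (2*eps)
print(eta1, eta2)                           # ~ (mu, mu^2 + sigma^2) = (1, 3)

phi = t1*eta1 + t2*eta2 - psi(t1, t2)       # Legendre relation (47)
print(phi, -0.5*np.log(2*np.pi*np.e*sigma2))  # phi = -(differential entropy)
```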
$\partial^i = g^{ij} \partial_j,$ (51)
and we use the Einstein summation convention: the summation symbol Σ is omitted when the same index appears in a term once as a superscript and once as a subscript. Therefore,
$\partial^j = \sum_i g^{ij} \partial_i = g^{ij} \partial_i.$ (52)
Here, $(g^{ij})$ is the inverse of $(g_{ij})$.
The KL-divergence between two distributions $p(x; \theta)$ and $p(x; \theta')$ of an exponential family is
$D_{KL}[\theta' : \theta] = \psi(\theta) + \varphi(\eta') - \theta \cdot \eta',$
where η′ is the m-coordinates of θ′. We easily see that the dually flat geometrical structures are derived from the KL-divergence.
Theorem 3. Fundamental theorem of dually flat manifolds. When M is dually flat, the following holds:
(1) There exist two affine coordinate systems θ and η with respect to the two dually flat affine connections, and two convex functions ψ(θ) and φ(η). The dual affine connections vanish in the respective coordinate systems.
The tangent vectors
$e_i = \frac{\partial}{\partial \theta^i}, \quad e^j = \frac{\partial}{\partial \eta_j},$ (56)
are bi-orthogonal, $\langle e_i, e^j \rangle = \delta_i^j$.
(4) The cubic tensor is given by $T_{ijk} = \partial_i \partial_j \partial_k \psi(\theta)$.
The canonical divergence in a dually flat manifold of probability distributions is given by (27) and is the KL-divergence. The KL-divergence has been used in various fields such as statistics, information theory and statistical physics as a convenient measure of discrepancy between two distributions, without any explicitly shown reasons. The present theory gives a justification for the KL-divergence, because it is the canonical divergence of a dually flat manifold. There are many divergences, including one proposed by Rao (Burbea & Rao, 1982). They are not invariant, having Riemannian metrics different from the Fisher information.
We remark that there are dually flat statistical manifolds other than the exponential family. A mixture family of probability distributions is given by
$M = \left\{ p(x; \eta) = \sum_{i=0}^{m} \eta_i p_i(x) : \sum \eta_i = 1 \right\},$ (60)
where $p_0(x), \ldots, p_m(x)$ are linearly independent prescribed probability distributions. A mixture family is not an exponential family in general but is dually flat. Banerjee et al. (2005) showed that, under a certain regularity condition, a dually flat manifold M is related to an exponential family provided the inverse Laplace transform of ψ(θ) exists.
Given a convex function ψ(θ), we can introduce a dually flat geometrical structure in which θ is a flat coordinate system and its dual is given by the Legendre transform
$\eta_i = \partial_i \psi(\theta).$ (61)
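A minimal sketch of my own of this construction, often called a Bregman divergence: any convex ψ induces the canonical divergence $D[\theta : \theta'] = \psi(\theta) - \psi(\theta') - \nabla\psi(\theta') \cdot (\theta - \theta')$. Taking ψ to be the log-sum-exp free energy of a discrete exponential family, this reproduces a KL-divergence.

```python
# Sketch (my own, not code from the paper): dually flat structure from a
# convex psi via its Bregman divergence and eta = grad psi(theta), eq. (61).
import numpy as np

def bregman(psi, grad_psi, th, th2):
    return psi(th) - psi(th2) - grad_psi(th2) @ (th - th2)

# Example convex psi: log-sum-exp, the free energy of a discrete
# exponential family; eta = grad psi is the softmax probability vector.
def psi(th):  return np.log(np.sum(np.exp(th)))
def grad(th): return np.exp(th) / np.sum(np.exp(th))

th, th2 = np.array([0.2, -0.1, 0.5]), np.array([1.0, 0.0, -1.0])
p, q = grad(th), grad(th2)
print(bregman(psi, grad, th, th2))
print(np.sum(q * np.log(q / p)))    # same value: KL[q : p]
```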
The e-geodesic connecting two points Q and R is linear in the θ coordinates, with constant velocity
$\dot{\theta} = \theta_R - \theta_Q.$ (65)
Dually, the m-geodesic connecting P and Q is linear in the η coordinates,
$\dot{\eta}(t) = \eta_Q - \eta_P.$ (67)
Theorem 4. Pythagorean theorem (Figure 3). When the m-geodesic connecting P and Q is orthogonal to the e-geodesic connecting Q and R,
$D[P : R] = D[P : Q] + D[Q : R],$
where D is the canonical divergence. Dually, when the e-geodesic connecting P and Q is orthogonal to the m-geodesic connecting Q and R,
$D^*[P : R] = D^*[P : Q] + D^*[Q : R].$
The projection theorem follows directly from the Pythagorean theorem. Let S be a smooth submanifold in a dually flat manifold M. Let P be a point outside S. We search for the minimiser $\hat{P} \in S$ of D[P : Q], Q ∈ S, or dually the minimiser of $D[Q : P] = D^*[P : Q]$.
Theorem 5. Projection theorem (Figure 4). The minimiser of D[P : Q], Q ∈ S, is obtained by m-projecting P to S such that the m-geodesic connecting P and $\hat{P}$ is orthogonal to S.
Dually, the minimiser of D*[P : Q] is obtained by e-projecting P to S such that the e-geodesic connecting P and $\hat{P}^*$ is orthogonal to S. The m-projection is unique when S is e-flat, and the e-projection is unique when S is m-flat. The proof is immediate from the Pythagorean theorem.
The two theorems are useful in various applications (Amari, 2016).
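A numerical sketch of Theorem 4 (my own example, not from the paper): in the manifold of 2×2 joint distributions, the independence submanifold S is e-flat, the m-projection of P onto S is the product Q of its marginals, and for any other independent R the Pythagorean relation holds exactly with D the KL-divergence.

```python
# Numerical sketch (my own example) of the Pythagorean theorem:
# D[P:R] = D[P:Q] + D[Q:R], where Q is the m-projection of P onto the
# e-flat independence submanifold and R is any other independent point.
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

P = np.array([[0.30, 0.10],
              [0.15, 0.45]])                   # a joint distribution
Q = np.outer(P.sum(axis=1), P.sum(axis=0))     # m-projection onto S
R = np.outer([0.7, 0.3], [0.6, 0.4])           # any other point of S

print(kl(P, R))                                # D[P : R]
print(kl(P, Q) + kl(Q, R))                     # equals D[P:Q] + D[Q:R]
```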
Given n observations $D = \{x_1, \ldots, x_n\}$, the MLE in an exponential family is written as
$\hat{\eta} = \frac{1}{n} \sum x_i$ (70)
in terms of the dual parameter η. We call the point η^ ∈ M the observed point given directly from
the data D. The empirical distribution of D does not belong to M but the observed point lies in
M. The observed point converges to the true point as n tends to infinity.
We now consider statistical inference in a curved exponential family $q(x; u)$ specified by parameters $u = (u^a) = (u^1, \ldots, u^r)$, r < m. It is a submodel embedded in M, called an (m, r)-exponential family.
The MLE is the minimiser
$\hat{u} = \arg\min_{S} D_{KL}[\hat{\eta} : S].$ (72)
Because of the Pythagorean theorem, it is given by the m-projection of the observed point $\hat{\eta}$ to S. It is Fisher efficient, and the asymptotic variance of the MLE is given by
$\mathrm{Cov}[\hat{u}] = \frac{1}{n}\left( B^i_a\, g_{ij}\, B^j_b \right)^{-1},$ (73)
where the matrix B is
$B^i_a = \frac{\partial \theta^i(u)}{\partial u^a}.$ (74)
The matrix $\bar{g}_{ab} = B^i_a g_{ij} B^j_b$ is the Fisher information matrix of the curved exponential family. For the higher-order asymptotic theory of statistical testing, see Kumon & Amari (1983) and Amari & Nagaoka (2000).
Suppose that a probability distribution p is specified by a parameter of interest u, but the form of p is otherwise unknown, having infinite degrees of freedom. Here, we assume that u is a scalar, but it is easy to generalise to the vector case.
A simple example is a location model,
having two types of parameters u and ξ. Here, u is the parameter of interest and ξ is the nuisance parameter, which we do not care about. Let $D = \{x_1, \ldots, x_n\}$ be n independent observations, where $x_i$ is an observation from $q(x; u, \xi_i)$. Here, the parameter of interest u is common to all observations, but $\xi_i$ may differ each time. This model was proposed by Neyman & Scott (1948), who showed that the MLE is not necessarily consistent or efficient. Then, what is the best estimator in this situation in the asymptotic sense? The Neyman–Scott problem bothered statisticians for many years until Bickel et al. (1994) and Amari & Kawanabe (1997) presented convincing theories.
It is convenient to assume that the unknown values $\xi_i$ of the nuisance parameter are randomly generated from an unknown probability distribution k(ξ). Then, each $x_i$ is regarded as an iid random variable generated from a distribution belonging to the model
$p(x; u, k) = \int q(x; u, \xi)\, k(\xi)\, d\xi.$
SM includes the nuisance parameter k(ξ) of functional degrees of freedom and is called a semiparametric model (Begun et al., 1983).
SM has functional degrees of freedom, so we need to treat a function space of probability distributions. We do not present a rigorous geometrical theory, which has not yet been completed (Ay et al., 2017; Pistone & Sempi, 1995). Instead, we give an intuitive argument, without specifying the conditions under which our theory holds. But the theory is useful for practical applications.
Let us consider the tangent space of SM at a point (u, k). Since a small deviation w(x) of the log probability satisfies
$E[w(x)] = 0,$ (83)
we consider the space of such random variables w(x), where E is the expectation with respect to $p(x; u, k)$, as the tangent space of SM at (u, k), provided
$E\left[ \{w(x)\}^2 \right] < \infty$ (85)
is satisfied. Then the tangent vectors w(x) form a Hilbert space. The inner product of two tangent vectors $w_1(x)$ and $w_2(x)$ is given by
$\langle w_1, w_2 \rangle_{u,k} = E\left[ w_1(x)\, w_2(x) \right].$ (86)
The score direction of u spans the tangent subspace $T_u$. The nuisance tangent space $T_k$ consists of the derivatives of the log likelihood along changes of the nuisance function, where k = c(t) is a curve in SM passing through (u, k) at t = 0; it is of infinite dimensions. The above two do not cover the entire $T_{u,k}$, and it includes other vectors a(x) which are orthogonal to both $T_u$ and $T_k$. They form a subspace called the auxiliary tangent subspace $T_a$. The tangent space is decomposed into a direct sum,
$T_{u,k} = T_u \oplus T_k \oplus T_a.$ (89)
An estimating function f(x, u) is a function of x and u satisfying
$E_{u,\xi}\left[ f(x; u) \right] = 0,$ (90)
$A = E_{u,\xi}\left[ \partial_u f(x; u) \right] \neq 0,$ (91)
where $E_{u,\xi}$ represents expectation with respect to $q(x; u, \xi)$ for any ξ. Note that (90) and (91) hold for expectation with respect to any p(x; u, k), which is a linear combination of the $q(x; u, \xi)$'s.
An estimator $\hat{u}$ is obtained from
$\sum f(x_i; u) = 0,$ (92)
where the expectation in (90) is replaced by the arithmetic sum. It is easy to prove that the estimator is consistent and asymptotically Gaussian, having the variance
$V[\hat{u}] = \frac{1}{n}\, \frac{E\left[ \{f(x; u)\}^2 \right]}{A^2},$ (93)
in a similar way as we prove that the MLE is asymptotically Gaussian (Lindsay, 1982).
We need to show when an estimating function exists, and which is better when there are many. In order to characterise estimating functions, we use parallel transports of tangent vectors.
The e- and m-transports of a tangent vector w(x) from $(u, k_1)$ to $(u, k_2)$ are defined, respectively, by
$\Pi^e_{k_1 \to k_2} w(x) = w(x) - E_{u,k_2}[w(x)],$ (94)
$\Pi^m_{k_1 \to k_2} w(x) = \frac{p(x; u, k_1)}{p(x; u, k_2)}\, w(x).$ (95)
It is easy to see that the parallel transports keep the inner product invariant in the following sense:
$\langle w_1(x), w_2(x) \rangle_{u,k_1} = \left\langle \Pi^e_{k_1 \to k_2} w_1(x),\; \Pi^m_{k_1 \to k_2} w_2(x) \right\rangle_{u,k_2}.$ (96)
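A quick numerical check of (94)-(96) (my own, on a finite sample space; the two densities p1 and p2 are arbitrary assumed examples): the inner product at (u, k1) is preserved when one vector is e-transported and the other m-transported to (u, k2).

```python
# Numerical sketch (my own) of the invariance (96) on 5 atoms.
import numpy as np

rng = np.random.default_rng(1)
p1 = rng.dirichlet(np.ones(5))        # p(x; u, k1), an assumed example
p2 = rng.dirichlet(np.ones(5))        # p(x; u, k2), an assumed example

w1 = rng.normal(size=5); w1 -= p1 @ w1    # tangent vectors at k1:
w2 = rng.normal(size=5); w2 -= p1 @ w2    # E_{k1}[w] = 0, cf. (83)

e_transport = w1 - p2 @ w1                # eq. (94)
m_transport = (p1 / p2) * w2              # eq. (95)

print(p1 @ (w1 * w2))                     # <w1, w2> at k1
print(p2 @ (e_transport * m_transport))   # equal, cf. (96)
```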
An estimating function f(x, u) is orthogonal to the nuisance tangent space and is invariant under the e-parallel transport from any $(u, k_1)$ to $(u, k_2)$,
$\Pi^e_{k_1 \to k_2} f(x; u) = f(x; u).$ (99)
Let
$s(x; u, k) = \frac{d}{du} \log p(x; u, k)$ (100)
be the score vector, and let $s^I(x; u, k)$ be its projection to the subspace orthogonal to $T_k$. We call it an information score. We then have the following theorem.
This proves that an estimating function exists when the information score is non-vanishing.
The optimal estimating function is $s^I(x; u, k)$, which we do not know since k is unknown. However, this gives a good means of selecting an estimating function: even when we choose k incorrectly, it is still an estimating function. When the true nuisance function is $k_0$, the optimal estimating function is $s^I(x; u, k_0)$. An estimating function using an incorrect k is written, by using some auxiliary a(x), as a modification of the information score, and it still gives a consistent estimator. This is not true when we use the statistical model p(x; u, k) with an incorrect k for estimating u.
Consider specimens for which the weight is proportional to the volume,
$w = u\, v,$
where v is the volume and w is the weight of a specimen. We assume we have observations $(x_i, y_i)$, where $x_i$ are noisy observations of the volumes and $y_i$ are noisy observations of the weights of various specimens. They are given by
$x_i = \xi_i + \varepsilon_i,$ (104)
$y_i = u \xi_i + \varepsilon'_i,$ (105)
where $\varepsilon_i$ and $\varepsilon'_i$ are independently subject to a Gaussian distribution with mean 0 and variance σ².
The joint distribution of $\mathbf{x} = (x, y)$ is written as
$q(x, y; u, \xi) = \frac{1}{2\pi\sigma^2} \exp\left\{ -\frac{1}{2\sigma^2} \left[ (x - \xi)^2 + (y - u\xi)^2 \right] \right\}.$ (106)
We rewrite (106) in the form
$q(x, y; u, \xi) = \exp\left\{ \xi\, s(x, y; u) + r(x, y) - \psi(\xi, u) \right\},$ (107)
where we put s = s(x, y; u), the statistic conjugate to ξ, and
$r(x, y) = -\frac{1}{2}\left( x^2 + y^2 \right).$ (109)
Since ξ is a random variable subject to k(ξ) in SM, we can study the conditional distribution of ξ given s in (106). It is given by
$p(\xi \mid s) = \frac{k(\xi) \exp\left\{ \xi s + r - \psi \right\}}{p(x; u, k)}.$ (110)
We have
where h is a function depending on k. Using this, we have an explicit form of the nuisance
tangent space Tk. It is spanned by s irrespective of k,
Hence, it is invariant under the parallel transport from k1 to k2. We further calculate the
information score, obtaining
where h(z) is an arbitrary function. The best function h depends on the unknown k.
There are a number of estimators for this specific problem. We show some of them in the following:
1) The MLE: this is the estimator maximising the log likelihood of all observations with respect to all the unknown parameters u and $\xi_1, \ldots, \xi_n$. It minimises the sum of squares of the lengths of the perpendiculars projecting the observed $(x_i, y_i)$ to the regression line
$y = u x.$ (117)
2) The least squares solution, obtained by the ordinary regression of y on x.
3) The ratio of the total weight to the total volume, simply given by
$\hat{u} = \frac{\sum y_i}{\sum x_i}.$ (119)
The least squares solution is not good in this case, since it does not even give a consistent estimator. Both the MLE and the ratio estimator are consistent. Which is better? It depends on the distribution k(ξ). Roughly speaking, when the average of k(ξ) is much larger than its standard deviation, the MLE is better, but when it is smaller, the ratio estimator is better.
Amari & Kawanabe (1997) proposed an estimator whose estimating function includes a parameter c. When c = 0, it gives the MLE; when c → ∞, it gives the ratio estimator. They proposed a method of determining c from the observed data.
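A simulation sketch of my own (the total-least-squares form of the MLE follows the description below (117); the parameter values are arbitrary assumptions) compares the mean-squared errors of the MLE and the ratio estimator (119) under two choices of k(ξ), without presupposing which wins:

```python
# Simulation sketch (my own) for the model (104)-(105): compare the MLE
# (orthogonal regression through the origin) with the ratio estimator (119).
import numpy as np

def mle(x, y):
    # slope of the total-least-squares line y = u x through the origin;
    # root of Sxy*u^2 + (Sxx - Syy)*u - Sxy = 0 (assumes sum(x*y) > 0)
    a = np.sum(y**2 - x**2)
    b = np.sum(x * y)
    return (a + np.hypot(a, 2*b)) / (2*b)

def ratio(x, y):
    return np.sum(y) / np.sum(x)

rng = np.random.default_rng(0)
u, sigma, n, trials = 2.0, 0.5, 200, 2000       # arbitrary example values
for xi_mean, xi_sd in [(5.0, 0.5), (1.0, 2.0)]:  # two choices of k(xi)
    err_m, err_r = [], []
    for _ in range(trials):
        xi = rng.normal(xi_mean, xi_sd, n)
        x = xi + rng.normal(0, sigma, n)
        y = u*xi + rng.normal(0, sigma, n)
        err_m.append((mle(x, y) - u)**2)
        err_r.append((ratio(x, y) - u)**2)
    print(xi_mean, xi_sd, np.mean(err_m), np.mean(err_r))
```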
where $x = (x^{(1)}, \ldots, x^{(n)})$, $\theta = (\theta^{(1)}, \ldots, \theta^{(n)})$. The corresponding η-coordinates are $\eta = (\eta^{(1)}, \ldots, \eta^{(n)})$. Let $S_k$ be a submodel defined by
$S_k = \left\{ p(x; \theta) \mid \theta^{(k+1)} = \cdots = \theta^{(n)} = 0 \right\}.$ (122)
Since the coordinate $\theta^{(i)}$ is orthogonal to $\eta^{(j)}$ (j ≠ i), or more precisely, the tangent vector $e_i$ along the coordinate $\theta^{(i)}$ is orthogonal to the tangent vector $e^j$ along the coordinate $\eta^{(j)}$ (j ≠ i), the parameters $\Theta_{k+1}$ are orthogonal to the parameters $H_k$.
It is convenient to define a new coordinate system, $\Xi = (H_k, \Theta_{k+1})$, which consists of the lower part of the η coordinates and the higher part of the θ coordinates. We call it a mixed coordinate system (Amari, 2001). Let us consider a submanifold specified by
$H_k = \left( c^{(1)}, \ldots, c^{(k)} \right)$ (123)
for fixed vectors $c^{(1)}, \ldots, c^{(k)}$. Complementarily, we consider a submanifold specified by $\Theta_{k+1} = \left( d^{(k+1)}, \ldots, d^{(n)} \right)$ for fixed vectors $d^{(k+1)}, \ldots, d^{(n)}$. These two submanifolds are orthogonal to each other for any c and d. Note that two submanifolds $H_k$ and $H_{k+1}$, or $\Theta_k$ and $\Theta_{k+1}$, are not orthogonal.
We consider a simple binary model for $x = (x_1, \ldots, x_n)$, each $x_i$ taking the value 0 or 1. Given an observed contingency table $T(x_1, x_2, \ldots, x_n)$, we are interested in knowing how the $x_i$ are mutually interrelated in the probability distribution $p(x_1, \ldots, x_n)$. There are pairwise interactions, three-way interactions and further higher-order interactions among the variables $x_1, \ldots, x_n$. How can we quantify these interactions? The mixed coordinates and the k-cuts are useful for defining the degrees of interactions among random variables. We show that the interaction terms can be decomposed orthogonally into pairwise, triplewise and higher-order ones.
We use a neural network of binary neurons as a typical example for intuitive explanation. The network consists of n connected neurons interacting with one another. Each neuron takes two states, excited or quiescent. Let $x_1, \ldots, x_n$ be n binary random variables, where $x_i = 1$ represents that the i-th neuron is excited and $x_i = 0$ that it is quiescent. The current state $x = (x_1, \ldots, x_n)$ is regarded as a vector random variable, and its joint probability distribution $p(x_1, \ldots, x_n)$ is written in the form of an exponential family,
$p(x; \theta) = \exp\left\{ \sum \theta_i x_i + \sum \theta_{ij} x_i x_j + \cdots + \theta_{12\cdots n} x_1 \cdots x_n - \psi(\theta) \right\}.$ (124)
Its e-coordinates are organised as $\theta = (\Theta_1, \Theta_2, \ldots, \Theta_n)$, where the subvectors $\Theta_k = \left( \theta_{i_1 \cdots i_k} \right)$, $i_1 < \cdots < i_k$, summarise the k-th order terms. The corresponding m-coordinates are $H_k = \left( \eta_{i_1 \cdots i_k} \right)$, where $\eta_{i_1 \cdots i_k} = E[x_{i_1} \cdots x_{i_k}]$. η is decomposed as
$\eta = (H_1, \ldots, H_n).$ (125)
The subvector $H_k$ represents the probabilities of k neurons jointly firing, that is, the probabilities of $x_{i_1} x_{i_2} \cdots x_{i_k} = 1$. It is important to see that the coordinate axes of $H_k$ are orthogonal to those of $\Theta_{k'}$ when k′ > k. A submodel
$S_k = \left\{ \theta \mid \Theta_{k+1} = \cdots = \Theta_n = 0 \right\}$ (126)
consists of the distributions having interactions of order up to k.
When $\theta_{12} = 0$, two neurons fire independently, $p(x) = p_1(x_1) p_2(x_2)$, and there is no mutual interaction between the two. When the two neurons are not independent, there is mutual interaction, and we want to know its degree. The covariance $E[x_1 x_2] - E[x_1] E[x_2]$ is 0 when the two are independent, so it shows a degree of mutual interaction. However, there are many such quantities that vanish when the two neurons fire independently. The increase in log likelihood due to an increase in firing rates is correlated with the increase in log likelihood due to the covariance, so the firing rates and the covariance are not separated. Geometrically speaking, we want a quantity such that the direction due to an increase in interaction is orthogonal to the directions due to increases in firing rates. Such a quantity represents the degree of interaction that does not change even when the firing rates change. The covariance is not such a quantity.
We know that the e-coordinate $\theta_{12}$ is orthogonal to the firing rates $\eta_1$ and $\eta_2$, so it represents the interaction orthogonal to the firing rates. As is seen in Figure 5, $S_1$, defined by $\theta_{12} = 0$, is an e-flat submanifold consisting of all the independent distributions.
Given a general distribution, we m-project it to $S_1$. Then, we have a distribution in $S_1$ which has the same firing rates but no interaction. The submanifold $M_1$, defined by $\eta_1 = c_1$, $\eta_2 = c_2$ for constants $c_1$ and $c_2$, is m-flat, consisting of the distributions having the same firing rates $c_1$ and $c_2$ but different degrees of interaction $\theta_{12}$. They are orthogonal, intersecting at $\eta_i = c_i$, $\theta_{12} = 0$.
Let $p_0$ be the distribution specified by θ = 0, that is, $\eta_1 = \eta_2 = 1/2$ and $\theta_{12} = 0$. The KL-divergence from p to $p_0$ is decomposed as
$D_{KL}[p : p_0] = D_{KL}[\hat{p} : p_0] + D_{KL}[p : \hat{p}],$ (129)
where $\hat{p}$ is the m-projection of p to $S_1$. The first term shows the effect of deviations of the firing rates from 1/2, that is, the KL-divergence from $\hat{p}$ to $p_0$. The second represents the effect of interaction, shown by the KL-divergence from p to the independent model $\hat{p}$.
We next consider the case of n = 3 neurons. It has a hierarchical structure $S_1 \subset S_2 \subset S_3$, where $S_1$ is the independent model and the $\eta_i$ show the firing rates of the neurons. Since the directions of $\theta_{ij}$ and $\theta_{123}$ are orthogonal to the directions of the $\eta_i$, they together represent degrees of neural interactions orthogonal to the firing rates. Further, consider the submanifold specified by $M_2 = \left\{ \eta \mid \eta_i = c_i,\; \eta_{ij} = c_{ij} \right\}$ for fixed $c_i$ and $c_{ij}$. The pair $(\eta_i, \eta_{ij})$ represents both the firing rates and the pairwise joint firing rates, but does not specify the triple firing rate $\eta_{123} = E[x_1 x_2 x_3]$. The quantity that is orthogonal to both the firing rates and the pairwise firing rates is given by $\theta_{123}$. So,
$\theta_{123} = \log \frac{p_{111}\, p_{100}\, p_{010}\, p_{001}}{p_{110}\, p_{101}\, p_{011}\, p_{000}}$ (130)
represents the triplewise interaction orthogonal to the firing and pairwise firing rates.
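The log-odds formula (130) is directly computable from a 2×2×2 probability table; the sketch below (my own, with arbitrary example tables) checks that $\theta_{123}$ vanishes for an independent distribution and is non-zero when a genuine triple interaction is present.

```python
# Sketch (my own): theta_123 of (130) from a 2x2x2 table p[x1, x2, x3].
import numpy as np

def theta_123(p):
    num = p[1,1,1] * p[1,0,0] * p[0,1,0] * p[0,0,1]
    den = p[1,1,0] * p[1,0,1] * p[0,1,1] * p[0,0,0]
    return np.log(num / den)

# independent distribution: theta_123 = 0
r = np.array([0.3, 0.6, 0.5])                 # example firing rates
p_ind = np.einsum('i,j,k->ijk',
                  [1-r[0], r[0]], [1-r[1], r[1]], [1-r[2], r[2]])
print(theta_123(p_ind))      # ~ 0.0

# a distribution with a genuine triple interaction
p = np.ones((2, 2, 2)) / 8
p[1,1,1] += 0.05; p[0,0,0] -= 0.05
print(theta_123(p))          # > 0
```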
A distribution p is m-projected to $S_2$ and $S_1$, and the KL-divergence is decomposed as
$D_{KL}[p : p_0] = D_{KL}[p : \hat{p}_2] + D_{KL}[\hat{p}_2 : \hat{p}_1] + D_{KL}[\hat{p}_1 : \hat{p}_0],$ (131)
where $\hat{p}_n = p$, $\hat{p}_k$ is the projection of p to $S_k$ and $\hat{p}_0 = p_0$. This shows that $D_{KL}[\hat{p}_{k+1} : \hat{p}_k]$ represents the effect of the interactions of order k + 1. We have shown only binary models, but the theory generalises to any hierarchical dually flat model.
6 Conclusions
Since Rao's proposal three quarters of a century ago, information geometry, the geometrical theory of the manifold of probability distributions, has been developed widely, giving useful tools to various fields related to probability. This paper has reviewed only some of the interesting structures to which information geometry makes remarkable contributions, such as dually flat structures, and discussed some applications.
There are some new developments in this area. Wong (2018) gives a new mathematical idea,
which generalises the Legendre transformation, and presents a new theory applicable to
projectively flat manifolds. The Pythagorean and projection theorems hold, where the Rényi
divergence plays the role of the canonical divergence in a dually projectively flat manifold.
The Wasserstein distance gives another type of divergence in a manifold of probability distributions and has broad applicability. The distance reflects the metric structure of the base space on which the distributions are defined. However, it is not invariant. It is interesting to find a theory connecting the two geometries (Amari et al., 2019). Li & Zhou (2019) proposed a fundamental theory unifying the two geometries. See Amari & Matsuda (2021) for more details on Wasserstein statistics. Amari (2016) discusses various applications.
References
Amari, S. 1982. Differential geometry of curved exponential families—curvature and information loss. Ann. Stat., 10,
357–385.
Amari, S. 1985. Differential-Geometrical Methods in Statistics, Lecture Notes in Statistics, Vol. 28. Springer.
Amari, S. 2001. Information geometry on hierarchy of probability distributions. IEEE Trans. Inf. Theory, 47,
1701–1711.
Amari, S. 2016. Information Geometry and Its Applications. Springer.
Amari, S., Karakida, R., Oizumi, M. & Cuturi, M. 2019. Information geometry for regularized optimal transport and
barycenters of patterns. Neural Comput., 31, 827–848.
Amari, S. & Kawanabe, M. 1997. Information geometry of estimating functions in semi-parametric statistical models.
Bernoulli, 3, 29–54.
Amari, S. & Matsuda, T. 2021. Wasserstein statistics in 1D location-scale model. Annals of the Institute of Statistical Mathematics.
Amari, S. & Nagaoka, H. 2000. Methods of information geometry. American Mathematical Society and Oxford
University Press.
Ay, N., Jost, J., Lê, H.V. & Schwachhöfer, L. 2017. Information Geometry. Springer.
Bai, Z.D., Rao, C.R. & Wu, Y. 1982. M-estimation of multivariate linear regression parameters under a convex
discrepancy function. Stat. Sin., 2, 237–264.
Banerjee, A., Merugu, S., Dhillon, I. & Ghosh, J. 2005. Clustering with Bregman divergences. J. Mach. Learn. Res.,
6, 1705–1749.
Barndorff-Nielsen, O.E. 1978. Information and Exponential Families in Statistical Theory. Wiley.
Begun, J.M., Hall, W.J., Huang, W.M. & Wellner, J.A. 1983. Information and asymptotic efficiency in
parametric-nonparametric models. Ann. Stat., 11, 432–452.
Bickel, P.J., Klaassen, C.A.J., Ritov, Y. & Wellner, J.A. 1994. Efficient and Adaptive Estimation for Semiparametric Models. Johns Hopkins University Press.
Burbea, J. & Rao, C.R. 1982. Entropy differential metric, distance and divergence measures in probability spaces: a
unified approach. J. Multivar. Anal., 12, 575–596.
Chentsov, N.N. 1982. Statistical Decision Rules and Optimal Inference. AMS (Russian original: Nauka, 1972).
Cramér, H. 1946. Mathematical Methods of Statistics. Princeton University Press.
Csiszár, I. 1974. Information measures: a critical survey, pp. 83–86, Proc. 7th Conf. Inf. Theory, Prague, Czech
Republic.
Dawid, A.P. 1975. Discussion of Efron (1975).
Efron, B. 1975. Defining the curvature of a statistical problem (with application to second order efficiency). Ann. Stat.,
3, 1189–1242.
Eguchi, S. 1983. Second order efficiency of minimum contrast estimators in a curved exponential family. Ann. Stat.,
11, 793–803.
Godambe, V.P. 1991. Estimating Functions. Oxford University Press.
Kumon, M. & Amari, S. 1983. Geometrical theory of higher-order asymptotics of test, interval estimator and conditional inference. Proc. Royal Society of London, A, 387, 429–458.
Li, W. & Zhou, W. 2019. Wasserstein information matrix. arXiv.
Lindsay, B. 1982. Conditional score functions: some optimality results. Biometrika, 69, 503–512.
Morimoto, T. 1963. Markov processes and the H-theorem. J. Phys. Soc Jap., 12, 328–331.
Nagaoka, H. & Amari, S. 1982. Differential geometry of smooth families of probability distributions. METR 82-7, University of Tokyo.
Neyman, J. & Scott, E.L. 1948. Consistent estimates based on partially consistent observations. Econometrica, 16, 1–32.
Pistone, G. & Sempi, C. 1995. An infinite-dimensional geometric structure on the space of all the probability
measures equivalent to a given one. Ann. Stat., 23, 1543–1561.
Rao, C.R. 1945. Information and accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math.
Soc., 37, 81–91.
Rao, C.R. 1962. Efficient estimators and optimum inference in large samples. J. R. Stat. Soc. B., 24, 46–72.
Rao, C.R., Sinha, B.K. & Subramanyam, K. 1982. Third order efficiency of the maximum likelihood estimator in the
multinomial distributions. Stat. Decisions, 1, 1–16.
Wong, T.-K.L. 2018. Logarithmic divergences from optimal transport and Rényi geometry. Inf. Geom., 1, 39–78.