Information Geometry
Shun-ichi Amari1,2
1 ACRO, Teikyo University, Tokyo, Japan
2 RIKEN Center for Brain Science, Saitama, Japan
E-mail: [email protected]
Summary
Statistical inference is constructed upon a statistical model consisting of a parameterised family
of probability distributions, which forms a manifold. It is important to study the geometry of the
manifold. It was Professor C. R. Rao who initiated information geometry in his monumental paper
published in 1945. It not only included fundamentals of statistical inference such as the
Cramér–Rao theorem and Rao–Blackwell theorem but also proposed differential geometry of a
manifold of probability distributions. It is a Riemannian manifold where Fisher–Rao information
plays the role of the metric tensor. It took decades for the importance of the geometrical structure
to be recognised. The present article reviews the structure of the manifold of probability distributions and its applications, and shows how Professor Rao's original idea has been developed and popularised across the statistical sciences in a wide sense, including AI, signal processing, the physical sciences and others.
Key words: information geometry; Fisher–Rao information; dual affine connections; generalised
Pythagorean theorem.
1 Introduction
Information geometry studies the structure of a regular statistical model, which forms a
manifold M. It consists of probability distributions parameterised by an m-dimensional vector,
where the parameters constitute a coordinate system. It is natural to ask how different two distributions in M are, that is, to seek a distance or divergence between two distributions.
It was Professor Rao’s monumental paper (Rao, 1945) that answered the question. He
proposed a fundamental theory to show that M is a Riemannian manifold, where the Fisher
information matrix plays the role of the Riemannian metric. At the same time, Rao presented
a fundamental theorem in statistics, which was later called the Cramér–Rao theorem, because
it was also independently proved by Cramér (1946). Rao calculated the Riemannian distance
between two Gaussian distributions with different means and variances. The present review summarises information geometry, initiated by C. R. Rao, and shows how his idea has been developed.
The geometrical approach was so fundamental that it took decades for researchers to develop
his idea. Chentsov (1982) followed the idea and answered the question of why the Fisher
information should be used, by proposing an invariance criterion. The Fisher information matrix
which Rao used is unique from the point of view of invariance under Markovian morphisms of
statistical models. He further showed that a third-order symmetric tensor exists and is also
unique. These two tensors together define invariant affine connections to be introduced in M.
The score functions of the model are
$s_i(x; \xi) = \partial_i \log p(x; \xi),$ (2)
where $\partial_i = \frac{\partial}{\partial \xi^i}$.
This is a fundamental quantity in statistics. For n independent observations $x_1, \ldots, x_n$, the MLE $\hat{\xi}$ is the maximiser of the log likelihood and is obtained as the solution of the likelihood equations
$\sum_{j=1}^{n} s_i(x_j; \xi) = 0, \quad i = 1, \ldots, m.$ (3)
The Fisher information matrix is
$g_{ij}(\xi) = E\left[ s_i(x; \xi)\, s_j(x; \xi) \right],$ (4)
where E is the expectation with respect to $p(x; \xi)$. The score is a vector representing how the log likelihood changes as ξ changes from ξ to ξ + dξ,
$d \log p(x; \xi) = \sum s_i(x; \xi)\, d\xi^i.$ (5)
A component $s_i(x; \xi)$ of the score is regarded as the tangent vector $e_i$ of M in the direction of the coordinate curve $\xi^i$; see Figure 1.
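As a concrete illustration, here is a minimal numerical sketch of my own (not taken from the paper) for the Gaussian model with coordinates ξ = (μ, σ): the score has zero expectation, and its second moments recover the Fisher information matrix, which for this model is diag(1/σ², 2/σ²).

```python
# Minimal sketch (not from the paper): scores of the Gaussian model and
# their second moments, which estimate the Fisher information matrix (4).
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 2.0

def score(x, mu, sigma):
    """s_i(x; xi) = d log p(x; xi) / d xi_i for xi = (mu, sigma)."""
    s_mu = (x - mu) / sigma**2
    s_sigma = ((x - mu)**2 - sigma**2) / sigma**3
    return np.stack([s_mu, s_sigma])

x = rng.normal(mu, sigma, size=1_000_000)
s = score(x, mu, sigma)
print("E[s]     ~", s.mean(axis=1))      # ~ [0, 0]: the score property
print("E[s s^T] ~", (s @ s.T) / x.size)  # ~ diag(1/sigma^2, 2/sigma^2)
```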
This implies that the geometrical tangent vector
$e_i = \frac{\partial}{\partial \xi^i}$ (6)
is represented by the score $s_i(x; \xi)$, which is a random variable. The tangent vector $d\xi = (d\xi^i)$ is the score
$d\xi = \sum d\xi^i e_i = \sum d\xi^i s_i(x; \xi).$ (7)
When we identify a small change dξ with a small increment of the corresponding log likelihood,
$d \log p(x; \xi) = \sum d\xi^i e_i,$ (8)
the tangent space of M is represented by a space of random variables, the scores.
Consider a statistic t(x) for which the density factorises as
$p(x; \xi) = p_1(t(x); \xi)\, p_2(x),$ (9)
where $p_2$ does not depend on ξ. Intuitively, t(x) is sufficient for estimating ξ, because the latter term does not include ξ. We consider another statistical model M′ consisting of the distributions of t, where, by an abuse of notation, we let $p(t; \xi)$ be the probability density of t given ξ. Since the score vector in M′ is equal to that in M, M and M′ are geometrically the same. In particular, a reversible transformation of x to y, $y = f(x)$, is a sufficient statistic. This implies that the geometrical structures g and T do not depend on the representation of the random variable, x or f(x).
We further require that the geometrical structure should be constructed in such a way that it is
invariant under the transformation of x to its sufficient statistics. Chentsov (1982) required this
by using Markov morphisms in the discrete case and proved a fundamental theorem that the
invariance defines second-order and third-order symmetric tensors uniquely, which are g and
T given in (10) and (12) up to a common scale.
The manifold equipped with the Fisher information Riemannian metric is not Euclidean, but
curved in general. This is shown by calculating the Riemann–Christoffel curvature, where the
Levi-Civita affine connection
$\Gamma^0_{ijk} = [ij, k] = \frac{1}{2}\left( \partial_i g_{jk} + \partial_j g_{ik} - \partial_k g_{ij} \right)$ (15)
calculated from $g_{ij}$ is used, as is the usual way in Riemannian geometry. However, by using T,
we modify it to give two new invariant affine connections Γ and Γ∗,
$\Gamma_{ijk} = [ij, k] - \frac{1}{2} T_{ijk},$ (16)
$\Gamma^*_{ijk} = [ij, k] + \frac{1}{2} T_{ijk}.$ (17)
A geodesic with respect to Γ*, for example, is a solution of
$\frac{d^2 \xi^i(t)}{dt^2} + \sum \Gamma^{*i}_{jk} \frac{d\xi^j}{dt} \frac{d\xi^k}{dt} = 0.$ (19)
Let
$X = \sum X^i e_i$ (20)
be a tangent vector defined at ξ. The parallel transport of a tangent vector $X = \sum X^i e_i$ along a curve
$c: \xi^i(t)$ (21)
is defined by the connection Γ, and the dual parallel transport X*(t) by Γ*.
When the inner product of X ðtÞ and X ∗ ðtÞ does not change by the two parallel transports of X,
the two affine connections are said to be dually coupled. It is straightforward to prove the
following theorem (Amari & Nagaoka, 2000).
Theorem 1. The two affine connections Γ and Γ ∗ are dually coupled.
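The duality of Theorem 1 can be checked symbolically. The following sketch is my own (the trinomial model is an assumed example, not one used in the paper): it computes g, the cubic tensor T, the two connections (16)-(17), and verifies one standard form of dual coupling, $\partial_k g_{ij} = \Gamma_{ki,j} + \Gamma^*_{kj,i}$.

```python
# Symbolic sketch (assumption: trinomial model p = (xi1, xi2, 1-xi1-xi2)):
# compute g_ij, T_ijk = E[d_i l d_j l d_k l], the dual connections
# (16)-(17), and check d_k g_ij = Gamma_{ki,j} + Gamma*_{kj,i}.
import sympy as sp

xi = sp.symbols('xi1 xi2', positive=True)
p = [xi[0], xi[1], 1 - xi[0] - xi[1]]                     # 3 outcome probabilities
dl = [[sp.diff(sp.log(px), v) for v in xi] for px in p]   # scores per outcome

E = lambda f: sum(px * f(k) for k, px in enumerate(p))    # expectation
g = [[E(lambda k: dl[k][i]*dl[k][j]) for j in range(2)] for i in range(2)]
T = [[[E(lambda k: dl[k][i]*dl[k][j]*dl[k][m]) for m in range(2)]
      for j in range(2)] for i in range(2)]

def LC(i, j, k):                                          # [ij, k] of eq. (15)
    return sp.Rational(1, 2)*(sp.diff(g[j][k], xi[i])
           + sp.diff(g[i][k], xi[j]) - sp.diff(g[i][j], xi[k]))

Gam  = lambda i, j, k: LC(i, j, k) - T[i][j][k]/2         # eq. (16)
GamD = lambda i, j, k: LC(i, j, k) + T[i][j][k]/2         # eq. (17)

for i in range(2):
    for j in range(2):
        for k in range(2):
            lhs = sp.diff(g[i][j], xi[k])
            rhs = Gam(k, i, j) + GamD(k, j, i)
            assert sp.simplify(lhs - rhs) == 0            # dual coupling holds
print("duality relation verified")
```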
A statistical manifold is not flat in general. When it is flat, we have a local coordinate system ξ in which the affine connection vanishes, so that geodesics are linear,
$\xi^i(t) = c^i t + d^i.$ (26)
Theorem 2. For a statistical manifold M, when it is flat with respect to one affine connection, it is
automatically flat with respect to the dual affine connection.
We now define a divergence and the related dual structure. A function $D[p(x; \xi) : p(x; \xi')]$ (or $D[\xi : \xi']$) is called a divergence when the following criteria are satisfied:
(1) $D[\xi : \xi'] \geq 0$;
(2) $D[\xi : \xi'] = 0$ when and only when $\xi = \xi'$;
(3) in the Taylor expansion $D[\xi : \xi + d\xi] = \sum g_{ij}\, d\xi^i d\xi^j$ for infinitesimally small dξ, neglecting higher-order terms, $(g_{ij})$ is a positive-definite matrix.
A divergence is said to be invariant when it is unchanged if we use a sufficient statistic t(x) instead of x. It is in general asymmetric, $D[\xi : \xi'] \neq D[\xi' : \xi]$, and is regarded as the square of an (asymmetric) distance between ξ and ξ′. Rao proposed a divergence (Burbea & Rao, 1982) and studied the related Riemannian geometry. He also used a class of divergences to elucidate the performance of M-estimators (Bai et al., 1982).
A typical example is the Kullback–Leibler (KL-) divergence,
$D_{KL}[\xi : \xi'] = \int p(x; \xi) \log \frac{p(x; \xi)}{p(x; \xi')}\, dx,$ (27)
and $g_{ij}$ is the Fisher information matrix (hence the Riemannian metric tensor). Another example is the f-divergence (Csiszár, 1974; Morimoto, 1963), defined by using a convex function f(u) satisfying f(1) = 0.
A typical such function is
$f_\alpha(u) = \frac{4}{1 - \alpha^2}\left( 1 - u^{\frac{1+\alpha}{2}} \right).$ (30)
The KL-divergence is its special case in the limit α = −1, where f(u) = −log u.
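As a numerical sketch (my own; it assumes the common convention $D_f[p : q] = \sum_x p(x) f(q(x)/p(x))$, whose statement is omitted above), the following code evaluates the α-divergence generated by (30) for discrete distributions and shows that it approaches the KL-divergence (27) as α → −1.

```python
# Sketch (not from the paper): f-divergence with the alpha-generator (30),
# assuming D_f[p:q] = sum_x p f(q/p). As alpha -> -1 it approaches the
# KL-divergence (27), matching f(u) = -log(u).
import numpy as np

def f_divergence(p, q, alpha):
    u = q / p
    f = 4.0 / (1.0 - alpha**2) * (1.0 - u**((1.0 + alpha) / 2.0))
    return np.sum(p * f)

def kl(p, q):
    return np.sum(p * np.log(p / q))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.5, 0.3])
for alpha in (-0.99, -0.999, -0.9999):
    print(alpha, f_divergence(p, q, alpha))   # -> kl(p, q)
print("KL:", kl(p, q))
```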
A divergence induces a Riemannian metric together with dually coupled affine connections
(Eguchi, 1983). The Riemannian metric is given by
where $\partial'_k = \frac{\partial}{\partial \xi'^k}$. (34)
By exchanging ξ and ξ′, we obtain
$D^*[\xi : \xi'] = D[\xi' : \xi],$ (35)
which is called the dual divergence. It gives the same Riemannian metric, and the two affine connections are interchanged. When a divergence is symmetric, it is self-dual, giving a Riemannian metric with the self-dual Levi-Civita connection.
A typical example is the Gaussian distribution
$p(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{(x - \mu)^2}{2\sigma^2} \right\},$ (37)
where the mean μ and variance σ² are parameters. It can be rewritten in the canonical form (36) by using
$\mathbf{x} = (x_1, x_2) = (x, x^2),$ (38)
$\theta^1 = \frac{\mu}{\sigma^2}, \quad \theta^2 = -\frac{1}{2\sigma^2},$ (39)
and ψ(θ) corresponds to the normalisation constant. This is the cumulant generating function, called the free energy in physics. It is a convex function, since $\partial_i \partial_j \psi(\theta)$ is positive-definite.
Moreover, calculations show that the affine connection vanishes, that is, $\Gamma_{ijk}(\theta) = 0$, in the θ coordinate system. We call θ an e-coordinate system, since it originates from the exponential family. This implies that M is flat and the affine connection vanishes in terms of θ. M is automatically dually flat. Hence, we have another coordinate system $\eta = (\eta_1, \ldots, \eta_m)$ in which the dual affine connection Γ* vanishes. It is given by the well-known expectation parameters,
$\eta_i = E[x_i],$ (44)
which are obtained from ψ as
$\eta_i = \partial_i \psi(\theta).$ (45)
The dual potential φ(η) is given by the Legendre transformation
$\varphi(\eta) = \max_{\theta}\left\{ \theta \cdot \eta - \psi(\theta) \right\},$ (46)
and the corresponding pair (θ, η) satisfies
$\psi(\theta) + \varphi(\eta) - \theta \cdot \eta = 0.$ (47)
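A small numerical sketch of my own (the closed form of ψ(θ) for the Gaussian is assumed, not quoted from the paper) illustrates (38)-(47): starting from (μ, σ²) we form θ, differentiate ψ numerically to obtain η = (E[x], E[x²]) as in (44)-(45), and check (46)-(47); here φ(η) equals the negative differential entropy.

```python
# Numerical sketch (not from the paper): Gaussian in canonical coordinates.
import numpy as np

def psi(t1, t2):
    # log-normaliser of exp{t1*x + t2*x^2}, t2 < 0 (assumed closed form):
    # psi = -t1^2/(4 t2) + (1/2) log(-pi/t2)
    return -t1**2 / (4*t2) + 0.5*np.log(-np.pi / t2)

mu, sigma2 = 1.0, 2.0
t1, t2 = mu / sigma2, -1.0 / (2*sigma2)     # eq. (39)

eps = 1e-6                                  # numerical gradient, eq. (45)
eta1 = (psi(t1+eps, t2) - psi(t1-eps, t2)) / (2*eps)
eta2 = (psi(t1, t2+eps) - psi(t1, t2-eps)) / (2*eps)
print(eta1, eta2)                           # ~ (mu, mu^2 + sigma^2) = (1, 3)

phi = t1*eta1 + t2*eta2 - psi(t1, t2)       # Legendre relation (47)
print(phi, -0.5*np.log(2*np.pi*np.e*sigma2))  # phi = -(differential entropy)
```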
$\partial^i = g^{ij} \partial_j,$ (51)
and we use the Einstein summation convention: the summation symbol Σ is omitted when the same index appears in a term once as a superscript and once as a subscript. Therefore,
$\partial^j = \sum_i g^{ij} \partial_i = g^{ij} \partial_i.$ (52)
Here, $(g^{ij})$ is the inverse of $(g_{ij})$.
The KL-divergence between two distributions $p(x; \theta)$ and $p(x; \theta')$ of an exponential family is
$D_{KL}[\theta' : \theta] = \psi(\theta) + \varphi(\eta') - \theta \cdot \eta',$
where η′ is the m-coordinates of θ′. We easily see that the dually flat geometrical structures are derived from the KL-divergence.
Theorem 3. Fundamental theorem of dually flat manifolds. When M is dually flat, the following holds:
(1) There exist two affine coordinate systems θ and η with respect to the two dually flat affine connections, and two convex functions ψ(θ) and φ(η). The dual affine connections vanish in the respective coordinate systems.
The tangent vectors
$e_i = \frac{\partial}{\partial \theta^i}, \quad e^j = \frac{\partial}{\partial \eta_j},$ (56)
are bi-orthogonal, $\langle e_i, e^j \rangle = \delta_i^j$.
(4) The cubic tensor is given by $T_{ijk} = \partial_i \partial_j \partial_k \psi(\theta)$.
The canonical divergence in a dually flat manifold of probability distributions is given by (27) and is the KL-divergence. The KL-divergence has been used in various fields such as statistics, information theory and statistical physics as a convenient measure of discrepancy between two distributions, without any explicitly shown reasons. The present theory gives a justification for the KL-divergence, because it is the canonical divergence of a dually flat manifold. There are many divergences, including one proposed by Rao (Burbea & Rao, 1982). They are not invariant, having Riemannian metrics different from the Fisher information.
We remark that there are dually flat statistical manifolds other than the exponential family. A mixture family of probability distributions is given by
$M = \left\{ p(x; \eta) = \sum_{i=0}^{m} \eta_i p_i(x) : \sum \eta_i = 1 \right\},$ (60)
where $p_0(x), \ldots, p_m(x)$ are linearly independent prescribed probability distributions. A mixture family is not an exponential family in general but is dually flat. Banerjee et al. (2005) showed that, under a certain regularity condition, a dually flat manifold M is related to an exponential family provided the inverse Laplace transform of ψ(θ) exists.
Given a convex function ψ(θ), we can introduce a dually flat geometrical structure in which θ is a flat coordinate system and its dual is given by the Legendre transform
$\eta_i = \partial_i \psi(\theta).$ (61)
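A minimal sketch of my own of this construction, often called a Bregman divergence: any convex ψ induces the canonical divergence $D[\theta : \theta'] = \psi(\theta) - \psi(\theta') - \nabla\psi(\theta') \cdot (\theta - \theta')$. Taking ψ to be the log-sum-exp free energy of a discrete exponential family, this reproduces a KL-divergence.

```python
# Sketch (my own, not code from the paper): dually flat structure from a
# convex psi via its Bregman divergence and eta = grad psi(theta), eq. (61).
import numpy as np

def bregman(psi, grad_psi, th, th2):
    return psi(th) - psi(th2) - grad_psi(th2) @ (th - th2)

# Example convex psi: log-sum-exp, the free energy of a discrete
# exponential family; eta = grad psi is the softmax probability vector.
def psi(th):  return np.log(np.sum(np.exp(th)))
def grad(th): return np.exp(th) / np.sum(np.exp(th))

th, th2 = np.array([0.2, -0.1, 0.5]), np.array([1.0, 0.0, -1.0])
p, q = grad(th), grad(th2)
print(bregman(psi, grad, th, th2))
print(np.sum(q * np.log(q / p)))    # same value: KL[q : p]
```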
The e-geodesic connecting two points Q and R is linear in the θ coordinates, with constant velocity
$\dot{\theta} = \theta_R - \theta_Q.$ (65)
Dually, the m-geodesic connecting P and Q is linear in the η coordinates,
$\dot{\eta}(t) = \eta_Q - \eta_P.$ (67)
Theorem 4. Pythagorean theorem (Figure 3). When the m-geodesic connecting P and Q is orthogonal to the e-geodesic connecting Q and R,
$D[P : R] = D[P : Q] + D[Q : R],$
where D is the canonical divergence. Dually, when the e-geodesic connecting P and Q is orthogonal to the m-geodesic connecting Q and R,
$D^*[P : R] = D^*[P : Q] + D^*[Q : R].$
The projection theorem follows directly from the Pythagorean theorem. Let S be a smooth submanifold in a dually flat manifold M. Let P be a point outside S. We search for the minimiser $\hat{P} \in S$ of D[P : Q], Q ∈ S, or dually the minimiser of $D[Q : P] = D^*[P : Q]$.
Theorem 5. Projection theorem (Figure 4). The minimiser of D[P : Q], Q ∈ S, is obtained by m-projecting P to S such that the m-geodesic connecting P and $\hat{P}$ is orthogonal to S.
Dually, the minimiser of D*[P : Q] is obtained by e-projecting P to S such that the e-geodesic connecting P and $\hat{P}^*$ is orthogonal to S. The m-projection is unique when S is e-flat, and the e-projection is unique when S is m-flat. The proof is immediate from the Pythagorean theorem.
The two theorems are useful in various applications (Amari, 2016).
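A numerical sketch of Theorem 4 (my own example, not from the paper): in the manifold of 2×2 joint distributions, the independence submanifold S is e-flat, the m-projection of P onto S is the product Q of its marginals, and for any other independent R the Pythagorean relation holds exactly with D the KL-divergence.

```python
# Numerical sketch (my own example) of the Pythagorean theorem:
# D[P:R] = D[P:Q] + D[Q:R], where Q is the m-projection of P onto the
# e-flat independence submanifold and R is any other independent point.
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

P = np.array([[0.30, 0.10],
              [0.15, 0.45]])                   # a joint distribution
Q = np.outer(P.sum(axis=1), P.sum(axis=0))     # m-projection onto S
R = np.outer([0.7, 0.3], [0.6, 0.4])           # any other point of S

print(kl(P, R))                                # D[P : R]
print(kl(P, Q) + kl(Q, R))                     # equals D[P:Q] + D[Q:R]
```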
Given n observations $D = \{x_1, \ldots, x_n\}$, the MLE in an exponential family is written as
$\hat{\eta} = \frac{1}{n} \sum x_i$ (70)
in terms of the dual parameter η. We call the point η^ ∈ M the observed point given directly from
the data D. The empirical distribution of D does not belong to M but the observed point lies in
M. The observed point converges to the true point as n tends to infinity.
We now consider statistical inference in a curved exponential family $q(x; u)$ specified by parameters $u = (u^a) = (u^1, \ldots, u^r)$, r < m. It is a submodel embedded in M, called an (m, r)-exponential family.
The MLE is the minimiser
$\hat{u} = \arg\min_{S} D_{KL}[\hat{\eta} : S].$ (72)
Because of the Pythagorean theorem, it is given by the m-projection of the observed point $\hat{\eta}$ to S. It is Fisher efficient, and the asymptotic variance of the MLE is given by
$\mathrm{Cov}[\hat{u}] = \frac{1}{n}\left( B^i_a\, g_{ij}\, B^j_b \right)^{-1},$ (73)
where the matrix B is
$B^i_a = \frac{\partial \theta^i(u)}{\partial u^a}.$ (74)
The matrix $\bar{g}_{ab} = B^i_a g_{ij} B^j_b$ is the Fisher information matrix of the curved exponential family. For the higher-order asymptotic theory of statistical testing, see Kumon & Amari (1983) and Amari & Nagaoka (2000).
Suppose that a probability distribution p is specified by a parameter of interest u, but the form of p is otherwise unknown, having infinite degrees of freedom. Here, we assume that u is a scalar, but it is easy to generalise to the vector case.
A simple example is a location model,
having two types of parameters u and ξ. Here, u is the parameter of interest and ξ is the nuisance parameter, which we do not care about. Let $D = \{x_1, \ldots, x_n\}$ be n independent observations, where $x_i$ is an observation from $q(x; u, \xi_i)$. Here, the parameter of interest u is common to all observations, but $\xi_i$ may differ each time. This model was proposed by Neyman & Scott (1948), who showed that the MLE is not necessarily consistent or efficient. Then, what is the best estimator in this situation in the asymptotic sense? The Neyman–Scott problem bothered statisticians for many years until Bickel et al. (1994) and Amari & Kawanabe (1997) presented convincing theories.
It is convenient to assume that the unknown values $\xi_i$ of the nuisance parameter are randomly generated from an unknown probability distribution k(ξ). Then, each $x_i$ is regarded as an iid random variable generated from a distribution belonging to the model
$p(x; u, k) = \int q(x; u, \xi)\, k(\xi)\, d\xi.$
SM includes the nuisance parameter k(ξ) of functional degrees of freedom and is called a semiparametric model (Begun et al., 1983).
SM has functional degrees of freedom, so we need to treat a function space of probability distributions. We do not present a rigorous geometrical theory, which has not yet been completed (Ay et al., 2017; Pistone & Sempi, 1995). Instead, we give an intuitive argument, without specifying the conditions under which our theory holds. But the theory is useful for practical applications.
Let us consider the tangent space of SM at a point (u, k). Since a small deviation w(x) of the log probability satisfies
$E[w(x)] = 0,$ (83)
we consider the space of such random variables w(x), where E is the expectation with respect to $p(x; u, k)$, as the tangent space of SM at (u, k), provided
$E\left[ \{w(x)\}^2 \right] < \infty$ (85)
is satisfied. Then the tangent vectors w(x) form a Hilbert space. The inner product of two tangent vectors $w_1(x)$ and $w_2(x)$ is given by
$\langle w_1, w_2 \rangle_{u,k} = E\left[ w_1(x)\, w_2(x) \right].$ (86)
The score direction of u spans the tangent subspace $T_u$. The nuisance tangent space $T_k$ consists of the derivatives of the log likelihood along changes of the nuisance function, where k = c(t) is a curve in SM passing through (u, k) at t = 0; it is of infinite dimensions. The above two do not cover the entire $T_{u,k}$, and it includes other vectors a(x) which are orthogonal to both $T_u$ and $T_k$. They form a subspace called the auxiliary tangent subspace $T_a$. The tangent space is decomposed into a direct sum,
$T_{u,k} = T_u \oplus T_k \oplus T_a.$ (89)
An estimating function f(x, u) is a function of x and u satisfying
$E_{u,\xi}\left[ f(x; u) \right] = 0,$ (90)
$A = E_{u,\xi}\left[ \partial_u f(x; u) \right] \neq 0,$ (91)
where $E_{u,\xi}$ represents expectation with respect to $q(x; u, \xi)$ for any ξ. Note that (90) and (91) hold for expectation with respect to any p(x; u, k), which is a linear combination of the $q(x; u, \xi)$'s.
An estimator $\hat{u}$ is obtained from
$\sum f(x_i; u) = 0,$ (92)
where the expectation in (90) is replaced by the arithmetic sum. It is easy to prove that the estimator is consistent and asymptotically Gaussian, having the variance
$V[\hat{u}] = \frac{1}{n}\, \frac{E\left[ \{f(x; u)\}^2 \right]}{A^2},$ (93)
in a similar way as we prove that the MLE is asymptotically Gaussian (Lindsay, 1982).
We need to show when an estimating function exists, and which is better when there are many. In order to characterise estimating functions, we use parallel transports of tangent vectors.
The e- and m-transports of a tangent vector w(x) from $(u, k_1)$ to $(u, k_2)$ are defined, respectively, by
$\Pi^e_{k_1 \to k_2} w(x) = w(x) - E_{u,k_2}[w(x)],$ (94)
$\Pi^m_{k_1 \to k_2} w(x) = \frac{p(x; u, k_1)}{p(x; u, k_2)}\, w(x).$ (95)
It is easy to see that the parallel transports keep the inner product invariant in the following sense:
$\langle w_1(x), w_2(x) \rangle_{u,k_1} = \left\langle \Pi^e_{k_1 \to k_2} w_1(x),\; \Pi^m_{k_1 \to k_2} w_2(x) \right\rangle_{u,k_2}.$ (96)
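A quick numerical check of (94)-(96) (my own, on a finite sample space; the two densities p1 and p2 are arbitrary assumed examples): the inner product at (u, k1) is preserved when one vector is e-transported and the other m-transported to (u, k2).

```python
# Numerical sketch (my own) of the invariance (96) on 5 atoms.
import numpy as np

rng = np.random.default_rng(1)
p1 = rng.dirichlet(np.ones(5))        # p(x; u, k1), an assumed example
p2 = rng.dirichlet(np.ones(5))        # p(x; u, k2), an assumed example

w1 = rng.normal(size=5); w1 -= p1 @ w1    # tangent vectors at k1:
w2 = rng.normal(size=5); w2 -= p1 @ w2    # E_{k1}[w] = 0, cf. (83)

e_transport = w1 - p2 @ w1                # eq. (94)
m_transport = (p1 / p2) * w2              # eq. (95)

print(p1 @ (w1 * w2))                     # <w1, w2> at k1
print(p2 @ (e_transport * m_transport))   # equal, cf. (96)
```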
An estimating function f(x, u) is orthogonal to the nuisance tangent space and is invariant under the e-parallel transport from any $(u, k_1)$ to $(u, k_2)$,
$\Pi^e_{k_1 \to k_2} f(x; u) = f(x; u).$ (99)
Let
$s(x; u, k) = \frac{d}{du} \log p(x; u, k)$ (100)
be the score vector, and let $s^I(x; u, k)$ be its projection to the subspace orthogonal to $T_k$. We call it an information score. We then have the following theorem.
This proves that an estimating function exists when the information score is non-vanishing.
The optimal estimating function is $s^I(x; u, k)$, which we do not know since k is unknown. However, this gives a good means of selecting an estimating function: even when we choose k incorrectly, it is still an estimating function. When the true nuisance function is $k_0$, the optimal estimating function is $s^I(x; u, k_0)$. An estimating function using an incorrect k is written, by using some auxiliary a(x), as a modification of the information score, and it still gives a consistent estimator. This is not true when we use the statistical model p(x; u, k) with an incorrect k for estimating u.
Consider specimens for which the weight is proportional to the volume,
$w = u\, v,$
where v is the volume and w is the weight of a specimen. We assume we have observations $(x_i, y_i)$, where $x_i$ are noisy observations of the volumes and $y_i$ are noisy observations of the weights of various specimens. They are given by
$x_i = \xi_i + \varepsilon_i,$ (104)
$y_i = u \xi_i + \varepsilon'_i,$ (105)
where $\varepsilon_i$ and $\varepsilon'_i$ are independently subject to a Gaussian distribution with mean 0 and variance σ².
The joint distribution of $\mathbf{x} = (x, y)$ is written as
$q(x, y; u, \xi) = \frac{1}{2\pi\sigma^2} \exp\left\{ -\frac{1}{2\sigma^2} \left[ (x - \xi)^2 + (y - u\xi)^2 \right] \right\}.$ (106)
We rewrite (106) in the form
$q(x, y; u, \xi) = \exp\left\{ \xi\, s(x, y; u) + r(x, y) - \psi(\xi, u) \right\},$ (107)
where we put s = s(x, y; u), the statistic conjugate to ξ, and
$r(x, y) = -\frac{1}{2}\left( x^2 + y^2 \right).$ (109)
Since ξ is a random variable subject to k(ξ) in SM, we can study the conditional distribution of ξ given s in (106). It is given by
$p(\xi \mid s) = \frac{k(\xi) \exp\left\{ \xi s + r - \psi \right\}}{p(x; u, k)}.$ (110)
We have
where h is a function depending on k. Using this, we have an explicit form of the nuisance
tangent space Tk. It is spanned by s irrespective of k,
Hence, it is invariant under the parallel transport from k1 to k2. We further calculate the
information score, obtaining
where h(z) is an arbitrary function. The best function h depends on the unknown k.
There are a number of estimators for this specific problem. We show some of them in the following:
1) The MLE: this is the estimator maximising the log likelihood of all observations with respect to all the unknown parameters u and $\xi_1, \ldots, \xi_n$. It minimises the sum of squares of the lengths of the perpendiculars projecting the observed $(x_i, y_i)$ to the regression line
$y = u x.$ (117)
2) The least squares solution, obtained by the ordinary regression of y on x.
3) The ratio of the total weight to the total volume, simply given by
$\hat{u} = \frac{\sum y_i}{\sum x_i}.$ (119)
The least squares solution is not good in this case, since it does not even give a consistent estimator. Both the MLE and the ratio estimator are consistent. Which is better? It depends on the distribution k(ξ). Roughly speaking, when the average of k(ξ) is much larger than its standard deviation, the MLE is better, but when it is smaller, the ratio estimator is better.
Amari & Kawanabe (1997) proposed an estimator whose estimating function includes a parameter c. When c = 0, it gives the MLE; when c → ∞, it gives the ratio estimator. They proposed a method of determining c from the observed data.
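A simulation sketch of my own (the total-least-squares form of the MLE follows the description below (117); the parameter values are arbitrary assumptions) compares the mean-squared errors of the MLE and the ratio estimator (119) under two choices of k(ξ), without presupposing which wins:

```python
# Simulation sketch (my own) for the model (104)-(105): compare the MLE
# (orthogonal regression through the origin) with the ratio estimator (119).
import numpy as np

def mle(x, y):
    # slope of the total-least-squares line y = u x through the origin;
    # root of Sxy*u^2 + (Sxx - Syy)*u - Sxy = 0 (assumes sum(x*y) > 0)
    a = np.sum(y**2 - x**2)
    b = np.sum(x * y)
    return (a + np.hypot(a, 2*b)) / (2*b)

def ratio(x, y):
    return np.sum(y) / np.sum(x)

rng = np.random.default_rng(0)
u, sigma, n, trials = 2.0, 0.5, 200, 2000       # arbitrary example values
for xi_mean, xi_sd in [(5.0, 0.5), (1.0, 2.0)]:  # two choices of k(xi)
    err_m, err_r = [], []
    for _ in range(trials):
        xi = rng.normal(xi_mean, xi_sd, n)
        x = xi + rng.normal(0, sigma, n)
        y = u*xi + rng.normal(0, sigma, n)
        err_m.append((mle(x, y) - u)**2)
        err_r.append((ratio(x, y) - u)**2)
    print(xi_mean, xi_sd, np.mean(err_m), np.mean(err_r))
```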
where $x = (x^{(1)}, \ldots, x^{(n)})$, $\theta = (\theta^{(1)}, \ldots, \theta^{(n)})$. The corresponding η-coordinates are $\eta = (\eta^{(1)}, \ldots, \eta^{(n)})$. Let $S_k$ be a submodel defined by
$S_k = \left\{ p(x; \theta) \mid \theta^{(k+1)} = \cdots = \theta^{(n)} = 0 \right\}.$ (122)
Since the coordinate $\theta^{(i)}$ is orthogonal to $\eta^{(j)}$ (j ≠ i), or more precisely, the tangent vector $e_i$ along the coordinate $\theta^{(i)}$ is orthogonal to the tangent vector $e^j$ along the coordinate $\eta^{(j)}$ (j ≠ i), the parameters $\Theta_{k+1}$ are orthogonal to the parameters $H_k$.
It is convenient to define a new coordinate system, $\Xi = (H_k, \Theta_{k+1})$, which consists of the lower part of the η coordinates and the higher part of the θ coordinates. We call it a mixed coordinate system (Amari, 2001). Let us consider a submanifold specified by
$H_k = \left( c^{(1)}, \ldots, c^{(k)} \right)$ (123)
for fixed vectors $c^{(1)}, \ldots, c^{(k)}$. Complementarily, we consider a submanifold specified by $\Theta_{k+1} = \left( d^{(k+1)}, \ldots, d^{(n)} \right)$ for fixed vectors $d^{(k+1)}, \ldots, d^{(n)}$. These two submanifolds are orthogonal to each other for any c and d. Note that two submanifolds $H_k$ and $H_{k+1}$, or $\Theta_k$ and $\Theta_{k+1}$, are not orthogonal.
We consider a simple binary model for $x = (x_1, \ldots, x_n)$, each $x_i$ taking the value 0 or 1. Given an observed contingency table $T(x_1, x_2, \ldots, x_n)$, we are interested in knowing how the $x_i$ are mutually interrelated in the probability distribution $p(x_1, \ldots, x_n)$. There are pairwise interactions, three-way interactions and further higher-order interactions among the variables $x_1, \ldots, x_n$. How can we quantify these interactions? The mixed coordinates and the k-cuts are useful for defining the degrees of interactions among random variables. We show that the interaction terms can be decomposed orthogonally into pairwise, triplewise and higher-order ones.
We use a neural network of binary neurons as a typical example for intuitive explanation. The network consists of n connected neurons interacting with one another. Each neuron takes two states, excited or quiescent. Let $x_1, \ldots, x_n$ be n binary random variables, where $x_i = 1$ represents that the i-th neuron is excited and $x_i = 0$ that it is quiescent. The current state $x = (x_1, \ldots, x_n)$ is regarded as a vector random variable, and its joint probability distribution $p(x_1, \ldots, x_n)$ is written in the form of an exponential family,
$p(x; \theta) = \exp\left\{ \sum \theta_i x_i + \sum \theta_{ij} x_i x_j + \cdots + \theta_{12\cdots n} x_1 \cdots x_n - \psi(\theta) \right\}.$ (124)
Its e-coordinates are organised as $\theta = (\Theta_1, \Theta_2, \ldots, \Theta_n)$, where the subvectors $\Theta_k = \left( \theta_{i_1 \cdots i_k} \right)$, $i_1 < \cdots < i_k$, summarise the k-th order terms. The corresponding m-coordinates are $H_k = \left( \eta_{i_1 \cdots i_k} \right)$, where $\eta_{i_1 \cdots i_k} = E[x_{i_1} \cdots x_{i_k}]$. η is decomposed as
$\eta = (H_1, \ldots, H_n).$ (125)
The subvector $H_k$ represents the probabilities of k neurons jointly firing, that is, the probabilities of $x_{i_1} x_{i_2} \cdots x_{i_k} = 1$. It is important to see that the coordinate axes of $H_k$ are orthogonal to those of $\Theta_{k'}$ when k′ > k. A submodel
$S_k = \left\{ \theta \mid \Theta_{k+1} = \cdots = \Theta_n = 0 \right\}$ (126)
consists of the distributions having interactions of order up to k.
When $\theta_{12} = 0$, two neurons fire independently, $p(x) = p_1(x_1) p_2(x_2)$, and there is no mutual interaction between the two. When the two neurons are not independent, there is mutual interaction, and we want to know its degree. The covariance $E[x_1 x_2] - E[x_1] E[x_2]$ is 0 when the two are independent, so it shows a degree of mutual interaction. However, there are many such quantities that vanish when the two neurons fire independently. The increase in log likelihood due to an increase in firing rates is correlated with the increase in log likelihood due to the covariance, so the firing rates and the covariance are not separated. Geometrically speaking, we want a quantity such that the direction due to an increase in interaction is orthogonal to the directions due to increases in firing rates. Such a quantity represents the degree of interaction that does not change even when the firing rates change. The covariance is not such a quantity.
We know that the e-coordinate $\theta_{12}$ is orthogonal to the firing rates $\eta_1$ and $\eta_2$, so it represents the interaction orthogonal to the firing rates. As is seen in Figure 5, $S_1$, defined by $\theta_{12} = 0$, is an e-flat submanifold consisting of all the independent distributions.
Given a general distribution, we m-project it to $S_1$. Then, we have a distribution in $S_1$ which has the same firing rates but no interaction. The submanifold $M_1$, defined by $\eta_1 = c_1$, $\eta_2 = c_2$ for constants $c_1$ and $c_2$, is m-flat, consisting of the distributions having the same firing rates $c_1$ and $c_2$ but different degrees of interaction $\theta_{12}$. They are orthogonal, intersecting at $\eta_i = c_i$, $\theta_{12} = 0$.
Let $p_0$ be the distribution specified by θ = 0, that is, $\eta_1 = \eta_2 = 1/2$ and $\theta_{12} = 0$. The KL-divergence from p to $p_0$ is decomposed as
$D_{KL}[p : p_0] = D_{KL}[\hat{p} : p_0] + D_{KL}[p : \hat{p}],$ (129)
where $\hat{p}$ is the m-projection of p to $S_1$. The first term shows the effect of deviations of the firing rates from 1/2, that is, the KL-divergence from $\hat{p}$ to $p_0$. The second represents the effect of interaction, shown by the KL-divergence from p to the independent model $\hat{p}$.
We next consider the case of n = 3 neurons. It has a hierarchical structure $S_1 \subset S_2 \subset S_3$, where $S_1$ is the independent model and the $\eta_i$ show the firing rates of the neurons. Since the directions of $\theta_{ij}$ and $\theta_{123}$ are orthogonal to the directions of the $\eta_i$, they together represent degrees of neural interactions orthogonal to the firing rates. Further, consider the submanifold specified by $M_2 = \left\{ \eta \mid \eta_i = c_i,\; \eta_{ij} = c_{ij} \right\}$ for fixed $c_i$ and $c_{ij}$. The pair $(\eta_i, \eta_{ij})$ represents both the firing rates and the pairwise joint firing rates, but does not specify the triple firing rate $\eta_{123} = E[x_1 x_2 x_3]$. The quantity that is orthogonal to both the firing rates and the pairwise firing rates is given by $\theta_{123}$. So,
$\theta_{123} = \log \frac{p_{111}\, p_{100}\, p_{010}\, p_{001}}{p_{110}\, p_{101}\, p_{011}\, p_{000}}$ (130)
represents the triplewise interaction orthogonal to the firing and pairwise firing rates.
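The log-odds formula (130) is directly computable from a 2×2×2 probability table; the sketch below (my own, with arbitrary example tables) checks that $\theta_{123}$ vanishes for an independent distribution and is non-zero when a genuine triple interaction is present.

```python
# Sketch (my own): theta_123 of (130) from a 2x2x2 table p[x1, x2, x3].
import numpy as np

def theta_123(p):
    num = p[1,1,1] * p[1,0,0] * p[0,1,0] * p[0,0,1]
    den = p[1,1,0] * p[1,0,1] * p[0,1,1] * p[0,0,0]
    return np.log(num / den)

# independent distribution: theta_123 = 0
r = np.array([0.3, 0.6, 0.5])                 # example firing rates
p_ind = np.einsum('i,j,k->ijk',
                  [1-r[0], r[0]], [1-r[1], r[1]], [1-r[2], r[2]])
print(theta_123(p_ind))      # ~ 0.0

# a distribution with a genuine triple interaction
p = np.ones((2, 2, 2)) / 8
p[1,1,1] += 0.05; p[0,0,0] -= 0.05
print(theta_123(p))          # > 0
```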
A distribution p is m-projected to $S_2$ and $S_1$, and the KL-divergence is decomposed as
$D_{KL}[p : p_0] = D_{KL}[p : \hat{p}_2] + D_{KL}[\hat{p}_2 : \hat{p}_1] + D_{KL}[\hat{p}_1 : \hat{p}_0],$ (131)
where $\hat{p}_n = p$, $\hat{p}_k$ is the projection of p to $S_k$ and $\hat{p}_0 = p_0$. This shows that $D_{KL}[\hat{p}_{k+1} : \hat{p}_k]$ represents the effect of the interactions of order k + 1. We have shown only binary models, but the theory generalises to any hierarchical dually flat model.
6 Conclusions
Since Rao's proposal three quarters of a century ago, information geometry, the geometrical theory of the manifold of probability distributions, has been developed widely, giving useful tools to various fields related to probability. This paper has reviewed only some of the interesting structures to which information geometry makes remarkable contributions, such as dually flat structures, and discussed some applications.
There are some new developments in this area. Wong (2018) gives a new mathematical idea,
which generalises the Legendre transformation, and presents a new theory applicable to
projectively flat manifolds. The Pythagorean and projection theorems hold, where the Rényi
divergence plays the role of the canonical divergence in a dually projectively flat manifold.
The Wasserstein distance gives another type of divergence in a manifold of probability distributions and has broad applicability. The distance reflects the metric structure of the base space on which the distributions are defined. However, it is not invariant. It is interesting to find a theory connecting the two geometries (Amari et al., 2019). Li & Zhou (2019) proposed a fundamental theory unifying the two geometries. See Amari & Matsuda (2021) for more details on Wasserstein statistics. Amari (2016) discusses various applications.
References
Amari, S. 1982. Differential geometry of curved exponential families—curvature and information loss. Ann. Stat., 10,
357–385.
Amari, S. 1985. Differential-Geometrical Methods in Statistics, Lecture Notes in Statistics, Vol. 28. Springer.
Amari, S. 2001. Information geometry on hierarchy of probability distributions. IEEE Trans. Inf. Theory, 47,
1701–1711.
Amari, S. 2016. Information Geometry and Its Applications. Springer.
Amari, S., Karakida, R., Oizumi, M. & Cuturi, M. 2019. Information geometry for regularized optimal transport and
barycenters of patterns. Neural Comput., 31, 827–848.
Amari, S. & Kawanabe, M. 1997. Information geometry of estimating functions in semi-parametric statistical models.
Bernoulli, 3, 29–54.
Amari, S. & Matsuda, T. 2021. Wasserstein statistics in 1D location-scale model. Annals of the Institute of Statistical Mathematics.
Amari, S. & Nagaoka, H. 2000. Methods of information geometry. American Mathematical Society and Oxford
University Press.
Ay, N., Jost, J., Lê, H.V. & Schwachhöfer, L. 2017. Information Geometry. Springer.
Bai, Z.D., Rao, C.R. & Wu, Y. 1982. M-estimation of multivariate linear regression parameters under a convex
discrepancy function. Stat. Sin., 2, 237–264.
Banerjee, A., Merugu, S., Dhillon, I. & Ghosh, J. 2005. Clustering with Bregman divergences. J. Mach. Learn. Res.,
6, 1705–1749.
Barndorff-Nielsen, O.E. 1978. Information and Exponential Families in Statistical Theory. Wiley.
Begun, J.M., Hall, W.J., Huang, W.M. & Wellner, J.A. 1983. Information and asymptotic efficiency in
parametric-nonparametric models. Ann. Stat., 11, 432–452.
Bickel, P.J., Klaassen, C.A.J., Ritov, Y. & Wellner, J.A. 1994. Efficient and Adaptive Estimation for Semiparametric Models. Johns Hopkins University Press.
Burbea, J. & Rao, C.R. 1982. Entropy differential metric, distance and divergence measures in probability spaces: a
unified approach. J. Multivar. Anal., 12, 575–596.
Chentsov, N.N. 1982. Statistical Decision Rules and Optimal Inference. AMS (Russian original: Nauka, 1972).
Cramér, H. 1946. Mathematical Methods of Statistics. Princeton University Press.
Csiszár, I. 1974. Information measures: a critical survey, pp. 83–86, Proc. 7th Conf. Inf. Theory, Prague, Czech
Republic.
Dawid, A.P. 1975. Discussion of Efron (1975).
Efron, B. 1975. Defining the curvature of a statistical problem (with application to second order efficiency). Ann. Stat.,
3, 1189–1242.
Eguchi, S. 1983. Second order efficiency of minimum contrast estimators in a curved exponential family. Ann. Stat.,
11, 793–803.
Godambe, V.P. 1991. Estimating Functions. Oxford University Press.
Kumon, M. & Amari, S. 1983. Geometrical theory of higher-order asymptotics of test, interval estimator and conditional inference. Proc. Royal Society of London, A, 387, 429–458.
Li, W. & Zhou, W. 2019. Wasserstein information matrix. arXiv.
Lindsay, B. 1982. Conditional score functions: some optimality results. Biometrika, 69, 503–512.
Morimoto, T. 1963. Markov processes and the H-theorem. J. Phys. Soc Jap., 12, 328–331.
Nagaoka, H. & Amari, S. 1982. Differential geometry of smooth families of probability distributions. METR 82-7, University of Tokyo.
Neyman, J. & Scott, E.L. 1948. Consistent estimates based on partially consistent observations. Econometrica, 16, 1–32.
Pistone, G. & Sempi, C. 1995. An infinite-dimensional geometric structure on the space of all the probability
measures equivalent to a given one. Ann. Stat., 23, 1543–1561.
Rao, C.R. 1945. Information and accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math.
Soc., 37, 81–91.
Rao, C.R. 1962. Efficient estimators and optimum inference in large samples. J. R. Stat. Soc. B., 24, 46–72.
Rao, C.R., Sinha, B.K. & Subramanyam, K. 1982. Third order efficiency of the maximum likelihood estimator in the
multinomial distributions. Stat. Decisions, 1, 1–16.
Wong, T.-K.L. 2018. Logarithmic divergences from optimal transport and Rényi geometry. Inf. Geom., 1, 39–78.