Introduction to Information Geometry

Based on the book “Methods of Information Geometry” by Shun-ichi Amari and Hiroshi Nagaoka

Yunshu Liu

2012-02-17
Introduction to differential geometry Geometric structure of statistical models and statistical inference

Outline

1 Introduction to differential geometry


Manifold and Submanifold
Tangent vector, Tangent space and Vector field
Riemannian metric and Affine connection
Flatness and autoparallel

2 Geometric structure of statistical models and statistical inference


The Fisher metric and α-connection
Exponential family
Divergence and Geometric statistical inference

Yunshu Liu (ASPITRG) Introduction to Information Geometry 2 / 79


Part I

Introduction to differential geometry

Basic concepts in differential geometry

Basic concepts
Manifold and Submanifold
Tangent vector, Tangent space and Vector field
Riemannian metric and Affine connection
Flatness and autoparallel

Manifold

Manifold S
Manifold: a set with a coordinate system, i.e. a one-to-one mapping from S to $\mathbb{R}^n$; S is supposed to "locally" look like an open subset of $\mathbb{R}^n$.
Elements of the set (points): points in $\mathbb{R}^n$, probability distributions, linear systems.

Figure : A coordinate system ξ for a manifold S

Manifold

Manifold S
Definition: Let S be a set. If there exists a set of coordinate systems $\mathcal{A}$ for S which satisfies conditions (1) and (2) below, we call S an n-dimensional $C^\infty$ differentiable manifold.
(1) Each element $\varphi$ of $\mathcal{A}$ is a one-to-one mapping from S to some open subset of $\mathbb{R}^n$.
(2) For all $\varphi \in \mathcal{A}$, given any one-to-one mapping $\psi$ from S to $\mathbb{R}^n$, the following holds:
\[ \psi \in \mathcal{A} \iff \psi \circ \varphi^{-1} \text{ is a } C^\infty \text{ diffeomorphism.} \]
Here, by a $C^\infty$ diffeomorphism we mean that $\psi \circ \varphi^{-1}$ and its inverse $\varphi \circ \psi^{-1}$ are both $C^\infty$ (infinitely many times differentiable).

Examples of Manifold

Examples of one-dimensional manifold


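A straight line: a one-dimensional manifold, even if it is embedded in $\mathbb{R}^k$ for any $k \geq 2$.
Any open subset of a straight line: a one-dimensional manifold.
A closed interval of a straight line: not a manifold in the sense above (an endpoint has no neighborhood that looks like an open interval).
A circle: a one-dimensional manifold, since locally the circle looks like a line.
Any open subset of a circle: a one-dimensional manifold.
A closed arc of a circle: not a manifold, for the same reason as the closed interval.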

Examples of Manifold: surface of a sphere

Surface of a sphere in $\mathbb{R}^3$, defined by $S = \{(x, y, z) \in \mathbb{R}^3 \mid x^2 + y^2 + z^2 = 1\}$. Locally it can be parameterized by two coordinates; for example, we can use latitude and longitude as the coordinates.
The unit sphere in $\mathbb{R}^n$ (the (n-1)-sphere): $S = \{(x_1, x_2, \ldots, x_n) \in \mathbb{R}^n \mid x_1^2 + x_2^2 + \cdots + x_n^2 = 1\}$.

Examples of Manifold: surface of a torus

The torus in $\mathbb{R}^3$ (surface of a doughnut):

\[ x(u, v) = ((a + b\cos u)\cos v,\ (a + b\cos u)\sin v,\ b\sin u), \quad 0 \leq u, v < 2\pi, \]
where a is the distance from the center of the tube to the center of the torus, and b is the radius of the tube. A torus is a closed surface defined as the product of two circles: $T^2 = S^1 \times S^1$.
n-torus: $T^n$ is defined as a product of n circles: $T^n = S^1 \times S^1 \times \cdots \times S^1$.
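
As a quick sanity check of the torus parametrization, the sketch below samples x(u, v) on a grid and tests it against the implicit equation $(\sqrt{x^2 + y^2} - a)^2 + z^2 = b^2$. The implicit form, the helper names and the values a = 2, b = 0.5 are illustrative assumptions (with a > b > 0), not part of the slides.

```python
import math

def torus_point(u, v, a=2.0, b=0.5):
    """Point x(u, v) on the torus with center-to-tube distance a and tube radius b."""
    x = (a + b * math.cos(u)) * math.cos(v)
    y = (a + b * math.cos(u)) * math.sin(v)
    z = b * math.sin(u)
    return x, y, z

def implicit_residual(point, a=2.0, b=0.5):
    """Residual of (sqrt(x^2 + y^2) - a)^2 + z^2 - b^2; it vanishes on the torus."""
    x, y, z = point
    return (math.hypot(x, y) - a) ** 2 + z ** 2 - b ** 2

# Sample a grid of (u, v) values in [0, 2*pi) and record the largest residual.
steps = 12
max_residual = max(
    abs(implicit_residual(torus_point(2 * math.pi * i / steps, 2 * math.pi * j / steps)))
    for i in range(steps)
    for j in range(steps)
)
print(max_residual)  # should be at the level of floating-point rounding
```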

Coordinate systems for manifold

Parametrization of unit hemisphere


Parametrization: map of unit hemisphere into $\mathbb{R}^2$
(1) by latitude and longitude:

\[ x(\varphi, \theta) = (\sin\varphi\cos\theta,\ \sin\varphi\sin\theta,\ \cos\varphi), \quad 0 < \varphi < \pi/2,\ 0 \leq \theta < 2\pi \tag{1} \]

Coordinate systems for manifold

Parametrization of unit hemisphere


Parametrization: map of unit hemisphere into $\mathbb{R}^2$
(2) by stereographic projection:

\[ x(u, v) = \left( \frac{2u}{1 + u^2 + v^2},\ \frac{2v}{1 + u^2 + v^2},\ \frac{1 - u^2 - v^2}{1 + u^2 + v^2} \right), \quad u^2 + v^2 \leq 1 \tag{2} \]
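
The two coordinate systems for the hemisphere (latitude/longitude and stereographic projection) can be compared numerically. The following is a minimal sketch that evaluates both maps and checks that the image points have unit norm and non-negative height; the function names and sample arguments are illustrative choices.

```python
import math

def hemisphere_latlong(phi, theta):
    """Equation (1): phi in (0, pi/2), theta in [0, 2*pi)."""
    return (math.sin(phi) * math.cos(theta),
            math.sin(phi) * math.sin(theta),
            math.cos(phi))

def hemisphere_stereographic(u, v):
    """Equation (2): (u, v) in the closed unit disc."""
    d = 1.0 + u * u + v * v
    return (2 * u / d, 2 * v / d, (1 - u * u - v * v) / d)

def squared_norm(p):
    """Squared Euclidean norm of a point in R^3."""
    return sum(c * c for c in p)

p1 = hemisphere_latlong(0.7, 2.0)
p2 = hemisphere_stereographic(0.3, -0.4)
print(squared_norm(p1), squared_norm(p2))  # both should be 1 up to rounding
```

Both maps describe the same set of points with different coordinates, which is exactly the situation the definition of a manifold allows for.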

Examples of Manifold: colors

Parametrization of color models


Color models: RGB, CMYK, Lab, HSV and so on. RGB, Lab and HSV use three channels; CMYK uses four.
The RGB color model: an additive color model in which red, green, and blue light are added together in various ways to reproduce a broad array of colors.
The Lab color model: the three coordinates of Lab represent the lightness of the color (L), its position between red and green (a) and its position between yellow and blue (b).

Coordinate systems for manifold

Parametrization of color models


Parametrization: map of color into $\mathbb{R}^3$
Examples:

Submanifolds

Submanifolds
Definition: a submanifold M of a manifold S is a subset of S which itself has the structure of a manifold.
An open subset of an n-dimensional manifold forms an n-dimensional submanifold.
One way to construct an m-dimensional submanifold (m < n): fix n - m coordinates.

Examples:

Submanifolds

Examples: color models


3-dimensional submanifold: any open subset;
2-dimensional submanifold: fix one coordinate;
1-dimensional submanifold: fix two coordinates.
Note: In the Lab color model, if we set a and b to 0 and vary L from 0 to 100 (from black to white), we get a 1-dimensional submanifold.

Curves and Tangent vector of Curves

Curves
Curve $\gamma: I \to S$ from some interval $I \subset \mathbb{R}$ to S.
Examples: a curve on a sphere, in a set of probability distributions, or in a set of linear systems.
Using a coordinate system $\{\xi^i\}$ to express the point $\gamma(t)$ on the curve (where $t \in I$): $\gamma^i(t) = \xi^i(\gamma(t))$, so we get $\bar{\gamma}(t) = [\gamma^1(t), \cdots, \gamma^n(t)]$.

$C^\infty$ Curves
$C^\infty$: infinitely many times differentiable (sufficiently smooth).
If $\bar{\gamma}(t)$ is $C^\infty$ for $t \in I$, we call $\gamma$ a $C^\infty$ curve on the manifold S.

Tangent vector of Curves

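A tangent vector is a vector that is tangent to a curve or surface at a given point.
When S is an open subset of $\mathbb{R}^n$, the range of $\gamma$ is contained within a single linear space, hence we consider the standard derivative:

\[ \dot{\gamma}(a) = \lim_{h \to 0} \frac{\gamma(a + h) - \gamma(a)}{h} \tag{3} \]

In general, however, this is not true (e.g. the range of $\gamma$ in a color model). Thus we use a more general "derivative" instead:

\[ \dot{\gamma}(a) = \sum_{i=1}^{n} \dot{\gamma}^i(a) \left( \frac{\partial}{\partial \xi^i} \right)_p \tag{4} \]

where $p = \gamma(a)$, $\gamma^i(t) = \xi^i \circ \gamma(t)$, $\dot{\gamma}^i(a) = \frac{d}{dt}\gamma^i(t)\big|_{t=a}$, and $\left(\frac{\partial}{\partial \xi^i}\right)_p$ is an operator which maps $f \mapsto \left(\frac{\partial f}{\partial \xi^i}\right)_p$ for a given function $f: S \to \mathbb{R}$.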
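
Read as an operator, the tangent vector of equation (4) sends a function f to the derivative of f along the curve. The sketch below checks this numerically on the sphere in coordinates (phi, theta); the curve, the function f and the finite-difference step are hypothetical choices made only for illustration.

```python
import math

def curve(t):
    """Hypothetical curve on the unit sphere, given in coordinates (phi, theta)."""
    return (0.5 + 0.3 * t, 2.0 * t)

def f(phi, theta):
    """Hypothetical smooth function on the sphere, written in coordinates."""
    return math.cos(phi) + 0.1 * math.sin(theta)

def derivative(g, t, h=1e-6):
    """Central finite difference of a scalar function g at t."""
    return (g(t + h) - g(t - h)) / (2 * h)

def tangent_applied_to_f(a):
    """Equation (4) applied to f: sum_i gamma_dot^i(a) * (df / d xi^i) at p = gamma(a)."""
    phi, theta = curve(a)
    gamma_dot = [derivative(lambda t: curve(t)[i], a) for i in range(2)]
    df_dphi = derivative(lambda s: f(s, theta), phi)
    df_dtheta = derivative(lambda s: f(phi, s), theta)
    return gamma_dot[0] * df_dphi + gamma_dot[1] * df_dtheta

a = 0.4
direct = derivative(lambda t: f(*curve(t)), a)   # d/dt f(gamma(t)) at t = a
via_basis = tangent_applied_to_f(a)
print(direct, via_basis)  # the two values should agree up to discretization error
```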

Tangent space

Tangent space
Tangent space at p: a hyperplane $T_p$ containing all the tangents of curves passing through the point $p \in S$ ($\dim T_p(S) = \dim S$).

\[ T_p(S) = \left\{ \sum_{i=1}^{n} c^i \left( \frac{\partial}{\partial \xi^i} \right)_p \;\middle|\; [c^1, \cdots, c^n] \in \mathbb{R}^n \right\} \]

Examples: hemisphere and color

Vector fields

Vector fields
Vector field: a map from each point in a manifold S to a tangent vector at that point.
Consider a coordinate system $\{\xi^i\}$ for an n-dimensional manifold; clearly $\partial_i = \frac{\partial}{\partial \xi^i}$ are vector fields for $i = 1, \cdots, n$.

Riemannian Metrics

Riemannian metric: an inner product $\langle D, D' \rangle_p \in \mathbb{R}$ of two tangent vectors $D, D' \in T_p(S)$, for which the following conditions hold:

Linearity: $\langle aD + bD', D'' \rangle_p = a\langle D, D'' \rangle_p + b\langle D', D'' \rangle_p$
Symmetry: $\langle D, D' \rangle_p = \langle D', D \rangle_p$
Positive-definiteness: if $D \neq 0$ then $\langle D, D \rangle_p > 0$

The components $\{g_{ij}\}$ of a Riemannian metric g w.r.t. the coordinate system $\{\xi^i\}$ are defined by $g_{ij} = \langle \partial_i, \partial_j \rangle$, where $\partial_i = \frac{\partial}{\partial \xi^i}$.

Riemannian Metrics

Examples of inner product:


For $X = (x_1, \cdots, x_n)$ and $Y = (y_1, \cdots, y_n)$, we can define the inner product as $\langle X, Y \rangle_1 = X \cdot Y = \sum_{i=1}^{n} x_i y_i$, or $\langle X, Y \rangle_2 = Y M X^{T}$ (with X and Y as row vectors), where M is any symmetric positive-definite matrix.
For random variables X and Y, the expected value of their product: $\langle X, Y \rangle = E(XY)$.
For square real matrices, $\langle A, B \rangle = \operatorname{tr}(AB^{T})$.

Riemannian Metrics

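For the unit sphere:

\[ x(\varphi, \theta) = (\sin\varphi\cos\theta,\ \sin\varphi\sin\theta,\ \cos\varphi), \quad 0 < \varphi < \pi,\ 0 \leq \theta < 2\pi \tag{5} \]

we have:

\[ \partial_\varphi = (\cos\varphi\cos\theta,\ \cos\varphi\sin\theta,\ -\sin\varphi) \]
\[ \partial_\theta = (-\sin\varphi\sin\theta,\ \sin\varphi\cos\theta,\ 0) \]
\[ g_{11} = \langle \partial_\varphi, \partial_\varphi \rangle, \quad g_{22} = \langle \partial_\theta, \partial_\theta \rangle \]
\[ g_{12} = g_{21} = \langle \partial_\varphi, \partial_\theta \rangle = \langle \partial_\theta, \partial_\varphi \rangle \]

If we define $\langle X, Y \rangle = Y M X^{T} = 2\sum_{i=1}^{n} x_i y_i$, where $M = 2I$ is twice the identity matrix (here $3 \times 3$, since the tangent vectors above have three components), then

\[ (g_{ij}) = \begin{pmatrix} g_{11} & g_{12} \\ g_{21} & g_{22} \end{pmatrix} = \begin{pmatrix} 2 & 0 \\ 0 & 2\sin^2\varphi \end{pmatrix} \tag{6} \]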
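
The metric components for the unit sphere can be reproduced by evaluating the inner products of the two basis tangent vectors directly. This is a minimal sketch; the function names and the sample point are illustrative, and the factor 2 in the inner product follows the choice made on the slide.

```python
import math

def d_phi(phi, theta):
    """Basis tangent vector: derivative of x(phi, theta) with respect to phi."""
    return (math.cos(phi) * math.cos(theta), math.cos(phi) * math.sin(theta), -math.sin(phi))

def d_theta(phi, theta):
    """Basis tangent vector: derivative of x(phi, theta) with respect to theta."""
    return (-math.sin(phi) * math.sin(theta), math.sin(phi) * math.cos(theta), 0.0)

def inner(x, y, scale=2.0):
    """Inner product <X, Y> = scale * sum_i x_i y_i (scale = 2 as on the slide)."""
    return scale * sum(a * b for a, b in zip(x, y))

def metric(phi, theta):
    """2 x 2 matrix of components g_ij = <d_i, d_j> in the coordinates (phi, theta)."""
    basis = [d_phi(phi, theta), d_theta(phi, theta)]
    return [[inner(bi, bj) for bj in basis] for bi in basis]

g = metric(0.9, 1.7)
print(g)  # expected to match [[2, 0], [0, 2 sin^2(phi)]] up to rounding
```

Passing `scale=1.0` to `inner` would give the metric induced by the ordinary dot product instead.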

Affine connection

Parallel translation along curves


Let $\gamma: [a, b] \to S$ be a curve in S, and let X(t) be a vector field along $\gamma$ which maps each point $\gamma(t)$ to a tangent vector. Suppose that, for all $t \in [a, b]$ and the corresponding infinitesimal dt, the tangent vectors at $p = \gamma(t)$ and $p' = \gamma(t + dt)$ are linearly related; that is, there exists a linear mapping $\Pi_{p,p'}$ such that $X(t + dt) = \Pi_{p,p'}(X(t))$ for $t \in [a, b]$. Then we say X is parallel along $\gamma$, and call $\Pi_\gamma$ the parallel translation along $\gamma$.
Linear mapping: a mapping that preserves addition and scalar multiplication.

Figure : Translation of a tangent vector along a curve


Affine connection

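Affine connection: relationships between tangent spaces at different points.

Recall:

Natural basis of the coordinate system $[\xi^i]$: $(\partial_i)_p = \left(\frac{\partial}{\partial \xi^i}\right)_p$, an operator which maps $f \mapsto \left(\frac{\partial f}{\partial \xi^i}\right)_p$ for a given function $f: S \to \mathbb{R}$ at p.

Tangent space:

\[ T_p(S) = \left\{ \sum_{i=1}^{n} c^i \left( \frac{\partial}{\partial \xi^i} \right)_p \;\middle|\; [c^1, \cdots, c^n] \in \mathbb{R}^n \right\} \]

Tangent vectors (elements of the tangent space) can be represented as linear combinations of $\partial_i$.

Tangent space $T_p$ → Tangent vector $X_p$ → Natural basis $(\partial_i)_p = \left(\frac{\partial}{\partial \xi^i}\right)_p$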

Affine connection

If the difference between the coordinates of p and p' is so small that we can ignore the second-order infinitesimals $(d\xi^i)(d\xi^j)$, where $d\xi^i = \xi^i(p') - \xi^i(p)$, then we can express the difference between $\Pi_{p,p'}((\partial_j)_p)$ and $(\partial_j)_{p'}$ as a linear combination of $\{d\xi^1, \cdots, d\xi^n\}$:

\[ \Pi_{p,p'}((\partial_j)_p) = (\partial_j)_{p'} - \sum_{i,k} d\xi^i\, (\Gamma_{ij}^k)_p\, (\partial_k)_{p'} \tag{7} \]

where $\{(\Gamma_{ij}^k)_p;\ i, j, k = 1, \cdots, n\}$ are $n^3$ numbers which depend on the point p.

From $X(t) = \sum_{i=1}^{n} X^i(t)(\partial_i)_p$ and $X(t + dt) = \sum_{i=1}^{n} X^i(t + dt)(\partial_i)_{p'}$, with $d\xi^i = \dot{\gamma}^i(t)\,dt$ along the curve, we have

\[ \Pi_{p,p'}(X(t)) = \sum_{k} \Big\{ X^k(t) - dt \sum_{i,j} \dot{\gamma}^i(t)\, X^j(t)\, (\Gamma_{ij}^k)_p \Big\} (\partial_k)_{p'} \tag{8} \]

Connection coefficients (Christoffel symbols): $(\Gamma_{ij}^k)_p$

Given a connection on the manifold S, the values of $(\Gamma_{ij}^k)_p$ differ between coordinate systems. They describe how tangent vectors change across the manifold, and thus how the basis vectors change.
In

\[ \Pi_{p,p'}((\partial_j)_p) = (\partial_j)_{p'} - \sum_{i,k} d\xi^i\, (\Gamma_{ij}^k)_p\, (\partial_k)_{p'}, \]

if we let $\Gamma_{ij}^k = 0$ for $i, j, k = x, y$, we will have

\[ \Pi_{p,p'}((\partial_j)_p) = (\partial_j)_{p'} \]

Connection coefficients (Christoffel symbols): $(\Gamma_{ij}^k)_p$

Given a connection on a manifold S, the $\Gamma_{ij}^k$ depend on the coordinate system. If we define a connection whose $\Gamma_{ij}^k$ are zero in one coordinate system, we will get non-zero connection coefficients in some other coordinate systems.

Example: If we require the connection coefficients for Cartesian coordinates of a 2D flat plane to be zero, $\Gamma_{ij}^k = 0$ for $i, j, k = x, y$, we can calculate the connection coefficients for polar coordinates: $\Gamma_{r\varphi}^{\varphi} = \Gamma_{\varphi r}^{\varphi} = \frac{1}{r}$, $\Gamma_{\varphi\varphi}^{r} = -r$, and $\Gamma_{ij}^k = 0$ for all others.
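
The polar coefficients in this example can be recovered from the standard change-of-coordinates rule for connection coefficients. When the coefficients vanish in the Cartesian coordinates $x^a$, the rule reduces to $\Gamma_{ij}^k = \sum_a \frac{\partial \xi^k}{\partial x^a} \frac{\partial^2 x^a}{\partial \xi^i \partial \xi^j}$. The sketch below hard-codes the derivatives of x = r cos(phi), y = r sin(phi); the transformation rule and the function name are assumptions brought in for illustration, not taken from the slides.

```python
import math

def polar_christoffel(r, phi):
    """Coefficients gamma[k][i][j] in polar coordinates (index 0 = r, 1 = phi),
    assuming they vanish in Cartesian coordinates x = r cos(phi), y = r sin(phi)."""
    # Inverse Jacobian d(r, phi)/d(x, y): rows k = r, phi; columns a = x, y.
    jac_inv = [[math.cos(phi), math.sin(phi)],
               [-math.sin(phi) / r, math.cos(phi) / r]]
    # Second derivatives hess[a][i][j] = d^2 x^a / (d xi^i d xi^j).
    hess = [
        [[0.0, -math.sin(phi)], [-math.sin(phi), -r * math.cos(phi)]],  # a = x
        [[0.0, math.cos(phi)], [math.cos(phi), -r * math.sin(phi)]],    # a = y
    ]
    return [[[sum(jac_inv[k][a] * hess[a][i][j] for a in range(2))
              for j in range(2)] for i in range(2)] for k in range(2)]

gamma = polar_christoffel(2.0, 0.6)
print(gamma[1][0][1], gamma[1][1][0], gamma[0][1][1])  # compare with 1/r, 1/r, -r at r = 2
```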

Connection coefficients(Christoffel’s symbols): (Γkij )p

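Example (cont.):
Now if we instead require the connection coefficients for polar coordinates to be zero, $\Gamma_{ij}^k = 0$ for $i, j, k = r, \varphi$, we can calculate the connection coefficients for Cartesian coordinates:

\[ \Gamma_{xx}^{x} = \frac{-\sin^2\varphi\cos\varphi}{r}, \quad \Gamma_{xx}^{y} = \frac{\sin\varphi\,(1 + \cos^2\varphi)}{r}, \]
\[ \Gamma_{xy}^{x} = \Gamma_{yx}^{x} = \frac{-\sin^3\varphi}{r}, \quad \Gamma_{xy}^{y} = \Gamma_{yx}^{y} = \frac{-\cos^3\varphi}{r}, \]
\[ \Gamma_{yy}^{x} = \frac{\cos\varphi\,(1 + \sin^2\varphi)}{r}, \quad \Gamma_{yy}^{y} = \frac{-\sin\varphi\cos^2\varphi}{r}. \]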
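
These Cartesian coefficients can be cross-checked with the same change-of-coordinates rule, now applied in the opposite direction: $\Gamma_{ab}^{c} = \sum_k \frac{\partial x^c}{\partial \xi^k} \frac{\partial^2 \xi^k}{\partial x^a \partial x^b}$ with $\xi = (r, \varphi)$. The sketch below is an illustration under that assumed rule; the function names and the sample point are hypothetical.

```python
import math

def cartesian_christoffel(x, y):
    """gamma[c][a][b] in Cartesian coordinates (index 0 = x, 1 = y), assuming the
    coefficients vanish in polar coordinates r = sqrt(x^2 + y^2), phi = atan2(y, x)."""
    r = math.hypot(x, y)
    c, s = x / r, y / r                       # cos(phi), sin(phi)
    # Jacobian d(x, y)/d(r, phi): rows = x, y; columns = r, phi.
    jac = [[c, -r * s], [s, r * c]]
    # Second derivatives hess[k][a][b] = d^2 xi^k / (d x^a d x^b).
    hess = [
        [[s * s / r, -s * c / r], [-s * c / r, c * c / r]],            # k = r
        [[2 * s * c / r ** 2, (s * s - c * c) / r ** 2],
         [(s * s - c * c) / r ** 2, -2 * s * c / r ** 2]],             # k = phi
    ]
    return [[[sum(jac[m][k] * hess[k][a][b] for k in range(2))
              for b in range(2)] for a in range(2)] for m in range(2)]

def slide_values(r, phi):
    """Closed forms listed in the example, keyed as 'upper_lower'."""
    s, c = math.sin(phi), math.cos(phi)
    return {"x_xx": -s * s * c / r, "y_xx": s * (1 + c * c) / r,
            "x_xy": -s ** 3 / r, "y_xy": -c ** 3 / r,
            "x_yy": c * (1 + s * s) / r, "y_yy": -s * c * c / r}

r, phi = 1.5, 0.8
gamma = cartesian_christoffel(r * math.cos(phi), r * math.sin(phi))
expected = slide_values(r, phi)
print(gamma[0][0][0], expected["x_xx"])
```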

Affine connection

Covariant derivative along curves


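Derivative: $\frac{dX(t)}{dt} = \lim_{dt \to 0} \frac{X(t + dt) - X(t)}{dt}$. What if X(t) and X(t + dt) lie in different tangent spaces? We first translate X(t + dt) back to the tangent space at $\gamma(t)$:

\[ X_t(t + dt) = \Pi_{\gamma(t+dt),\gamma(t)}(X(t + dt)) \]

\[ \delta X(t) = X_t(t + dt) - X(t) = \Pi_{\gamma(t+dt),\gamma(t)}(X(t + dt)) - X(t) \]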

Affine connection

Covariant derivative along curves


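We call $\frac{\delta X(t)}{dt}$ the covariant derivative of X(t):

\[ \frac{\delta X(t)}{dt} = \lim_{dt \to 0} \frac{X_t(t + dt) - X(t)}{dt} = \lim_{dt \to 0} \frac{\Pi_{\gamma(t+dt),\gamma(t)}(X(t + dt)) - X(t)}{dt} \tag{9} \]

\[ \Pi_{\gamma(t+dt),\gamma(t)}(X(t + dt)) = \sum_{k} \Big\{ X^k(t + dt) + dt \sum_{i,j} \dot{\gamma}^i(t)\, X^j(t)\, (\Gamma_{ij}^k)_{\gamma(t)} \Big\} (\partial_k)_{\gamma(t)} \tag{10} \]

\[ \frac{\delta X(t)}{dt} = \sum_{k} \Big\{ \dot{X}^k(t) + \sum_{i,j} \dot{\gamma}^i(t)\, X^j(t)\, (\Gamma_{ij}^k)_{\gamma(t)} \Big\} (\partial_k)_{\gamma(t)} \tag{11} \]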
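
Equation (11) can be tried out on the flat plane in polar coordinates, using the polar coefficients $\Gamma_{r\varphi}^{\varphi} = \Gamma_{\varphi r}^{\varphi} = 1/r$ and $\Gamma_{\varphi\varphi}^{r} = -r$ from the earlier example. The sketch below takes the constant Cartesian field $\partial_x$, whose polar components are $(\cos\varphi, -\sin\varphi / r)$, and checks that its covariant derivative along a curve vanishes, i.e. that the field is parallel. The curve and the finite-difference step are hypothetical choices.

```python
import math

def gamma_polar(r, phi):
    """Polar connection coefficients G[k][i][j] (index 0 = r, 1 = phi) of the flat plane."""
    G = [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]
    G[0][1][1] = -r
    G[1][0][1] = G[1][1][0] = 1.0 / r
    return G

def curve(t):
    """Hypothetical curve (r(t), phi(t)) in polar coordinates."""
    return (1.0 + t, t * t)

def field(t):
    """Polar components of the constant Cartesian field d/dx along the curve."""
    r, phi = curve(t)
    return (math.cos(phi), -math.sin(phi) / r)

def ddt(g, t, h=1e-6):
    """Central finite difference of a tuple-valued function of t."""
    return [(p - m) / (2 * h) for p, m in zip(g(t + h), g(t - h))]

def covariant_derivative(t):
    """Components of equation (11): X_dot^k + sum_ij gamma_dot^i X^j Gamma^k_ij."""
    G = gamma_polar(*curve(t))
    x, x_dot, g_dot = field(t), ddt(field, t), ddt(curve, t)
    return [x_dot[k] + sum(g_dot[i] * x[j] * G[k][i][j]
                           for i in range(2) for j in range(2))
            for k in range(2)]

result = covariant_derivative(0.5)
print(result)  # both components should vanish up to discretization error
```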

Affine connection

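Covariant derivative of one vector field with respect to another

Covariant derivative of Y w.r.t. X, where $X = \sum_{i=1}^{n} X^i \partial_i$ and $Y = \sum_{i=1}^{n} Y^i \partial_i$:

\[ \nabla_X Y = \sum_{i,k} X^i \Big\{ \partial_i Y^k + \sum_{j} Y^j\, \Gamma_{ij}^k \Big\} \partial_k \tag{12} \]

\[ \nabla_{\partial_i} \partial_j = \sum_{k=1}^{n} \Gamma_{ij}^k\, \partial_k \tag{13} \]

Note: $(\nabla_X Y)_p = \nabla_{X_p} Y \in T_p(S)$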

Examples of Affine connection

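Metric connection
Definition: If for all vector fields $X, Y, Z \in \mathcal{T}(S)$,

\[ Z\langle X, Y \rangle = \langle \nabla_Z X, Y \rangle + \langle X, \nabla_Z Y \rangle, \]

where $Z\langle X, Y \rangle$ denotes the derivative of the function $\langle X, Y \rangle$ along the vector field Z, we say that $\nabla$ is a metric connection w.r.t. g.
Equivalent condition: for all basis vector fields $\partial_i, \partial_j, \partial_k \in \mathcal{T}(S)$,

\[ \partial_k \langle \partial_i, \partial_j \rangle = \langle \nabla_{\partial_k} \partial_i, \partial_j \rangle + \langle \partial_i, \nabla_{\partial_k} \partial_j \rangle. \]

Property: parallel translation w.r.t. a metric connection preserves inner products, which means parallel transport is an isometry:

\[ \langle \Pi_\gamma(D_1), \Pi_\gamma(D_2) \rangle_q = \langle D_1, D_2 \rangle_p. \]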

Examples of Affine connection

Levi-Civita connection
For a given connection, when $\Gamma_{ij}^k = \Gamma_{ji}^k$ holds for all i, j and k, we call it a symmetric connection or torsion-free connection.
From $\nabla_{\partial_i} \partial_j = \sum_{k=1}^{n} \Gamma_{ij}^k\, \partial_k$, we know that for a symmetric connection:

\[ \nabla_{\partial_i} \partial_j = \nabla_{\partial_j} \partial_i \]

If a connection is both metric and symmetric, we call it the Riemannian connection or the Levi-Civita connection w.r.t. g.

Flatness

Affine coordinate system


Let $\{\xi^i\}$ be a coordinate system for S. We call $\{\xi^i\}$ an affine coordinate system for the connection $\nabla$ if the n basis vector fields $\partial_i = \frac{\partial}{\partial \xi^i}$ are all parallel on S.
Equivalent conditions for a coordinate system to be an affine coordinate system:

\[ \nabla_{\partial_i} \partial_j = \sum_{k=1}^{n} \Gamma_{ij}^k\, \partial_k = 0 \quad \text{for all } i \text{ and } j \tag{14} \]

\[ \Gamma_{ij}^k = 0 \quad \text{for all } i, j \text{ and } k \tag{15} \]

Flatness
S is flat w.r.t. the connection $\nabla$: an affine coordinate system exists for the connection $\nabla$.

Flatness

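Curvature R and torsion T of a connection

\[ R(\partial_i, \partial_j)\partial_k = \sum_{l} R_{ijk}^{l}\, \partial_l \quad \text{and} \quad T(\partial_i, \partial_j) = \sum_{k} T_{ij}^{k}\, \partial_k \tag{16} \]

where $R_{ijk}^{l}$ and $T_{ij}^{k}$ can be computed in the following way (with summation over the repeated index h):

\[ R_{ijk}^{l} = \partial_i \Gamma_{jk}^{l} - \partial_j \Gamma_{ik}^{l} + \Gamma_{ih}^{l}\Gamma_{jk}^{h} - \Gamma_{jh}^{l}\Gamma_{ik}^{h} \tag{17} \]

\[ T_{ij}^{k} = \Gamma_{ij}^{k} - \Gamma_{ji}^{k} \tag{18} \]

If a connection is flat, then T = R = 0.

If T = 0, then $\Gamma_{ij}^k = \Gamma_{ji}^k$ for all i, j and k, and we get a symmetric connection.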
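
Equations (17) and (18) translate directly into a small numerical routine. The sketch below evaluates them for two connections: the flat plane in polar coordinates (coefficients as in the earlier example, so R and T should vanish) and the unit sphere in coordinates (phi, theta). The sphere coefficients $\Gamma_{\theta\theta}^{\varphi} = -\sin\varphi\cos\varphi$ and $\Gamma_{\varphi\theta}^{\theta} = \Gamma_{\theta\varphi}^{\theta} = \cot\varphi$ are the standard Levi-Civita values for the dot-product metric and are an assumption added here, not taken from the slides; derivatives are approximated by finite differences.

```python
import math

def polar_plane(point):
    """Gamma[k][i][j] for the flat plane in polar coordinates (r, phi)."""
    r, _ = point
    G = [[[0.0] * 2 for _ in range(2)] for _ in range(2)]
    G[0][1][1] = -r
    G[1][0][1] = G[1][1][0] = 1.0 / r
    return G

def unit_sphere(point):
    """Assumed Levi-Civita coefficients of the unit sphere in (phi, theta)."""
    phi, _ = point
    G = [[[0.0] * 2 for _ in range(2)] for _ in range(2)]
    G[0][1][1] = -math.sin(phi) * math.cos(phi)
    G[1][0][1] = G[1][1][0] = math.cos(phi) / math.sin(phi)
    return G

def d_gamma(conn, point, axis, h=1e-5):
    """Central finite difference of all connection coefficients along one coordinate."""
    plus, minus = list(point), list(point)
    plus[axis] += h
    minus[axis] -= h
    gp, gm = conn(plus), conn(minus)
    return [[[(gp[k][i][j] - gm[k][i][j]) / (2 * h) for j in range(2)]
             for i in range(2)] for k in range(2)]

def curvature(conn, point):
    """R[l][i][j][k] from equation (17)."""
    G = conn(point)
    dG = [d_gamma(conn, point, axis) for axis in range(2)]  # dG[i][l][j][k] = d_i Gamma^l_jk
    n = range(2)
    return [[[[dG[i][l][j][k] - dG[j][l][i][k]
               + sum(G[l][i][m] * G[m][j][k] - G[l][j][m] * G[m][i][k] for m in n)
               for k in n] for j in n] for i in n] for l in n]

def torsion(conn, point):
    """T[k][i][j] from equation (18)."""
    G = conn(point)
    return [[[G[k][i][j] - G[k][j][i] for j in range(2)] for i in range(2)] for k in range(2)]

def max_abs(nested):
    """Largest absolute entry of an arbitrarily nested list."""
    return max(max_abs(x) for x in nested) if isinstance(nested, list) else abs(nested)

R_plane = curvature(polar_plane, (2.0, 0.6))
R_sphere = curvature(unit_sphere, (1.0, 0.3))
print(max_abs(R_plane), R_sphere[0][0][1][1])
```

For the sphere, the component with l = i = phi and j = k = theta works out to $\sin^2\varphi$ under equation (17), so the curvature does not vanish there.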

Flatness

Curvature
Curvature R = 0 iff parallel translation does not depend on the choice of curve.
Curvature is independent of the coordinate system. Under the Riemannian connection, we can calculate:
Curvature of a 2-dimensional plane: R = 0;
Curvature of a sphere of radius r in 3-dimensional space: $R = \frac{2}{r^2}$.

Autoparallel submanifold

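Equivalent conditions for a submanifold M of S to be autoparallel:

\[ \nabla_X Y \in \mathcal{T}(M) \quad \text{for all } X, Y \in \mathcal{T}(M) \tag{19} \]

\[ \nabla_{\partial_a} \partial_b \in \mathcal{T}(M) \quad \text{for all } a \text{ and } b \tag{20} \]

\[ \nabla_{\partial_a} \partial_b = \sum_{c} \Gamma_{ab}^{c}\, \partial_c \tag{21} \]

where $\partial_a = \frac{\partial}{\partial u^a}$ and $\partial_b = \frac{\partial}{\partial u^b}$ are the basis vector fields for the submanifold M w.r.t. the coordinate system $\{u^i\}$.

Examples of autoparallel submanifolds:

Open subsets of the manifold S are autoparallel;
A curve with the property that all its tangent vectors are parallel.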

Autoparallel submanifold

Geodesics
Geodesics (autoparallel curves): curves whose tangent vector is transported by parallel translation along the curve itself.
Examples under the Riemannian connection:
2-dimensional flat plane: straight line
Sphere in 3-dimensional space: great circle

Autoparallel submanifold

Geodesics
The geodesics with respect to the Riemannian connection are known to coincide (locally) with the shortest curve joining two points.
Shortest curve: curve with the shortest length.
Length of a curve $\gamma: [a, b] \to S$:

\[ \|\gamma\| = \int_a^b \left\| \frac{d\gamma}{dt} \right\| dt = \int_a^b \sqrt{g_{ij}\, \dot{\gamma}^i \dot{\gamma}^j}\; dt \tag{22} \]
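
The length integral (22) is straightforward to approximate numerically. The sketch below uses the midpoint rule on the unit sphere in coordinates (phi, theta), assuming the metric $g = \mathrm{diag}(1, \sin^2\varphi)$ induced by the ordinary dot product (not the scaled inner product used in the earlier metric example). The function names, step counts and test curves are illustrative.

```python
import math

def sphere_metric(point):
    """Assumed metric g = diag(1, sin^2(phi)) on the unit sphere in (phi, theta)."""
    phi, _ = point
    return [[1.0, 0.0], [0.0, math.sin(phi) ** 2]]

def curve_length(metric, curve, a, b, steps=2000, h=1e-6):
    """Midpoint-rule approximation of the length integral in equation (22)."""
    dt = (b - a) / steps
    total = 0.0
    for n in range(steps):
        t = a + (n + 0.5) * dt
        g = metric(curve(t))
        # Finite-difference velocity components gamma_dot^i at the midpoint.
        vel = [(p - m) / (2 * h) for p, m in zip(curve(t + h), curve(t - h))]
        speed_sq = sum(g[i][j] * vel[i] * vel[j] for i in range(2) for j in range(2))
        total += math.sqrt(speed_sq) * dt
    return total

# A quarter of the equator (a great-circle arc) and a full circle of latitude phi = pi/6.
quarter_equator = curve_length(sphere_metric, lambda t: (math.pi / 2, t), 0.0, math.pi / 2)
latitude_circle = curve_length(sphere_metric, lambda t: (math.pi / 6, t), 0.0, 2 * math.pi)
print(quarter_equator, latitude_circle)  # compare with pi/2 and 2*pi*sin(pi/6) = pi
```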

Part II

Geometric structure of statistical models and statistical


inference

Motivation

Motivation
Consider the set of probability distributions as a manifold.

Analyze the relationship between the geometric structure of the manifold and statistical estimation.

Introduce concepts like metric and affine connection on statistical models, and study quantities such as distance, the tangent space (which provides linear approximations), geodesics and the curvature of a manifold.

Statistical models

Statistical models
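\[ \mathcal{P}(\mathcal{X}) = \left\{ p: \mathcal{X} \to \mathbb{R} \;\middle|\; p(x) > 0\ (\forall x \in \mathcal{X}),\ \int p(x)\,dx = 1 \right\} \tag{23} \]

Example: Normal distribution

\[ \mathcal{X} = \mathbb{R}, \quad n = 2, \quad \xi = [\mu, \sigma], \quad \Xi = \{[\mu, \sigma] \mid -\infty < \mu < \infty,\ 0 < \sigma < \infty\} \]

\[ p(x; \xi) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right) \]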

Geometric structure of statistical models and statistical inference

Basic concepts
The Fisher metric and α-connection
Exponential family
Divergence and Geometric statistical inference

The Fisher information matrix

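Fisher information matrix $G(\xi) = [g_{ij}(\xi)]$, where

\[ g_{ij}(\xi) = E_\xi[\partial_i \ell_\xi\, \partial_j \ell_\xi] = \int \partial_i \ell(x; \xi)\, \partial_j \ell(x; \xi)\, p(x; \xi)\,dx \]

with $\ell_\xi = \ell(x; \xi) = \log p(x; \xi)$, and $E_\xi$ denotes the expectation w.r.t. the distribution $p_\xi$.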

Motivation:
Sufficient statistic and Cramér-Rao bound
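
As an illustration of the definition, the Fisher information matrix of the normal model with coordinates $\xi = [\mu, \sigma]$ can be approximated by numerical integration. The sketch below uses the midpoint rule over a wide interval around the mean; the integration range, step count and function names are assumptions. The result should be close to the well-known closed form $\mathrm{diag}(1/\sigma^2, 2/\sigma^2)$.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of the normal distribution with mean mu and standard deviation sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def score(x, mu, sigma):
    """Partial derivatives of log p(x; mu, sigma) with respect to mu and sigma."""
    return ((x - mu) / sigma ** 2,
            -1.0 / sigma + (x - mu) ** 2 / sigma ** 3)

def fisher_matrix(mu, sigma, steps=20000, width=12.0):
    """Midpoint-rule approximation of g_ij = integral of d_i l * d_j l * p dx."""
    lo, hi = mu - width * sigma, mu + width * sigma
    dx = (hi - lo) / steps
    g = [[0.0, 0.0], [0.0, 0.0]]
    for n in range(steps):
        x = lo + (n + 0.5) * dx
        s, p = score(x, mu, sigma), normal_pdf(x, mu, sigma)
        for i in range(2):
            for j in range(2):
                g[i][j] += s[i] * s[j] * p * dx
    return g

G = fisher_matrix(mu=1.0, sigma=2.0)
print(G)  # compare with [[1/sigma^2, 0], [0, 2/sigma^2]] for sigma = 2
```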

The Fisher information matrix

Sufficient statistic
Sufficient statistic: for Y = F(X), given the distribution $p(x; \xi)$ of X, write $p(x; \xi) = q(F(x); \xi)\, r(x; \xi)$, where $q(y; \xi)$ is the distribution of Y. If $r(x; \xi)$ does not depend on $\xi$ for all x, we say that F is a sufficient statistic for the model S. Then we can write $p(x; \xi) = q(y; \xi)\, r(x)$.
A sufficient statistic is a function whose value contains all the information needed to compute any estimate of the parameter (e.g. a maximum likelihood estimate).

Fisher information matrix and sufficient statistic

Let $G(\xi)$ be the Fisher information matrix of $S = \{p(x; \xi)\}$, and $G_F(\xi)$ be the Fisher information matrix of the induced model $S_F = \{q(y; \xi)\}$. Then we have $G_F(\xi) \leq G(\xi)$ in the sense that $\Delta G(\xi) = G(\xi) - G_F(\xi)$ is positive semidefinite. $\Delta G(\xi) = 0$ iff F is a sufficient statistic for S.

Cramér-Rao inequality

Cramér-Rao inequality
The variance of any unbiased estimator is at least as high as the inverse of the Fisher information.
Unbiased estimator $\hat{\xi}$: $E_\xi[\hat{\xi}(X)] = \xi$
The variance-covariance matrix $V_\xi[\hat{\xi}] = [v_\xi^{ij}]$, where

\[ v_\xi^{ij} = E_\xi[(\hat{\xi}^i(X) - \xi^i)(\hat{\xi}^j(X) - \xi^j)] \]

Thus the Cramér-Rao inequality states that $V_\xi[\hat{\xi}] \geq G(\xi)^{-1}$, and an unbiased estimator $\hat{\xi}$ satisfying $V_\xi[\hat{\xi}] = G(\xi)^{-1}$ is called an efficient estimator.
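
A one-parameter case where the bound is attained can be worked out by exact enumeration. The sketch below assumes a single Bernoulli observation with success probability p (an example chosen here, not taken from the slides) and the unbiased estimator $\hat{p}(X) = X$; its variance p(1 - p) should equal the inverse Fisher information, so the estimator is efficient.

```python
def bernoulli_fisher(p):
    """Fisher information of a single Bernoulli(p) observation w.r.t. the parameter p."""
    score_one = 1.0 / p             # d/dp log p(x = 1; p)
    score_zero = -1.0 / (1.0 - p)   # d/dp log p(x = 0; p)
    return p * score_one ** 2 + (1.0 - p) * score_zero ** 2

def estimator_variance(p):
    """Variance of the unbiased estimator p_hat(X) = X, by enumeration over x in {0, 1}."""
    mean = p * 1.0 + (1.0 - p) * 0.0
    return p * (1.0 - mean) ** 2 + (1.0 - p) * (0.0 - mean) ** 2

p = 0.3
bound = 1.0 / bernoulli_fisher(p)
variance = estimator_variance(p)
print(bound, variance)  # both should equal p * (1 - p) up to rounding
```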

α-connection

α-connection
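Let $S = \{p_\xi\}$ be an n-dimensional model, and consider the function $\Gamma_{ij,k}^{(\alpha)}$ which maps each point $\xi$ to the following value:

\[ (\Gamma_{ij,k}^{(\alpha)})_\xi = E_\xi\left[ \left( \partial_i \partial_j \ell_\xi + \frac{1 - \alpha}{2}\, \partial_i \ell_\xi\, \partial_j \ell_\xi \right) (\partial_k \ell_\xi) \right] \tag{24} \]

where $\alpha$ is an arbitrary real number. We define an affine connection $\nabla^{(\alpha)}$ which satisfies:

\[ \left\langle \nabla_{\partial_i}^{(\alpha)} \partial_j, \partial_k \right\rangle = \Gamma_{ij,k}^{(\alpha)} \tag{25} \]

where $g = \langle \cdot, \cdot \rangle$ is the Fisher metric. We call $\nabla^{(\alpha)}$ the α-connection.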

α-connection

Properties of α-connection
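The α-connection is a symmetric connection.
Relationship between the α-connection and the β-connection:

\[ \Gamma_{ij,k}^{(\beta)} = \Gamma_{ij,k}^{(\alpha)} + \frac{\alpha - \beta}{2}\, E[\partial_i \ell_\xi\, \partial_j \ell_\xi\, \partial_k \ell_\xi] \]

The 0-connection is the Riemannian connection with respect to the Fisher metric:

\[ \Gamma_{ij,k}^{(\beta)} = \Gamma_{ij,k}^{(0)} + \frac{-\beta}{2}\, E[\partial_i \ell_\xi\, \partial_j \ell_\xi\, \partial_k \ell_\xi] \]

\[ \nabla^{(\alpha)} = \frac{1 + \alpha}{2}\, \nabla^{(1)} + \frac{1 - \alpha}{2}\, \nabla^{(-1)} \]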

Exponential family

Exponential family
\[ p(x; \theta) = \exp\left[ C(x) + \sum_{i=1}^{n} \theta^i F_i(x) - \psi(\theta) \right] \]

$[\theta^i]$ are called the natural parameters (coordinates), and $\psi$ is the potential function for $[\theta^i]$, which can be calculated as

\[ \psi(\theta) = \log \int \exp\left[ C(x) + \sum_{i=1}^{n} \theta^i F_i(x) \right] dx \]

The exponential families include many of the most common distributions, including the normal, exponential, gamma, beta, Dirichlet, Bernoulli, binomial, multinomial, Poisson, and so on.

Exponential family

Exponential family
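Example: Normal distribution

\[ p(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}} \tag{26} \]

where $C(x) = 0$, $F_1(x) = x$, $F_2(x) = x^2$, and $\theta^1 = \frac{\mu}{\sigma^2}$, $\theta^2 = -\frac{1}{2\sigma^2}$ are the natural parameters. The potential function is:

\[ \psi = -\frac{(\theta^1)^2}{4\theta^2} + \frac{1}{2}\log\left( -\frac{\pi}{\theta^2} \right) = \frac{\mu^2}{2\sigma^2} + \log\left( \sqrt{2\pi}\,\sigma \right) \tag{27} \]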
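
The closed form of the potential function can be checked against its defining integral, $\psi(\theta) = \log \int \exp[\theta^1 x + \theta^2 x^2]\,dx$ (with C(x) = 0). The sketch below is a minimal numerical comparison; the integration limits, step count, sample parameters and function names are assumptions.

```python
import math

def natural_parameters(mu, sigma):
    """Natural parameters (theta1, theta2) of the normal distribution."""
    return mu / sigma ** 2, -1.0 / (2 * sigma ** 2)

def potential_closed_form(theta1, theta2):
    """Equation (27) written in terms of the natural parameters."""
    return -theta1 ** 2 / (4 * theta2) + 0.5 * math.log(-math.pi / theta2)

def potential_by_integration(theta1, theta2, lo=-40.0, hi=40.0, steps=100000):
    """log of the integral of exp(theta1 * x + theta2 * x^2), by the midpoint rule."""
    dx = (hi - lo) / steps
    total = 0.0
    for n in range(steps):
        x = lo + (n + 0.5) * dx
        total += math.exp(theta1 * x + theta2 * x * x)
    return math.log(total * dx)

mu, sigma = 1.0, 2.0
theta1, theta2 = natural_parameters(mu, sigma)
closed = potential_closed_form(theta1, theta2)
numeric = potential_by_integration(theta1, theta2)
print(closed, numeric)
```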

Mixture family

Mixture family
\[ p(x; \theta) = C(x) + \sum_{i=1}^{n} \theta^i F_i(x) \]

In this case we say that S is a mixture family and $[\theta^i]$ are called the mixture parameters.

e-connection and m-connection

The natural parameters of an exponential family form a 1-affine coordinate system ($\Gamma_{ij,k}^{(1)} = 0$), which means the family is 1-flat. We call the connection $\nabla^{(1)}$ the e-connection, and call an exponential family e-flat.
The mixture parameters of a mixture family form a (-1)-affine coordinate system ($\Gamma_{ij,k}^{(-1)} = 0$), which means the family is (-1)-flat. We call the connection $\nabla^{(-1)}$ the m-connection, and call a mixture family m-flat.
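
Both statements can be observed on a small model. The sketch below assumes the Bernoulli model on outcomes {0, 1} (an example chosen here, with sums in place of integrals), which is an exponential family in the natural parameter $\theta = \log\frac{p}{1-p}$ and a mixture family in p itself, since $p(x; p) = (1 - x) + p\,(2x - 1)$. It evaluates the single coefficient $\Gamma_{11,1}^{(\alpha)} = E[(\ell'' + \frac{1 - \alpha}{2}\,\ell'^2)\,\ell']$ in both coordinate systems; all function names are hypothetical.

```python
import math

def alpha_coefficient(alpha, probs, d1, d2):
    """Gamma^(alpha)_{11,1} = E[(l'' + (1 - alpha)/2 * l'^2) * l'] on a finite sample space;
    probs, d1, d2 list p(x), l'(x), l''(x) for each outcome x."""
    return sum(p * (b + 0.5 * (1 - alpha) * a * a) * a for p, a, b in zip(probs, d1, d2))

def bernoulli_mixture_coords(p, alpha):
    """Bernoulli model with the success probability p as the (mixture) parameter."""
    probs = [1 - p, p]                          # outcomes x = 0, 1
    d1 = [-1 / (1 - p), 1 / p]                  # d/dp log p(x; p)
    d2 = [-1 / (1 - p) ** 2, -1 / p ** 2]       # d^2/dp^2 log p(x; p)
    return alpha_coefficient(alpha, probs, d1, d2)

def bernoulli_natural_coords(theta, alpha):
    """Same model with the natural parameter theta = log(p / (1 - p))."""
    p = 1 / (1 + math.exp(-theta))
    probs = [1 - p, p]
    d1 = [0 - p, 1 - p]                         # d/dtheta log p(x; theta) = x - p
    d2 = [-p * (1 - p), -p * (1 - p)]           # d^2/dtheta^2 log p(x; theta) = -psi''(theta)
    return alpha_coefficient(alpha, probs, d1, d2)

m_in_mixture = bernoulli_mixture_coords(0.3, alpha=-1)   # should vanish
e_in_natural = bernoulli_natural_coords(0.8, alpha=1)    # should vanish
e_in_mixture = bernoulli_mixture_coords(0.3, alpha=1)    # generally non-zero
print(m_in_mixture, e_in_natural, e_in_mixture)
```

The non-zero last value reflects that the mixture parameter is an affine coordinate for the m-connection but not for the e-connection.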

Dual connection

Dual connection
Definition: Let S be a manifold on which there is given a Riemannian metric g and two affine connections $\nabla$ and $\nabla^*$. If for all vector fields $X, Y, Z \in \mathcal{T}(S)$,

\[ Z\langle X, Y \rangle = \langle \nabla_Z X, Y \rangle + \langle X, \nabla_Z^* Y \rangle \tag{28} \]

holds, we say that $\nabla$ and $\nabla^*$ are duals of each other w.r.t. g and call one the dual connection of the other.
Additionally, we call the triple $(g, \nabla, \nabla^*)$ a dualistic structure on S.

Dual connection

Properties
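For any statistical model, the α-connection and the (−α)-connection are dual with respect to the Fisher metric.

For a curve γ from p to q and tangent vectors $D_1, D_2$ at p, the pair of parallel translations preserves the inner product:

$$\langle \Pi_\gamma(D_1), \Pi^*_\gamma(D_2)\rangle_q = \langle D_1, D_2\rangle_p,$$

where $\Pi_\gamma$ and $\Pi^*_\gamma$ are the parallel translations along γ w.r.t. $\nabla$ and $\nabla^*$.

$R = 0 \Leftrightarrow R^* = 0$,
where $R$ and $R^*$ are the curvature tensors of $\nabla$ and $\nabla^*$.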
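The first property can be checked numerically on a one-parameter model. The sketch below assumes the standard definition $\Gamma^{(\alpha)}_{ij,k} = E[(\partial_i\partial_j \ell + \frac{1-\alpha}{2}\partial_i \ell\,\partial_j \ell)\,\partial_k \ell]$ with $\ell = \log p$, and uses the coordinate form of duality, $\partial_k g_{ij} = \Gamma_{ki,j} + \Gamma^*_{kj,i}$. The Bernoulli model and all function names are illustrative choices, not part of the slides.

```python
# Sketch: for a Bernoulli model with parameter xi = P(x = 1), check that
# d g / d xi = Gamma^(alpha) + Gamma^(-alpha), the one-dimensional form of
# duality between the alpha- and (-alpha)-connections.


def bernoulli_terms(xi):
    """(probability, dl/dxi, d^2 l/dxi^2) for x = 1 and x = 0, with l = log p."""
    return [(xi, 1.0 / xi, -1.0 / xi ** 2),
            (1.0 - xi, -1.0 / (1.0 - xi), -1.0 / (1.0 - xi) ** 2)]


def fisher(xi):
    """Fisher information g = E[(dl/dxi)^2]."""
    return sum(p * d1 * d1 for p, d1, _ in bernoulli_terms(xi))


def gamma_alpha(xi, alpha):
    """Assumed alpha-connection coefficient E[(l'' + (1 - alpha)/2 * l'^2) * l']."""
    return sum(p * (d2 + 0.5 * (1.0 - alpha) * d1 * d1) * d1
               for p, d1, d2 in bernoulli_terms(xi))


xi, alpha, h = 0.3, 0.6, 1e-6
dg = (fisher(xi + h) - fisher(xi - h)) / (2.0 * h)      # central difference
rhs = gamma_alpha(xi, alpha) + gamma_alpha(xi, -alpha)
print(dg, rhs)
```

The α-dependent terms cancel in the sum, so the same check passes for any real α.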


Dually flat spaces and dual coordinate system

Dually flat spaces


Let $(g, \nabla, \nabla^*)$ be a dualistic structure on a manifold S. Then $R = 0 \Leftrightarrow R^* = 0$, and if the connections $\nabla$ and $\nabla^*$ are both symmetric ($T = T^* = 0$), then $\nabla$-flatness and $\nabla^*$-flatness are equivalent.
We call $(S, g, \nabla, \nabla^*)$ a dually flat space if both duals $\nabla$ and $\nabla^*$ are flat.
Example: Since the α-connection and the (−α)-connection are dual w.r.t. the Fisher metric and α-connections are symmetric, we have for any statistical model S and for any real number α

S is α-flat ⇔ S is (−α)-flat (29)

Dual coordinate system


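For a particular $\nabla$-affine coordinate system $[\theta^i]$, suppose we choose a corresponding $\nabla^*$-affine coordinate system $[\eta_j]$ such that

$$g(\partial_i, \partial^j) = \langle \partial_i, \partial^j \rangle = \delta_i^j,$$

where $\partial_i = \frac{\partial}{\partial\theta^i}$ and $\partial^j = \frac{\partial}{\partial\eta_j}$.
Then we say the two coordinate systems are mutually dual w.r.t. the metric g, and call one the dual coordinate system of the other.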

Existence of dual coordinate system


A pair of dual coordinate systems exists if and only if $(S, g, \nabla, \nabla^*)$ is a dually flat space.


Legendre transformations

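Consider mutually dual coordinate systems $[\theta^i]$ and $[\eta_i]$ with functions $\psi : S \to \mathbb{R}$ and $\varphi : S \to \mathbb{R}$ that satisfy the following equations:

$$\begin{aligned}
\partial_i \psi &= \eta_i \\
\partial^i \varphi &= \theta^i \\
g_{ij} &= \partial_i \eta_j = \partial_j \eta_i = \partial_i \partial_j \psi \\
\varphi(\eta) &= \max_\theta \{\theta^i \eta_i - \psi(\theta)\} \\
\psi(\theta) &= \max_\eta \{\theta^i \eta_i - \varphi(\eta)\}
\end{aligned}$$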

Geometric interpretation of $f^*(p) = \max_x (px - f(x))$:
A convex function $f(x)$ is shown in red, and the tangent line at the point $(x_0, f(x_0))$ is shown in blue. The tangent line intersects the vertical axis at $(0, -f^*)$, and $f^*$ is the value of the Legendre transform $f^*(p_0)$, where $p_0 = f'(x_0)$. Note that for any other point on the red curve, a line drawn through that point with the same slope as the blue line will have a y-intercept above the point $(0, -f^*)$, showing that $f^*$ is indeed a maximum.


Examples of Legendre transformations

Examples:
The Legendre transform of $f(x) = \frac{1}{p}|x|^p$ (where $1 < p < \infty$) is $f^*(x^*) = \frac{1}{q}|x^*|^q$ (where $1 < q < \infty$ and $\frac{1}{p} + \frac{1}{q} = 1$),
The Legendre transform of $f(x) = e^x$ is $f^*(x^*) = x^* \ln x^* - x^*$ (where $x^* > 0$),
The Legendre transform of $f(x) = \frac{1}{2} x^T A x$ (with A symmetric positive definite) is $f^*(x^*) = \frac{1}{2} x^{*T} A^{-1} x^*$,
The Legendre transform of $f(x) = |x|$ is $f^*(x^*) = 0$ if $|x^*| \le 1$, and $f^*(x^*) = \infty$ if $|x^*| > 1$.
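These closed forms can be sanity-checked with a brute-force maximisation. The sketch below approximates $f^*(p) = \max_x (px - f(x))$ on a uniform grid; the helper `legendre`, the grid bounds and the test points are arbitrary choices for illustration, and the approximation is only meaningful when the maximiser lies inside the search interval.

```python
import math


def legendre(f, p, lo=-10.0, hi=10.0, n=20001):
    """Approximate f*(p) = max_x (p*x - f(x)) by brute force on [lo, hi]."""
    step = (hi - lo) / (n - 1)
    return max(p * x - f(x) for x in (lo + k * step for k in range(n)))


# f(x) = e^x  ->  f*(p) = p ln p - p  (p > 0)
exp_num = legendre(math.exp, 2.0)
exp_ref = 2.0 * math.log(2.0) - 2.0

# f(x) = |x|^3 / 3  ->  f*(p) = |p|^q / q with q = 3/2 (so 1/3 + 1/q = 1)
cubic_num = legendre(lambda x: abs(x) ** 3 / 3.0, 2.0)
cubic_ref = 2.0 ** 1.5 / 1.5

# f(x) = |x|: zero for |p| <= 1; for |p| > 1 the value grows with the
# search interval, which is the numerical signature of f*(p) = infinity.
abs_inside = legendre(abs, 0.5)
abs_outside_10 = legendre(abs, 1.5)
abs_outside_100 = legendre(abs, 1.5, lo=-100.0, hi=100.0)

print(exp_num, exp_ref, cubic_num, cubic_ref)
print(abs_inside, abs_outside_10, abs_outside_100)
```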


The natural parameter and dual parameter of Exponential family

For a distribution $p(x;\theta) = \exp\left[C(x) + \sum_{i=1}^{n} \theta^i F_i(x) - \psi(\theta)\right]$, $[\theta^i]$ are called the natural parameters.
If we define $\eta_i = E_\theta[F_i] = \int F_i(x)\,p(x;\theta)\,dx$, we can verify that $[\eta_i]$ is a (-1)-affine coordinate system dual to $[\theta^i]$. We call $[\eta_i]$ the expectation parameters or the dual parameters.

Recall: Normal Distribution


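$$p(x;\mu,\sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \tag{30}$$

where $C(x) = 0$, $F_1(x) = x$, $F_2(x) = x^2$, and $\theta^1 = \frac{\mu}{\sigma^2}$, $\theta^2 = -\frac{1}{2\sigma^2}$ are the natural parameters. The potential function is:

$$\psi = -\frac{(\theta^1)^2}{4\theta^2} + \frac{1}{2}\log\left(-\frac{\pi}{\theta^2}\right) = \frac{\mu^2}{2\sigma^2} + \log(\sqrt{2\pi}\,\sigma) \tag{31}$$

The dual parameters are calculated as $\eta_1 = \frac{\partial\psi}{\partial\theta^1} = \mu = -\frac{\theta^1}{2\theta^2}$ and $\eta_2 = \frac{\partial\psi}{\partial\theta^2} = \mu^2 + \sigma^2 = \frac{(\theta^1)^2 - 2\theta^2}{4(\theta^2)^2}$. The dual potential function is:

$$\varphi = -\frac{1}{2}\left(1 + \log\left(-\frac{\pi}{\theta^2}\right)\right) = -\frac{1}{2}\left(1 + \log(2\pi) + 2\log\sigma\right) \tag{32}$$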
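The relations for the normal family can be verified numerically. The following sketch evaluates the natural parameters, both potentials and the expectation parameters at one arbitrarily chosen $(\mu, \sigma)$, then checks $\eta_i = \partial\psi/\partial\theta^i$ by central differences and the Legendre identity $\psi + \varphi = \theta^i\eta_i$. Function names and the test point are illustrative assumptions.

```python
import math


def natural(mu, sigma):
    """Natural parameters (theta^1, theta^2) of N(mu, sigma^2)."""
    return mu / sigma ** 2, -1.0 / (2.0 * sigma ** 2)


def psi(t1, t2):
    """Potential psi as a function of the natural parameters (needs t2 < 0)."""
    return -t1 ** 2 / (4.0 * t2) + 0.5 * math.log(-math.pi / t2)


def expectation(mu, sigma):
    """Expectation parameters (eta_1, eta_2) = (E[x], E[x^2])."""
    return mu, mu ** 2 + sigma ** 2


def phi(sigma):
    """Dual potential phi expressed through sigma."""
    return -0.5 * (1.0 + math.log(2.0 * math.pi) + 2.0 * math.log(sigma))


mu, sigma = 1.5, 0.8
t1, t2 = natural(mu, sigma)
eta1, eta2 = expectation(mu, sigma)

h = 1e-6  # step for central differences
d1 = (psi(t1 + h, t2) - psi(t1 - h, t2)) / (2.0 * h)
d2 = (psi(t1, t2 + h) - psi(t1, t2 - h)) / (2.0 * h)

# Legendre identity: psi + phi - theta . eta should vanish.
legendre_gap = psi(t1, t2) + phi(sigma) - (t1 * eta1 + t2 * eta2)
print(d1, eta1, d2, eta2, legendre_gap)
```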

Geometric structure of statistical models and statistical inference

Basic concepts
The Fisher metric and α-connection
Exponential family
Divergence and Geometric statistical inference


Divergences

Let S be a manifold and suppose that we are given a smooth function $D = D(\cdot\|\cdot) : S \times S \to \mathbb{R}$ satisfying, for any $p, q \in S$:

$$D(p\|q) \ge 0, \text{ with equality iff } p = q \tag{33}$$

Such a function D, called a divergence, gives a distance-like measure of the separation between two points.


Divergence, semimetrics and metrics

A distance satisfying positive-definiteness, symmetry and the triangle inequality is called a metric;
A distance satisfying positive-definiteness and symmetry is called a semimetric;
A distance satisfying only positive-definiteness is called a divergence.


Kullback-Leibler divergence

For discrete distributions p and q:

$$D_{KL}(p\|q) = \sum_{x} p(x) \log\frac{p(x)}{q(x)} \tag{34}$$

For continuous distributions p and q:

$$D_{KL}(p\|q) = \int p(x) \log\frac{p(x)}{q(x)}\,dx \tag{35}$$

Generally, we use the Kullback-Leibler divergence to measure the difference between two probability distributions p and q. With logarithms taken to base 2, KL measures the expected number of extra bits required to code samples from p when using a code based on q, rather than using a code based on p. Typically p represents the "true" distribution of data, observations, or a precisely calculated theoretical distribution. The measure q typically represents a theory, model, description, or approximation of p.
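A minimal sketch of the discrete formula (34) follows. It uses natural logarithms, so the result is in nats (divide by $\ln 2$ for bits), and it assumes $q(x) > 0$ wherever $p(x) > 0$. The two example distributions are arbitrary.

```python
import math


def kl_divergence(p, q):
    """Discrete KL divergence sum_x p(x) log(p(x)/q(x)), in nats.

    Terms with p(x) = 0 contribute zero; q(x) must be positive
    wherever p(x) is positive.
    """
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)


p = [0.5, 0.3, 0.2]            # "true" distribution (illustrative)
q = [1.0 / 3.0] * 3            # uniform model of p

forward = kl_divergence(p, q)  # D_KL(p || q)
reverse = kl_divergence(q, p)  # D_KL(q || p)
print(forward, reverse, kl_divergence(p, p))
```

The two directions give different values, which is why KL is a divergence rather than a metric.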

Bregman divergence

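The Bregman divergence associated with F for points $x, y \in \Delta$ is:

$$B_F(x\|y) = F(x) - F(y) - \langle x - y, \nabla F(y)\rangle,$$

where $F$ is a differentiable, strictly convex function defined on a closed convex set $\Delta$, and $\nabla F$ denotes its gradient.

Examples:
If $F(x) = \|x\|^2$, then $B_F(x\|y) = \|x - y\|^2$.
More generally, if $F(x) = \frac{1}{2} x^T A x$, then $B_F(x\|y) = \frac{1}{2}(x - y)^T A (x - y)$.
KL divergence: if $F(x) = \sum_i (x_i \log x_i - x_i)$, then $B_F(x\|y) = \sum_i \left(x_i \log\frac{x_i}{y_i} - x_i + y_i\right)$, which is the Kullback-Leibler divergence when x and y are probability vectors.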
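The examples can be reproduced with a small generic routine. This sketch assumes the common convention $B_F(x\|y) = F(x) - F(y) - \langle x - y, \nabla F(y)\rangle$ and takes the gradient as an explicit argument; the function names and test vectors are illustrative.

```python
import math


def bregman(F, grad_F, x, y):
    """Bregman divergence F(x) - F(y) - <x - y, grad F(y)> (assumed convention)."""
    inner = sum((xi - yi) * gi for xi, yi, gi in zip(x, y, grad_F(y)))
    return F(x) - F(y) - inner


def sq_norm(v):
    return sum(t * t for t in v)


def sq_norm_grad(v):
    return [2.0 * t for t in v]


def neg_entropy(v):
    """F(x) = sum_i (x_i log x_i - x_i), defined for positive entries."""
    return sum(t * math.log(t) - t for t in v)


def neg_entropy_grad(v):
    return [math.log(t) for t in v]


x = [0.5, 0.3, 0.2]
y = [0.2, 0.2, 0.6]

euclid = bregman(sq_norm, sq_norm_grad, x, y)           # squared distance
kl_xy = bregman(neg_entropy, neg_entropy_grad, x, y)    # KL(x || y) for probability vectors
kl_ref = sum(a * math.log(a / b) for a, b in zip(x, y))
print(euclid, kl_xy, kl_ref)
```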


Canonical divergence

Canonical divergence (a divergence for dually flat spaces)

Let $(S, g, \nabla, \nabla^*)$ be a dually flat space, and let $\{[\theta^i], [\eta_i]\}$ be mutually dual affine coordinate systems with potentials $\{\psi, \varphi\}$. Then the canonical divergence (the $(g, \nabla)$-divergence) is defined as:

$$D(p\|q) = \psi(p) + \varphi(q) - \theta^i(p)\,\eta_i(q) \tag{36}$$
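As a hedged worked example, take the Bernoulli family as a one-dimensional exponential family with natural parameter $\theta = \log\frac{\pi}{1-\pi}$, potential $\psi(\theta) = \log(1 + e^\theta)$, expectation parameter $\eta = \pi$ and dual potential $\varphi(\eta) = \eta\log\eta + (1-\eta)\log(1-\eta)$. With $\nabla$ taken to be the e-connection (an assumption of this sketch), the canonical divergence $D(p\|q)$ coincides with $D_{KL}(q\|p)$. All names below are illustrative.

```python
import math


def theta_of(pi):
    """Natural parameter of Bernoulli(pi)."""
    return math.log(pi / (1.0 - pi))


def psi(theta):
    """Potential psi(theta) = log(1 + e^theta)."""
    return math.log(1.0 + math.exp(theta))


def phi(eta):
    """Dual potential: negative entropy of Bernoulli(eta)."""
    return eta * math.log(eta) + (1.0 - eta) * math.log(1.0 - eta)


def canonical_divergence(pi_p, pi_q):
    """D(p||q) = psi(p) + phi(q) - theta(p) * eta(q)."""
    return psi(theta_of(pi_p)) + phi(pi_q) - theta_of(pi_p) * pi_q


def kl_bernoulli(a, b):
    """KL divergence from Bernoulli(a) to Bernoulli(b)."""
    return a * math.log(a / b) + (1.0 - a) * math.log((1.0 - a) / (1.0 - b))


pi_p, pi_q = 0.3, 0.7
d_pq = canonical_divergence(pi_p, pi_q)
print(d_pq, kl_bernoulli(pi_q, pi_p))
```

Swapping the roles of $\nabla$ and $\nabla^*$ swaps the arguments, consistent with $D^*(p\|q) = D(q\|p)$.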

Properties:
Relation between the $(g, \nabla)$-divergence $D$ and the $(g, \nabla^*)$-divergence $D^*$: $D^*(p\|q) = D(q\|p)$.
If M is an autoparallel submanifold w.r.t. either $\nabla$ or $\nabla^*$, then the $(g_M, \nabla_M)$-divergence $D_M = D|_{M \times M}$ is given by $D_M(p\|q) = D(p\|q)$.
If $\nabla$ is a Riemannian connection ($\nabla = \nabla^*$) which is flat on S, there exists a coordinate system which is self-dual ($\theta^i = \eta_i$), with $\varphi = \psi = \frac{1}{2}\sum_i (\theta^i)^2$, and the canonical divergence is

$$D(p\|q) = \frac{1}{2}\{d(p, q)\}^2,$$

where $d(p, q) = \sqrt{\sum_i \{\theta^i(p) - \theta^i(q)\}^2}$.

Triangular relation
Let $\{[\theta^i], [\eta_i]\}$ be mutually dual affine coordinate systems of a dually flat space $(S, g, \nabla, \nabla^*)$, and let D be a divergence on S. Then a necessary and sufficient condition for D to be the $(g, \nabla)$-divergence is that for all $p, q, r \in S$ the following triangular relation holds:

$$D(p\|q) + D(q\|r) - D(p\|r) = \{\theta^i(p) - \theta^i(q)\}\{\eta_i(r) - \eta_i(q)\} \tag{37}$$

Pythagorean relation
Let p, q, and r be three points in S. Let $\gamma_1$ be the $\nabla$-geodesic connecting p and q, and let $\gamma_2$ be the $\nabla^*$-geodesic connecting q and r. If at the intersection q the curves $\gamma_1$ and $\gamma_2$ are orthogonal (with respect to the inner product g), then we have the following Pythagorean relation:

$$D(p\|r) = D(p\|q) + D(q\|r) \tag{38}$$
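A familiar instance can be checked numerically. In this sketch, which assumes $\nabla$ is the m-connection so that D is the KL divergence $D_{KL}(p\|q)$, p is a correlated joint distribution of two binary variables, q is the product of its marginals, and r is any other product distribution. The m-geodesic from p to q meets the e-flat family of product distributions orthogonally at q, so the relation should hold exactly. The numbers are arbitrary.

```python
import math


def kl(p, q):
    """Discrete KL divergence in nats (q must be positive where p is)."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def product(a, b):
    """Joint of independent binary X1, X2 with P(X1=1)=a, P(X2=1)=b,
    flattened as [p00, p01, p10, p11]."""
    return [(1 - a) * (1 - b), (1 - a) * b, a * (1 - b), a * b]


p = [0.4, 0.1, 0.2, 0.3]        # correlated joint distribution (illustrative)
a = p[2] + p[3]                 # marginal P(X1 = 1)
b = p[1] + p[3]                 # marginal P(X2 = 1)
q = product(a, b)               # product of the marginals of p
r = product(0.7, 0.2)           # an arbitrary product distribution

gap = kl(p, r) - (kl(p, q) + kl(q, r))
print(kl(p, r), kl(p, q) + kl(q, r), gap)
```

Since $D_{KL}(q\|r) \ge 0$, the same identity shows that q is the closest product distribution to p in the sense of $D_{KL}(p\|\cdot)$.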

Projection theorem
Let p be a point in S and let M be a submanifold of S which is $\nabla^*$-autoparallel. Then a necessary and sufficient condition for a point q in M to satisfy

$$D(p\|q) = \min_{r \in M} D(p\|r) \tag{39}$$

is for the $\nabla$-geodesic connecting p and q to be orthogonal to M at q.

Examples
From the definitions of exponential family and mixture family, a product of exponential families is still an exponential family, and a sum of mixture families is still a mixture family.
e-flat submanifold: the set of all product distributions:
$E_0 = \{p_X \mid p_X(x_1, \cdots, x_N) = \prod_{i=1}^{N} p_{X_i}(x_i)\}$
m-flat submanifold: the set of joint distributions with given marginals:
$M_0 = \{p_X \mid \sum_{X \setminus i} p_X(x) = q_i(x_i)\ \ \forall i \in \{1, \cdots, N\}\}$

Thanks!
Questions?
