Yunshu InformationGeometry PDF
Yunshu Liu
2012-02-17
Introduction to differential geometry Geometric structure of statistical models and statistical inference
Outline
Part I
Basic concepts
Manifold and Submanifold
Tangent vector, Tangent space and Vector field
Riemannian metric and Affine connection
Flatness and autoparallel
Manifold
Manifold S: a set with a coordinate system, that is, a one-to-one mapping from S to R^n, so that S "locally" looks like an open subset of R^n.
Elements of the set (points): points in R^n, probability distributions, linear systems.
Definition: Let S be a set. If there exists a set of coordinate systems A for S which satisfies conditions (1) and (2) below, we call S an n-dimensional C^∞ differentiable manifold.
(1) Each element ϕ of A is a one-to-one mapping from S to some open subset of R^n.
(2) For all ϕ ∈ A, given any one-to-one mapping ψ from S to R^n, the following holds:
ψ ∈ A ⇔ ψ ∘ ϕ^{-1} is a C^∞ diffeomorphism.
Here, by a C^∞ diffeomorphism we mean that ψ ∘ ϕ^{-1} and its inverse ϕ ∘ ψ^{-1} are both C^∞ (infinitely many times differentiable).
Examples of Manifold
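\[ x(\phi, \theta) = (\sin\phi\cos\theta,\ \sin\phi\sin\theta,\ \cos\phi), \quad 0 < \phi < \pi/2,\ 0 \le \theta < 2\pi \tag{1} \]
\[ x(u, v) = \left( \frac{2u}{1+u^2+v^2},\ \frac{2v}{1+u^2+v^2},\ \frac{1-u^2-v^2}{1+u^2+v^2} \right), \quad u^2 + v^2 \le 1 \tag{2} \]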
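As a quick sanity check, the sketch below evaluates both parametrizations (1) and (2) and confirms that the resulting points satisfy x^2 + y^2 + z^2 = 1, i.e. that both are coordinate descriptions of the unit sphere. The function names are illustrative and not part of the slides.

```python
import math

def sphere_angles(phi, theta):
    """Parametrization (1): spherical angles (phi, theta)."""
    return (math.sin(phi) * math.cos(theta),
            math.sin(phi) * math.sin(theta),
            math.cos(phi))

def sphere_uv(u, v):
    """Parametrization (2): coordinates (u, v) in the unit disc."""
    d = 1.0 + u * u + v * v
    return (2 * u / d, 2 * v / d, (1 - u * u - v * v) / d)

def norm_sq(x):
    """Squared Euclidean norm of a point in R^3."""
    return sum(c * c for c in x)

# Both maps should land on the unit sphere (squared norm equal to 1).
print(norm_sq(sphere_angles(0.7, 2.1)))
print(norm_sq(sphere_uv(0.3, -0.4)))
```

For (2) the identity follows from (2u)^2 + (2v)^2 + (1 − s)^2 = (1 + s)^2 with s = u^2 + v^2.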
Submanifolds
Definition: a submanifold M of a manifold S is a subset of S which itself has the structure of a manifold.
An open subset of an n-dimensional manifold forms an n-dimensional submanifold.
One way to construct an m-dimensional (m < n) submanifold: fix n − m coordinates.
Curves
Curve γ: I → S from some interval I (⊂ R) to S.
Examples: a curve on a sphere, in a set of probability distributions, or in a set of linear systems.
Using a coordinate system {ξ^i} to express the point γ(t) on the curve (where t ∈ I), we write γ^i(t) = ξ^i(γ(t)) and obtain γ̄(t) = [γ^1(t), ..., γ^n(t)].
C^∞ curves
C^∞: infinitely many times differentiable (sufficiently smooth).
If γ̄(t) is C^∞ for t ∈ I, we call γ a C^∞ curve on the manifold S.
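Tangent vector: when S is an open subset of R^n, the ordinary derivative of γ at t = a is
\[ \dot{\gamma}(a) = \lim_{h \to 0} \frac{\gamma(a+h) - \gamma(a)}{h} \tag{3} \]
In general, however, this difference is not defined, because the range of γ need not lie in a single linear space (for example, the range of γ in a color model).
Thus we use a more general "derivative" instead:
\[ \dot{\gamma}(a) = \sum_{i=1}^{n} \dot{\gamma}^i(a) \left( \frac{\partial}{\partial \xi^i} \right)_p \tag{4} \]
where p = γ(a), γ^i(t) = ξ^i ∘ γ(t), \dot{γ}^i(a) = (d/dt) γ^i(t)|_{t=a}, and (∂/∂ξ^i)_p is the operator which maps f ↦ (∂f/∂ξ^i)_p for a given function f: S → R.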
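To make the operator reading of (4) concrete, the following sketch applies the right-hand side of (4) to a test function f and compares it with the direct derivative d/dt f(γ(t)) at t = a. The curve `gamma`, the function `f`, and the finite-difference helper are hypothetical choices made only for illustration.

```python
import math

def gamma(t):
    """Hypothetical curve, given in coordinates (xi^1, xi^2)."""
    return (math.cos(t), t * t)

def f(xi):
    """Hypothetical smooth function f: S -> R, written in coordinates."""
    return xi[0] * xi[1] + math.sin(xi[1])

def deriv(g, t, h=1e-5):
    """Central finite difference of a scalar function g at t."""
    return (g(t + h) - g(t - h)) / (2 * h)

def tangent_applied(a):
    """Right-hand side of (4) applied to f: sum_i gamma_dot^i(a) * (df/dxi^i)_p."""
    p = gamma(a)
    total = 0.0
    for i in range(2):
        gdot_i = deriv(lambda t: gamma(t)[i], a)

        def f_along_i(s, i=i):
            # Vary only the i-th coordinate of p, keeping the other fixed.
            q = list(p)
            q[i] = s
            return f(q)

        total += gdot_i * deriv(f_along_i, p[i])
    return total

def direct(a):
    """Derivative of t -> f(gamma(t)) at t = a."""
    return deriv(lambda t: f(gamma(t)), a)

print(tangent_applied(0.8), direct(0.8))
```

The two numbers agree up to finite-difference error, which is just the chain rule expressed in the coordinates {ξ^i}.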
Tangent space
Tangent space at p: a hyperplane T_p(S) containing all the tangents of curves passing through the point p ∈ S (dim T_p(S) = dim S):
\[ T_p(S) = \left\{ \sum_{i=1}^{n} c^i \left( \frac{\partial}{\partial \xi^i} \right)_p \ \middle|\ [c^1, \dots, c^n] \in R^n \right\} \]
Vector fields
Vector field: a map which assigns to each point of a manifold S a tangent vector at that point.
Given a coordinate system {ξ^i} for an n-dimensional manifold, the operators ∂_i = ∂/∂ξ^i are vector fields for i = 1, ..., n.
Riemannian Metrics
Riemannian metric: an inner product ⟨D, D′⟩_p ∈ R defined for tangent vectors D, D′ ∈ T_p(S), such that the following conditions hold:
Linearity: ⟨aD + bD′, D″⟩_p = a⟨D, D″⟩_p + b⟨D′, D″⟩_p
Symmetry: ⟨D, D′⟩_p = ⟨D′, D⟩_p
Positive-definiteness: if D ≠ 0 then ⟨D, D⟩_p > 0
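Example: for the sphere parametrization
\[ x(\phi, \theta) = (\sin\phi\cos\theta,\ \sin\phi\sin\theta,\ \cos\phi), \quad 0 < \phi < \pi,\ 0 \le \theta < 2\pi \tag{5} \]
with the metric induced from the Euclidean inner product of R^3 (assumed here), g_ij = ⟨∂_i x, ∂_j x⟩, we have:
g_ϕϕ = 1, g_ϕθ = g_θϕ = 0, g_θθ = sin^2 ϕ.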
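A numerical sketch of this example, assuming the metric is the one induced from the Euclidean inner product of R^3: the components g_ij = ⟨∂_i x, ∂_j x⟩ are computed from finite-difference partial derivatives of the parametrization (5). Under that assumption the result should be close to diag(1, sin^2 ϕ). The helper names are illustrative.

```python
import math

def x(phi, theta):
    """Sphere parametrization (5)."""
    return (math.sin(phi) * math.cos(theta),
            math.sin(phi) * math.sin(theta),
            math.cos(phi))

def partial(i, phi, theta, h=1e-6):
    """Central-difference partial derivative of x w.r.t. coordinate i (0: phi, 1: theta)."""
    hi = x(phi + h, theta) if i == 0 else x(phi, theta + h)
    lo = x(phi - h, theta) if i == 0 else x(phi, theta - h)
    return [(a - b) / (2 * h) for a, b in zip(hi, lo)]

def induced_metric(phi, theta):
    """Matrix [g_ij] with g_ij = <d_i x, d_j x> (Euclidean inner product in R^3)."""
    d = [partial(0, phi, theta), partial(1, phi, theta)]
    return [[sum(a * b for a, b in zip(d[i], d[j])) for j in range(2)]
            for i in range(2)]

g = induced_metric(1.1, 0.4)
print(g)                      # numerically computed metric
print(math.sin(1.1) ** 2)     # expected g_theta_theta under the stated assumption
```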
Affine connection
Tangent space:
\[ T_p(S) = \left\{ \sum_{i=1}^{n} c^i \left( \frac{\partial}{\partial \xi^i} \right)_p \ \middle|\ [c^1, \dots, c^n] \in R^n \right\} \]
Let Π_{p,p′}: T_p(S) → T_{p′}(S) denote the linear map between neighboring tangent spaces specified by the affine connection.
If the difference between the coordinates of p and p′ is so small that we can ignore the second-order infinitesimals (dξ^i)(dξ^j), where dξ^i = ξ^i(p′) − ξ^i(p), then we can express the difference between Π_{p,p′}((∂_j)_p) and (∂_j)_{p′} as a linear combination of {dξ^1, ..., dξ^n}:
\[ \Pi_{p,p'}((\partial_j)_p) = (\partial_j)_{p'} - \sum_{i,k} d\xi^i \, (\Gamma^k_{ij})_p \, (\partial_k)_{p'} \tag{7} \]
Given a connection on the manifold S, the values of (Γ^k_ij)_p differ between coordinate systems. They describe how tangent vectors change across the manifold, and thus how the basis vectors change.
Example (cont.):
If we require the connection coefficients in polar coordinates to be zero, Γ^k_ij = 0 for i, j, k ∈ {r, ϕ}, then we can calculate the connection coefficients in Cartesian coordinates:
\[ \Gamma^x_{xx} = -\frac{\sin^2\phi\cos\phi}{r}, \qquad \Gamma^y_{xx} = \frac{\sin\phi\,(1+\cos^2\phi)}{r}, \]
\[ \Gamma^x_{xy} = \Gamma^x_{yx} = -\frac{\sin^3\phi}{r}, \qquad \Gamma^y_{xy} = \Gamma^y_{yx} = -\frac{\cos^3\phi}{r}, \]
\[ \Gamma^x_{yy} = \frac{\cos\phi\,(1+\sin^2\phi)}{r}, \qquad \Gamma^y_{yy} = -\frac{\sin\phi\cos^2\phi}{r}. \]
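The coefficients in this example can be checked numerically. The sketch below assumes the standard transformation rule for connection coefficients, which reduces to Γ^k_ij = Σ_a (∂x^k/∂u^a)(∂²u^a/∂x^i∂x^j) when the coefficients vanish in the coordinates u = (r, ϕ). Second derivatives are taken by finite differences; all function names are illustrative.

```python
import math

def polar(x, y):
    """Polar coordinates u = (r, phi) of the Cartesian point (x, y)."""
    return (math.hypot(x, y), math.atan2(y, x))

def second_partial(a, i, j, x, y, h=1e-4):
    """d^2 u^a / dx^i dx^j by central differences (index 0 is x, index 1 is y)."""
    def u(dx_i, dx_j):
        p = [x, y]
        p[i] += dx_i
        p[j] += dx_j
        return polar(p[0], p[1])[a]
    return (u(h, h) - u(h, -h) - u(-h, h) + u(-h, -h)) / (4 * h * h)

def gamma_cartesian(k, i, j, x, y):
    """Connection coefficient in Cartesian coordinates, assuming it vanishes in polar ones."""
    r, phi = polar(x, y)
    # jac[k][a] = d x^k / d u^a for x = r cos(phi), y = r sin(phi)
    jac = [[math.cos(phi), -r * math.sin(phi)],
           [math.sin(phi), r * math.cos(phi)]]
    return sum(jac[k][a] * second_partial(a, i, j, x, y) for a in range(2))

def closed_form(r, phi):
    """Closed-form values from the example, keyed by (k, i, j) with 0 = x, 1 = y."""
    s, c = math.sin(phi), math.cos(phi)
    return {
        (0, 0, 0): -s * s * c / r,
        (1, 0, 0): s * (1 + c * c) / r,
        (0, 0, 1): -s ** 3 / r,
        (1, 0, 1): -c ** 3 / r,
        (0, 1, 1): c * (1 + s * s) / r,
        (1, 1, 1): -s * c * c / r,
    }

r0, phi0 = 2.0, 0.6
x0, y0 = r0 * math.cos(phi0), r0 * math.sin(phi0)
for (k, i, j), value in closed_form(r0, phi0).items():
    print((k, i, j), value, gamma_cartesian(k, i, j, x0, y0))
```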
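Metric connection
Definition: if for all vector fields X, Y, Z ∈ T(S),
\[ Z\langle X, Y \rangle = \langle \nabla_Z X, Y \rangle + \langle X, \nabla_Z Y \rangle \]
where Z⟨X, Y⟩ denotes the derivative of the function ⟨X, Y⟩ along the vector field Z, we say that ∇ is a metric connection w.r.t. g.
Equivalent condition: for all basis vector fields ∂_i, ∂_j, ∂_k ∈ T(S),
\[ \partial_k \langle \partial_i, \partial_j \rangle = \langle \nabla_{\partial_k} \partial_i, \partial_j \rangle + \langle \partial_i, \nabla_{\partial_k} \partial_j \rangle \]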
Levi-Civita connection
For a given connection, when Γ^k_ij = Γ^k_ji holds for all i, j and k, we call it a symmetric connection or torsion-free connection.
From ∇_{∂_i} ∂_j = Σ_{k=1}^n Γ^k_ij ∂_k, we know that for a symmetric connection:
\[ \nabla_{\partial_i} \partial_j = \nabla_{\partial_j} \partial_i \]
If a connection is both metric and symmetric, we call it the Riemannian connection or the Levi-Civita connection w.r.t. g.
Flatness
S is flat w.r.t. the connection ∇ if an affine coordinate system exists for ∇.
A coordinate system {ξ^i} is affine for ∇ when the following equivalent conditions hold:
∇_{∂_i} ∂_j = Σ_{k=1}^n Γ^k_ij ∂_k = 0 for all i and j
Γ^k_ij = 0 for all i, j and k
Curvature
Curvature R = 0 iff parallel translation does not depend on the choice of curve.
Curvature is independent of the coordinate system. Under the Riemannian connection, we can calculate:
Curvature of the 2-dimensional plane: R = 0;
Curvature of a sphere of radius r in 3-dimensional space: R = 2/r^2.
Autoparallel submanifold
Geodesics
Geodesics (autoparallel curves): curves whose tangent vector is transported along the curve by parallel translation.
Examples under the Riemannian connection:
2-dimensional flat plane: straight line
Sphere in 3-dimensional space: great circle
The geodesics with respect to the Riemannian connection are known to coincide (locally) with the shortest curves joining two points.
Shortest curve: the curve with the shortest length.
Length of a curve γ: [a, b] → S:
\[ \|\gamma\| = \int_a^b \left\| \frac{d\gamma}{dt} \right\| dt = \int_a^b \sqrt{g_{ij}\, \dot{\gamma}^i \dot{\gamma}^j}\ dt \tag{22} \]
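The length integral (22) can be illustrated on the unit sphere. The sketch below assumes the metric diag(1, sin^2 ϕ) in coordinates (ϕ, θ), which is the metric induced from R^3, and compares a path along a circle of latitude with the great-circle distance between the same endpoints. The function names, the chosen endpoints, and the midpoint-rule integrator are illustrative.

```python
import math

def curve_length(curve, a, b, n=2000, h=1e-6):
    """Midpoint-rule approximation of (22) for t -> (phi(t), theta(t)),
    using the assumed sphere metric diag(1, sin^2 phi)."""
    dt = (b - a) / n
    total = 0.0
    for k in range(n):
        t = a + (k + 0.5) * dt
        phi, _ = curve(t)
        phi_hi, theta_hi = curve(t + h)
        phi_lo, theta_lo = curve(t - h)
        dphi = (phi_hi - phi_lo) / (2 * h)
        dtheta = (theta_hi - theta_lo) / (2 * h)
        total += math.sqrt(dphi ** 2 + math.sin(phi) ** 2 * dtheta ** 2) * dt
    return total

def latitude_path(t):
    """Path along the circle of latitude phi = pi/4, with theta = t."""
    return (math.pi / 4, t)

def embed(phi, theta):
    """Point of the unit sphere in R^3 with coordinates (phi, theta)."""
    return (math.sin(phi) * math.cos(theta),
            math.sin(phi) * math.sin(theta),
            math.cos(phi))

def great_circle_distance(p, q):
    """Arc length of the great circle joining p and q (central angle on the unit sphere)."""
    dot = sum(a * b for a, b in zip(embed(*p), embed(*q)))
    return math.acos(max(-1.0, min(1.0, dot)))

start, end = (math.pi / 4, 0.0), (math.pi / 4, math.pi / 2)
along_latitude = curve_length(latitude_path, 0.0, math.pi / 2)
along_great_circle = great_circle_distance(start, end)
print(along_latitude, along_great_circle)
```

The latitude path is longer than the great-circle arc, consistent with great circles being the geodesics of the sphere.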
Part II
Motivation
Consider the set of probability distributions as a manifold.
Analyze the relationship between the geometric structure of the manifold and statistical estimation.
Statistical models
\[ P(\mathcal{X}) = \left\{ p : \mathcal{X} \to R \ \middle|\ p(x) > 0\ (\forall x \in \mathcal{X}),\ \int p(x)\,dx = 1 \right\} \tag{23} \]
Basic concepts
The Fisher metric and α-connection
Exponential family
Divergence and Geometric statistical inference
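Fisher information matrix G(ξ) = [g_ij(ξ)] of the model S = {p_ξ} at the point ξ:
\[ g_{ij}(\xi) = E_\xi[\partial_i \ell_\xi \, \partial_j \ell_\xi] \]
where ∂_i = ∂/∂ξ^i, ℓ_ξ = ℓ(x; ξ) = log p(x; ξ), and E_ξ denotes the expectation w.r.t. the distribution p_ξ.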
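As an illustration, the sketch below estimates the Fisher information matrix of the normal model in the coordinates ξ = (μ, σ), assuming the standard definition g_ij = E_ξ[∂_i ℓ_ξ ∂_j ℓ_ξ]. The expectation is approximated by a midpoint rule over x and the score by finite differences; for this model the known closed form is diag(1/σ^2, 2/σ^2). The function names and numerical settings are illustrative.

```python
import math

def log_p(x, mu, sigma):
    """Log-density of the normal distribution N(mu, sigma^2)."""
    return (-0.5 * math.log(2 * math.pi) - math.log(sigma)
            - (x - mu) ** 2 / (2 * sigma ** 2))

def score(x, xi, h=1e-5):
    """Gradient of log p(x; xi) w.r.t. xi = (mu, sigma), by central differences."""
    out = []
    for i in range(2):
        up, dn = list(xi), list(xi)
        up[i] += h
        dn[i] -= h
        out.append((log_p(x, *up) - log_p(x, *dn)) / (2 * h))
    return out

def fisher(xi, n=4000, width=12.0):
    """Approximate g_ij = E[d_i l * d_j l] by integrating over mu +/- width * sigma."""
    mu, sigma = xi
    lo = mu - width * sigma
    dx = 2 * width * sigma / n
    g = [[0.0, 0.0], [0.0, 0.0]]
    for k in range(n):
        x = lo + (k + 0.5) * dx
        w = math.exp(log_p(x, mu, sigma)) * dx
        s = score(x, xi)
        for i in range(2):
            for j in range(2):
                g[i][j] += w * s[i] * s[j]
    return g

print(fisher((1.0, 2.0)))  # compare with diag(1/sigma^2, 2/sigma^2) = diag(0.25, 0.5)
```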
Motivation: sufficient statistic and Cramér-Rao bound
Sufficient statistic
Sufficient statistic: let Y = F(X), and factor the distribution p(x; ξ) of X as p(x; ξ) = q(F(x); ξ) r(x; ξ). If r(x; ξ) does not depend on ξ for all x, we say that F is a sufficient statistic for the model S. Then we can write p(x; ξ) = q(y; ξ) r(x), where y = F(x).
A sufficient statistic is a function whose value contains all the information needed to compute any estimate of the parameter (e.g. a maximum likelihood estimate).
Cramér-Rao inequality
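The variance of any unbiased estimator is at least as high as the inverse of the Fisher information.
Unbiased estimator ξ̂: E_ξ[ξ̂(X)] = ξ.
The variance-covariance matrix V_ξ[ξ̂] = [v^{ij}_ξ], where
\[ v^{ij}_\xi = E_\xi\left[ (\hat{\xi}^i(X) - \xi^i)(\hat{\xi}^j(X) - \xi^j) \right] \]
The inequality then reads V_ξ[ξ̂] ≥ G(ξ)^{-1}, in the sense that V_ξ[ξ̂] − G(ξ)^{-1} is positive semidefinite, where G(ξ) is the Fisher information matrix.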
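A small worked case of the bound: for n independent Bernoulli(p) observations, the sample mean is an unbiased estimator of p, and its variance can be computed exactly from the binomial distribution. The sketch compares that variance with the inverse Fisher information, using the known value n / (p(1 − p)) for this model. The example model and function names are chosen for illustration and do not come from the slides.

```python
import math

def estimator_moments(p, n):
    """Exact mean and variance of the sample mean of n Bernoulli(p) draws,
    obtained by summing over the binomial distribution of the success count."""
    mean = 0.0
    second = 0.0
    for k in range(n + 1):
        w = math.comb(n, k) * p ** k * (1 - p) ** (n - k)
        est = k / n
        mean += w * est
        second += w * est * est
    return mean, second - mean * mean

def fisher_information(p, n):
    """Fisher information of n i.i.d. Bernoulli(p) observations."""
    return n / (p * (1 - p))

mean, var = estimator_moments(0.3, 20)
print(mean, var, 1 / fisher_information(0.3, 20))
```

In this case the estimator attains the bound: its variance equals the inverse Fisher information.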
α-connection
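Let S = {p_ξ} be an n-dimensional model, and consider the function Γ^{(α)}_{ij,k} which maps each point ξ to the following value:
\[ (\Gamma^{(\alpha)}_{ij,k})_\xi = E_\xi\left[ \left( \partial_i \partial_j \ell_\xi + \frac{1-\alpha}{2}\, \partial_i \ell_\xi \, \partial_j \ell_\xi \right) (\partial_k \ell_\xi) \right] \tag{24} \]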
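Properties of the α-connection
The α-connection is a symmetric connection.
Relationship between the α-connection and the β-connection:
\[ \Gamma^{(\beta)}_{ij,k} = \Gamma^{(0)}_{ij,k} - \frac{\beta}{2}\, E_\xi[\partial_i \ell_\xi \, \partial_j \ell_\xi \, \partial_k \ell_\xi] \]
\[ \nabla^{(\alpha)} = \frac{1+\alpha}{2}\, \nabla^{(1)} + \frac{1-\alpha}{2}\, \nabla^{(-1)} \]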
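The two relations above can be checked on a one-parameter model. The sketch below uses the Bernoulli model p(x; ξ) = ξ^x (1 − ξ)^(1−x), x ∈ {0, 1}, chosen here only as an illustration, and evaluates the single coefficient Γ^{(α)}_{11,1} from (24) by summing over the two outcomes. The function names are illustrative.

```python
def moments(xi):
    """Return (E[d2l * dl], E[dl^3]) for the Bernoulli(xi) model,
    where l = log p(x; xi) and derivatives are taken w.r.t. xi."""
    a = 0.0
    b = 0.0
    for x, px in ((0, 1 - xi), (1, xi)):
        dl = x / xi - (1 - x) / (1 - xi)
        d2l = -x / xi ** 2 - (1 - x) / (1 - xi) ** 2
        a += px * d2l * dl
        b += px * dl ** 3
    return a, b

def gamma_alpha(xi, alpha):
    """Coefficient Gamma^(alpha)_{11,1} of equation (24) for the Bernoulli model."""
    a, b = moments(xi)
    return a + (1 - alpha) / 2 * b

xi, alpha = 0.3, 0.4
mixture = ((1 + alpha) / 2 * gamma_alpha(xi, 1.0)
           + (1 - alpha) / 2 * gamma_alpha(xi, -1.0))
print(gamma_alpha(xi, alpha), mixture)  # the two values should coincide
print(gamma_alpha(xi, -1.0))            # close to zero in this parametrization
```

In this particular parametrization the α = −1 coefficient comes out as zero, which is consistent with ξ being a mixture parameter of the Bernoulli model.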
Exponential family
\[ p(x; \theta) = \exp\left[ C(x) + \sum_{i=1}^{n} \theta^i F_i(x) - \psi(\theta) \right] \]
Example: normal distribution
\[ p(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \tag{26} \]
where C(x) = 0, F_1(x) = x, F_2(x) = x^2, and θ^1 = μ/σ^2, θ^2 = −1/(2σ^2) are the natural parameters. The potential function is:
\[ \psi = -\frac{(\theta^1)^2}{4\theta^2} + \frac{1}{2}\log\left(-\frac{\pi}{\theta^2}\right) = \frac{\mu^2}{2\sigma^2} + \log(\sqrt{2\pi}\,\sigma) \tag{27} \]
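The following sketch checks numerically that the exponential-family form, with the natural parameters and potential function (27), reproduces the normal density (26). The function names are illustrative.

```python
import math

def natural_params(mu, sigma):
    """Natural parameters (theta^1, theta^2) of N(mu, sigma^2)."""
    return mu / sigma ** 2, -1.0 / (2 * sigma ** 2)

def psi(t1, t2):
    """Potential function (27), written in the natural parameters."""
    return -t1 ** 2 / (4 * t2) + 0.5 * math.log(-math.pi / t2)

def exp_family_pdf(x, t1, t2):
    """exp[C(x) + theta^1 F_1(x) + theta^2 F_2(x) - psi] with C = 0, F_1 = x, F_2 = x^2."""
    return math.exp(t1 * x + t2 * x * x - psi(t1, t2))

def normal_pdf(x, mu, sigma):
    """Normal density (26)."""
    return (math.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
            / (math.sqrt(2 * math.pi) * sigma))

mu, sigma = 1.5, 0.8
t1, t2 = natural_params(mu, sigma)
for x in (-1.0, 0.0, 2.3):
    print(exp_family_pdf(x, t1, t2), normal_pdf(x, mu, sigma))
```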
Mixture family
\[ p(x; \theta) = C(x) + \sum_{i=1}^{n} \theta^i F_i(x) \]
In this case we say that S is a mixture family and [θ^i] are called the mixture parameters.
Dual connection
Definition: Let S be a manifold on which there is given a Riemannian metric g and two affine connections ∇ and ∇*. If for all vector fields X, Y, Z ∈ T(S),
\[ Z\langle X, Y \rangle = \langle \nabla_Z X, Y \rangle + \langle X, \nabla^*_Z Y \rangle \]
holds, we say that ∇ and ∇* are duals of each other w.r.t. g and call one the dual connection of the other.
In addition, we call the triple (g, ∇, ∇*) a dualistic structure on S.
Properties
For any statistical model, the α-connection and the (−α)-connection are dual with respect to the Fisher metric.
R = 0 ⇔ R* = 0, where R and R* are the curvature tensors of ∇ and ∇*.
Suppose two coordinate systems [θ^i] and [η_j] on S satisfy
\[ \langle \partial_i, \partial^j \rangle = \delta_i^j \]
where ∂_i = ∂/∂θ^i and ∂^j = ∂/∂η_j.
Then we say the two coordinate systems are mutually dual w.r.t. the metric g, and call one the dual coordinate system of the other.
Legendre transformations
Consider mutually dual coordinate systems [θ^i] and [η_i] with functions ψ: S → R and ϕ: S → R which satisfy the following equations:
∂_i ψ = η_i
∂^i ϕ = θ^i
g_ij = ∂_i η_j = ∂_j η_i = ∂_i ∂_j ψ
ϕ(η) = max_θ {θ^i η_i − ψ(θ)}
ψ(θ) = max_η {θ^i η_i − ϕ(η)}
Examples:
The Legendre transform of f(x) = (1/p)|x|^p (where 1 < p < ∞) is f*(x*) = (1/q)|x*|^q (where 1 < q < ∞ and 1/p + 1/q = 1).
The Legendre transform of f(x) = e^x is f*(x*) = x* ln x* − x* (where x* > 0).
The Legendre transform of f(x) = (1/2) x^T A x is f*(x*) = (1/2) x*^T A^{-1} x* (for symmetric positive-definite A).
The Legendre transform of f(x) = |x| is f*(x*) = 0 if |x*| ≤ 1, and f*(x*) = ∞ if |x*| > 1.
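These one-dimensional examples can be reproduced numerically by evaluating f*(x*) = max_x {x x* − f(x)} on a grid. The sketch below is a brute-force approximation on a bounded interval, with illustrative grid bounds and resolution, so it can only hint at the value ∞ in the last example.

```python
import math

def legendre(f, y, lo=-10.0, hi=10.0, n=20001):
    """Approximate f*(y) = max_x (x * y - f(x)) over a uniform grid on [lo, hi]."""
    step = (hi - lo) / (n - 1)
    return max((lo + k * step) * y - f(lo + k * step) for k in range(n))

def cubic(x):
    """f(x) = |x|^p / p with p = 3, whose conjugate exponent is q = 3/2."""
    return abs(x) ** 3 / 3

print(legendre(math.exp, 2.0), 2 * math.log(2) - 2)   # f(x) = e^x at x* = 2
print(legendre(cubic, 2.0), 2 ** 1.5 / 1.5)           # f(x) = |x|^3 / 3 at x* = 2
print(legendre(abs, 0.5))                             # f(x) = |x| at |x*| <= 1
print(legendre(abs, 2.0))                             # grows with the grid bound when |x*| > 1
```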
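Example: for the normal distribution, the potential function is
\[ \psi = -\frac{(\theta^1)^2}{4\theta^2} + \frac{1}{2}\log\left(-\frac{\pi}{\theta^2}\right) = \frac{\mu^2}{2\sigma^2} + \log(\sqrt{2\pi}\,\sigma) \tag{31} \]
The dual parameters are calculated as
\[ \eta_1 = \frac{\partial\psi}{\partial\theta^1} = \mu = -\frac{\theta^1}{2\theta^2}, \qquad \eta_2 = \frac{\partial\psi}{\partial\theta^2} = \mu^2 + \sigma^2 = \frac{(\theta^1)^2 - 2\theta^2}{4(\theta^2)^2} \]
The dual potential function is:
\[ \varphi = -\frac{1}{2}\left( 1 + \log\left(-\frac{\pi}{\theta^2}\right) \right) = -\frac{1}{2}\left( 1 + \log(2\pi) + 2\log\sigma \right) \tag{32} \]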
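The dual parameters and the dual potential of the normal family can be verified numerically. The sketch below differentiates ψ by finite differences to obtain η = (∂ψ/∂θ^1, ∂ψ/∂θ^2), and then forms ϕ = θ^i η_i − ψ, which is the value of the Legendre maximum at the dual point. The function names and test values are illustrative.

```python
import math

def psi(t1, t2):
    """Potential function of the normal family in natural parameters."""
    return -t1 ** 2 / (4 * t2) + 0.5 * math.log(-math.pi / t2)

def grad_psi(t1, t2, h=1e-6):
    """Central-difference gradient of psi, giving the dual parameters (eta_1, eta_2)."""
    d1 = (psi(t1 + h, t2) - psi(t1 - h, t2)) / (2 * h)
    d2 = (psi(t1, t2 + h) - psi(t1, t2 - h)) / (2 * h)
    return d1, d2

mu, sigma = 1.5, 0.8
t1, t2 = mu / sigma ** 2, -1.0 / (2 * sigma ** 2)

eta1, eta2 = grad_psi(t1, t2)
phi_legendre = t1 * eta1 + t2 * eta2 - psi(t1, t2)
phi_closed = -0.5 * (1 + math.log(2 * math.pi) + 2 * math.log(sigma))

print(eta1, mu)                      # eta_1 should equal mu
print(eta2, mu ** 2 + sigma ** 2)    # eta_2 should equal mu^2 + sigma^2
print(phi_legendre, phi_closed)      # both expressions for the dual potential
```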
Yunshu Liu (ASPITRG), Introduction to Information Geometry
Divergences
Kullback-Leibler divergence
Bregman divergence
Examples:
F(x) = ‖x‖^2, then B_F(x‖y) = ‖x − y‖^2.
More generally, if F(x) = (1/2) x^T A x, then B_F(x‖y) = (1/2) (x − y)^T A (x − y).
KL divergence: if F(x) = Σ_i (x_i log x_i − x_i), we get the Kullback-Leibler divergence.
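The examples above can be reproduced with a few lines of code, assuming the standard definition of the Bregman divergence, B_F(x‖y) = F(x) − F(y) − ⟨∇F(y), x − y⟩. The sketch checks the squared-norm case and the case F(x) = Σ_i (x_i log x_i − x_i), which for probability vectors reduces to the Kullback-Leibler divergence. The function names are illustrative.

```python
import math

def bregman(F, grad_F, x, y):
    """B_F(x || y) = F(x) - F(y) - <grad F(y), x - y> (standard definition, assumed here)."""
    inner = sum(g * (a - b) for g, a, b in zip(grad_F(y), x, y))
    return F(x) - F(y) - inner

def sq_norm(x):
    return sum(a * a for a in x)

def sq_norm_grad(x):
    return [2 * a for a in x]

def neg_entropy(x):
    """F(x) = sum_i (x_i log x_i - x_i), for vectors with positive entries."""
    return sum(a * math.log(a) - a for a in x)

def neg_entropy_grad(x):
    return [math.log(a) for a in x]

def kl(p, q):
    """Kullback-Leibler divergence between two discrete distributions."""
    return sum(a * math.log(a / b) for a, b in zip(p, q))

print(bregman(sq_norm, sq_norm_grad, [1.0, 2.0], [3.0, 0.5]))  # ||x - y||^2 = 6.25
p, q = [0.2, 0.5, 0.3], [0.4, 0.4, 0.2]
print(bregman(neg_entropy, neg_entropy_grad, p, q), kl(p, q))
```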
Canonical divergence
Properties:
Relation between the (g, ∇)-divergence D and the (g, ∇*)-divergence D*:
\[ D^*(p\|q) = D(q\|p) \]
If M is an autoparallel submanifold w.r.t. either ∇ or ∇*, then the (g_M, ∇_M)-divergence D_M = D|_{M×M} is given by D_M(p‖q) = D(p‖q).
If ∇ is a Riemannian connection (∇ = ∇*) which is flat on S, there exists a coordinate system which is self-dual (θ^i = η_i), with ϕ = ψ = (1/2) Σ_i (θ^i)^2. The canonical divergence is then
\[ D(p\|q) = \frac{1}{2}\{d(p, q)\}^2, \qquad d(p, q) = \sqrt{\sum_i \{\theta^i(p) - \theta^i(q)\}^2} \]
Triangular relation
Let {[θ^i], [η_i]} be mutually dual affine coordinate systems of a dually flat space (S, g, ∇, ∇*), and let D be a divergence on S. Then a necessary and sufficient condition for D to be the (g, ∇)-divergence is that for all p, q, r ∈ S the following triangular relation holds:
\[ D(p\|q) + D(q\|r) - D(p\|r) = \{\theta^i(p) - \theta^i(q)\}\{\eta_i(r) - \eta_i(q)\} \]
Pythagorean relation
Let p, q, and r be three points in S. Let γ_1 be the ∇-geodesic connecting p and q, and let γ_2 be the ∇*-geodesic connecting q and r. If at the intersection q the curves γ_1 and γ_2 are orthogonal (with respect to the inner product g), then we have the following Pythagorean relation:
\[ D(p\|r) = D(p\|q) + D(q\|r) \]
Projection theorem
Let p be a point in S and let M be a submanifold of S which is ∇*-autoparallel. Then a necessary and sufficient condition for a point q in M to satisfy
\[ D(p\|q) = \min_{r \in M} D(p\|r) \tag{39} \]
is for the ∇-geodesic connecting p and q to be orthogonal to M at q.
Examples
From the definitions of the exponential family and the mixture family, a product of exponential families is still an exponential family, and a sum of mixture families is still a mixture family.
e-flat submanifold: the set of all product distributions:
\[ E_0 = \left\{ p_X \ \middle|\ p_X(x_1, \dots, x_N) = \prod_{i=1}^{N} p_{X_i}(x_i) \right\} \]
m-flat submanifold: the set of joint distributions with given marginals:
\[ M_0 = \left\{ p_X \ \middle|\ \sum_{X \setminus i} p_X(x) = q_i(x_i)\ \ \forall i \in \{1, \dots, N\} \right\} \]
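A small numerical sketch connects these examples with the projection and Pythagorean statements, under the assumption that the divergence D is taken to be the Kullback-Leibler divergence. For a joint distribution p of two binary variables, the product of its marginals q lies in E_0, and for any other product distribution r the relation D(p‖r) = D(p‖q) + D(q‖r) holds, so q minimizes D(p‖·) over E_0. The specific numbers and function names are illustrative.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence between two discrete distributions (assumed choice of D)."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def product(px, py):
    """Joint pmf of independent X, Y, flattened as [p(0,0), p(0,1), p(1,0), p(1,1)]."""
    return [a * b for a in px for b in py]

def marginals(p):
    """Marginal pmfs of X and Y for a flattened 2x2 joint pmf."""
    return [p[0] + p[1], p[2] + p[3]], [p[0] + p[2], p[1] + p[3]]

p = [0.4, 0.1, 0.2, 0.3]                    # joint distribution, not a product
q = product(*marginals(p))                  # product of the marginals of p, a point of E_0
r = product([0.7, 0.3], [0.25, 0.75])       # an arbitrary other point of E_0

print(kl(p, r), kl(p, q) + kl(q, r))        # Pythagorean relation: both sides agree
print(kl(p, q) <= kl(p, r))                 # q is at least as close to p as r is
```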
Thanks!
Questions?