

From
geometric learning machines
to the
geometry of AI
Frank Nielsen
@FrnkNlsn
21st November 2019
Outline of this talk: computational geometry for ML!

• Introduction to some famous geometric learning machines


• Unsupervised learning: Riemannian manifold learning
• Supervised learning:
• Kernel machines and Hilbert geometry (RKHS)
• Deep learning and trajectories on neuromanifolds
• Information geometry and information projections:
• The dualistic geometric structures
• Geometry of interpolation machines:
• Double descent learning curves
Non-linear versus linear dimension reduction
• Many non-linear techniques:
• LLE,
• ISOMAP
• etc.
• In very high dimensions, use linear random projections (to cope with the curse
of dimensionality): Johnson-Lindenstrauss theorem (1984); see the sketch below.

Introduction to HPC with MPI for Data Science, Springer, 2016
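As a concrete illustration of the linear route, here is a minimal sketch (not from the slides) of a Johnson-Lindenstrauss-style random projection in NumPy; the target dimension k depends only on the number of points n and the distortion ε, not on the ambient dimension d.

```python
# Sketch: Gaussian random projection in the spirit of the Johnson-Lindenstrauss
# lemma; pairwise distances are preserved up to a (1 +/- eps) factor with high
# probability when k is on the order of log(n) / eps^2.
import numpy as np

rng = np.random.default_rng(0)
n, d, eps = 1000, 10_000, 0.25
k = int(np.ceil(4 * np.log(n) / eps**2))      # classical JL target dimension

X = rng.standard_normal((n, d))               # high-dimensional point cloud
R = rng.standard_normal((d, k)) / np.sqrt(k)  # random linear map, scaled
Y = X @ R                                     # low-dimensional embedding

i, j = 0, 1                                   # spot-check one pair of points
ratio = np.linalg.norm(Y[i] - Y[j]) / np.linalg.norm(X[i] - X[j])
print(f"distance ratio after projection: {ratio:.3f}")
```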


Riemannian manifolds: Extrinsic view vs intrinsic view
Visualized extrinsically as smooth surfaces of the ambient Euclidean
space: isometric embedding theorem (Nash embedding theorem)
Isometric embedding: extrinsic geometry

Manifold learning/reconstruction from data points (Swiss roll): intrinsic geometry

Intrinsic geometry versus extrinsic isometric embedding


Kernel Support Vector Machines (SVMs)
Linear separator Non-linear separator
Support vector

Feature map

Inner product
(Hilbert space)

Kernel machines
1992
1970 (RKHS geometry)
SVM: the dual quadratic program amounts to solving for a
smallest enclosing ball (SEB): computational geometry!

Smallest enclosing ball: the smallest ball with respect to radius or set inclusion
Approximating Smallest Enclosing Balls with Applications to Machine Learning, IJGA 2009
Coresets for Smallest Enclosing Balls and Core VMs
• Definition: a coreset C ⊆ P is a subset such that the smallest enclosing ball of C, scaled by a factor (1+ε), covers P [BC 2002]

The coreset size is independent of the input size n and of the dimension d: it depends only on ε.
(Selected points shown inside boxes.)

Applies to finite point sets in infinite-dimensional spaces too!


-> Kernel machines with Core Vector Machines
A note on kernelizing the smallest enclosing ball for machine learning, 2017
Introduction to HPC with MPI for Data Science, Springer, 2016
Computing a coreset for the smallest enclosing ball
An extremely simple algorithm!
#iterations: O(1/ε²) (independent of n and d)

Running time: O(dn/ε²)

On approximating the Riemannian 1-center. Comput. Geom. 46(1): 93-104 (2013)


Approximating Covering and Minimum Enclosing Balls in Hyperbolic Geometry. GSI 2015
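A minimal sketch of the simple iterative scheme alluded to above (in the style of Bădoiu-Clarkson), assuming Euclidean points stored in a NumPy array: the center repeatedly walks towards the current farthest point with a 1/(t+1) step, and roughly 1/ε² iterations yield a (1+ε)-approximation of the smallest enclosing ball.

```python
# Sketch of the iterative smallest-enclosing-ball approximation:
# move the center a 1/(t+1) fraction towards the current farthest point.
import numpy as np

def approx_seb(points: np.ndarray, eps: float = 0.05):
    c = points[0].copy()                            # arbitrary initial center
    for t in range(1, int(np.ceil(1.0 / eps**2)) + 1):
        farthest = points[np.argmax(np.linalg.norm(points - c, axis=1))]
        c += (farthest - c) / (t + 1)               # shrinking step size
    radius = np.linalg.norm(points - c, axis=1).max()
    return c, radius

pts = np.random.default_rng(1).standard_normal((500, 3))
center, radius = approx_seb(pts, eps=0.05)
print(center, radius)
```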
Approximating the kernelized minimum enclosing ball
Kernel with feature map
(D may be infinite)
Trick: Encode implicitly the circumcenter of the enclosing ball as a
convex combination of the data points:

Update weights iteratively:


Index of the current farthest point
Applications: Support Vector Data Description (SVDD)
A note on kernelizing the smallest enclosing ball for machine learning, 2017
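The weight update described above can be sketched as follows (a Gaussian kernel is used here purely for illustration): the circumcenter is never materialized, only the convex weights α over the data points are updated, and all distances are evaluated through the Gram matrix.

```python
# Sketch of the kernelized center update: c = sum_i alpha_i phi(x_i) is kept
# implicitly; squared feature-space distances are computed from the Gram matrix.
import numpy as np

def gaussian_gram(X, sigma=1.0):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma**2))

def kernel_meb_weights(K: np.ndarray, eps: float = 0.05):
    n = K.shape[0]
    alpha = np.zeros(n); alpha[0] = 1.0            # center starts at phi(x_0)
    for t in range(1, int(np.ceil(1.0 / eps**2)) + 1):
        d2 = np.diag(K) - 2 * K @ alpha + alpha @ K @ alpha  # ||phi(x_j) - c||^2
        m = int(np.argmax(d2))                     # index of current farthest point
        alpha *= t / (t + 1)                       # shrink all weights ...
        alpha[m] += 1 / (t + 1)                    # ... and move towards phi(x_m)
    return alpha                                   # convex weights encoding the center

X = np.random.default_rng(2).standard_normal((200, 2))
alpha = kernel_meb_weights(gaussian_gram(X))
print(alpha.sum())                                 # remains 1: a convex combination
```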
The 1-layer perceptron: linear separator machine
Frank Rosenblatt:
Principles of Neurodynamics, 1962

Marvin Minsky and Seymour Papert:


Perceptrons: An Introduction to Computational Geometry, 1969
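For completeness, a minimal sketch (not Rosenblatt's original notation) of the perceptron update rule that learns such a linear separator: misclassified points pull the weight vector towards their correct side.

```python
# Sketch of the perceptron learning rule for labels y in {-1, +1}.
import numpy as np

def perceptron(X, y, epochs=100, lr=1.0):
    Xb = np.hstack([X, np.ones((len(X), 1))])      # append a constant bias feature
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(Xb, y):
            if yi * (w @ xi) <= 0:                 # misclassified (or on the boundary)
                w += lr * yi * xi                  # shift the separator towards xi
    return w

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.5], [3.0, 3.0]])
y = np.array([-1, -1, 1, 1])
print(perceptron(X, y))                            # weights and bias of the separator
```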
Stochastic MLPs and neuromanifolds
Neurodynamics = learning trajectory on the manifold

Parameter space

Hidden layers: universal function approximators (post XOR)


Supervised learning: gradient descent + backpropagation
Global geometric objects (a smooth manifold M) versus local descriptions (M in local chart coordinates).
A manifold is locally Euclidean: each chart is a homeomorphism Φ_U : U → R^n, an atlas is a set of charts, and overlapping charts are glued by transition maps Φ_UV = Φ_V ∘ Φ_U^{-1} on Φ_U(U ∩ V).
(Compare with UV mapping in computer graphics.)
Fisher information matrix (FIM)
$$g_{ij}(\xi) = E_\xi\!\left[\frac{\partial}{\partial \xi^i}\log p_\xi \; \frac{\partial}{\partial \xi^j}\log p_\xi\right] = \int \frac{\partial}{\partial \xi^i}\log p_\xi(x)\, \frac{\partial}{\partial \xi^j}\log p_\xi(x)\; p_\xi(x)\,\mathrm{d}x$$

The FIM is positive semi-definite, and positive definite for regular models.
(Sir Ronald Fisher, 1922)
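A small numerical illustration (a sketch, not from the slides) of the definition above: Monte Carlo estimation of the FIM of a univariate normal N(μ, σ²), compared with the known closed form diag(1/σ², 2/σ²).

```python
# Estimate g(xi) = E[ d_i log p * d_j log p ] by averaging outer products of the score.
import numpy as np

mu, sigma = 0.5, 2.0
x = np.random.default_rng(3).normal(mu, sigma, 1_000_000)

score_mu = (x - mu) / sigma**2                        # d/dmu log p(x)
score_sigma = ((x - mu) ** 2 - sigma**2) / sigma**3   # d/dsigma log p(x)
S = np.stack([score_mu, score_sigma])

print(S @ S.T / len(x))                               # ~ [[0.25, 0], [0, 0.5]]
print(np.diag([1 / sigma**2, 2 / sigma**2]))          # exact FIM
```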
Fisher-Rao geometry and geodesic distance
Fisher information metric (FIM)

invariant under reparameterization of θ

Riemannian geodesics locally minimize lengths
Riemannian geometry of normal distributions
equipped with FIM: hyperbolic geometry

Metric tensor: measures vector lengths and angles on the tangent space

Pseudo-sphere (constant negative curvature -1/2)

Pattern recognition in nuclear fusion data by means of geometric methods in probabilistic spaces, 2017
Cramér-Rao lower bound: inverse of the Fisher information
Löwner partial ordering on positive-semi-definite matrices:

CRLB theorem: the accuracy of an unbiased estimator is lower-bounded (in the Löwner ordering) by the inverse Fisher information, a bound that depends on the true parameter. Under regularity conditions:

$$V_\theta[\hat{\theta}(X)] \succeq I^{-1}(\theta)$$

Cramer-Rao lower bound and information geometry. Connected at Infinity II, 2013.
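A quick numerical illustration (a sketch under the stated normal model): for estimating the mean of N(μ, σ²) from n i.i.d. samples, the sample mean attains the Cramér-Rao bound σ²/n.

```python
# Compare the empirical variance of the sample-mean estimator with the CRLB sigma^2 / n.
import numpy as np

rng = np.random.default_rng(4)
mu, sigma, n, trials = 1.0, 2.0, 50, 20_000
estimates = rng.normal(mu, sigma, (trials, n)).mean(axis=1)   # MLE of mu per trial
print(estimates.var(), sigma**2 / n)                          # close to the lower bound
```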
Calculating Rao's distance is often intractable

• Need to solve the Ordinary Differential Equation (ODE) to find the geodesic
(but trivial in 1D):

• Need to integrate the infinitesimal length elements along the geodesics…

No closed-form Fisher-Rao distance between multivariate normals!


-> geodesic shooting (BVP: boundary value problem, IVP: initial value problem)
Using the Fisher Information Matrix without geodesics?
Ordinary steepest gradient descent method
• Iterative optimization algorithm
• Start from an initial parameter value
• Update iteratively the current parameter using a learning rate α (step
size) and the gradient of the energy function:

• First-order optimization method


• Zig-zag local minimum convergence
• Stopping criterion

Similarly, maximization with hill climbing, steepest ascent


Steepest descent in a Riemannian space:
The natural gradient
• The steepest descent direction of E(θ) in a Riemannian space is given by the natural gradient, yielding the update $\theta_{t+1} = \theta_t - \alpha\, g^{-1}(\theta_t)\, \nabla E(\theta_t)$, where α is the learning rate

Type checking: the natural gradient is the contravariant form of the ordinary gradient

Computing the inverse of the Fisher information matrix is tricky!


Amari, Shun-Ichi. "Natural gradient works efficiently in learning." Neural computation 10.2 (1998): 251-276.
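A hedged sketch of one natural-gradient step, θ ← θ - α g(θ)⁻¹ ∇E(θ): the Fisher matrix is passed in as a callable since its form depends on the model, and a small damping term is added for numerical stability (an assumption of this sketch, not prescribed by the slide).

```python
# Natural-gradient updates on a toy quadratic energy with a fixed (made-up) metric.
import numpy as np

def natural_gradient_step(theta, grad_E, fisher, alpha=0.1, damping=1e-6):
    F = fisher(theta)
    nat_grad = np.linalg.solve(F + damping * np.eye(len(theta)), grad_E(theta))
    return theta - alpha * nat_grad

grad_E = lambda th: 2 * th                                 # E(theta) = ||theta||^2
fisher = lambda th: np.array([[4.0, 0.0], [0.0, 0.25]])    # illustrative constant metric
theta = np.array([1.0, 1.0])
for _ in range(100):
    theta = natural_gradient_step(theta, grad_E, fisher)
print(theta)                                               # heads towards the minimum at 0
```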
Pros and cons of natural gradient
• Pros:
• Invariant (intrinsic) gradient (at infinitesimal scale/ODE)
• Not trapped in plateaus (close to degenerate FIM)
• Achieve Fisher efficiency in online learning

• Cons:
• Too expensive to compute (no closed-form FIM; need matrix inversion;
numerical stability) -> Other Riemannian metrics studied
• Degenerate for irregular models (e.g., hierarchical models, Deep learning)
• Need to adapt the step size
Relative Fisher Information Matrix (RFIM) and
Relative Natural Gradient (RNG) for deep learning

Relative Fisher IM:

Dynamic
geometry!

The RFIMs of single-neuron models, a linear layer, a non-linear layer, a soft-max layer, and two consecutive layers all have simple closed-form expressions
Relative Fisher Information and Natural Gradient for Learning Large Modular Models (ICML'17)
Dualistic structures
of
Information geometry

and
Information projections
(may seem at first counterintuitive)
An elementary introduction to information geometry, arXiv:1808.08271
An essential concept: the affine connection ∇
• Defines how to “parallel transport” a vector from one tangent plane to another tangent plane by infinitesimally parallel-shifting it along a curve (thus it generally depends on the curve)
• ∇-geodesics = autoparallel curves
(Figure: parallel transport of V(p) to V(q) along a curve γ, using ∇_γ̇.)
Élie Joseph Cartan (1869-1951)
https://arxiv.org/abs/1808.08271
Curvature of a connection ∇ (osculating circle, sectional curvatures)

A cylinder is flat: parallel transport is path-independent.
A sphere has constant positive curvature: parallel transport is path-dependent.
Dual exponential/mixture affine connections
Historically, the e-connection (exponential) and the
m-connection (mixture) were built for statistical models
Log-likelihood

e-connection

m-connection
DUAL CONNECTIONS with respect to the Fisher information (Riemannian) metric (the distances may not be needed here)
Dualistic structure of the Gaussian manifold
∇: e-connection, flat
∇*: m-connection, flat
Dually flat space! (both connections flat)

m-geodesic

e-geodesic
In a dually flat space, a dual Pythagoras' theorem holds

Bregman
manifold
induced by a
convex function

Two (affine) coordinate systems coupled by Legendre-Fenchel transformation


Two dually flat connections with respect to the metric tensor
Canonical distance = Bregman divergence induced by convex generator F
Generalizes Euclidean space; very practical for computing!
Geodesic triangles in Bregman manifolds

3 vertices define 6 geodesic edges, from which 8 geodesic triangles can be built, defining 18 interior angles

Geodesic triangle with two right angles.
Geometry induced by dual convex potentials (the geometry is NOT conformal).
https://arxiv.org/abs/1910.03935
Recalling projections in Euclidean geometry….
Orthogonality and uniqueness of projection

Proof using Pythagoras' theorem and distance minimization.

Guaranteed unique projection versus non-unique projection


On uniqueness of projections in dually flat spaces
Projection to a submanifold
with respect to a connection

In dually flat spaces, projections are unique when the submanifold is flat with respect to the dual connection: the e-projection onto an m-flat submanifold is unique, and the m-projection onto an e-flat submanifold is unique.

Maximum likelihood estimator
for an exponential family as
an information m-projection
Exponential Family Manifold (EFM) is e-flat
Observed point

KL = relative entropy

What is an information projection?, Notices AMS 65.3 (2018)


MaxEnt as an information e-projection
• MaxEnt linear constraints define an m-flat submanifold

Pythagoras’ theorem (Fisher orthogonality)

What is an information projection?, Notices AMS 65.3 (2018)


The geometric framework interprets MLE and MaxEnt as KL divergence minimizations, i.e., as information projections (and yields uniqueness proofs):
• MaxEnt (with prior q): e-projection onto an m-flat
• Maximum Likelihood Estimate: m-projection onto an e-flat
Divergences: Statistical (oriented) distances or
smooth parametric distances
• In information theory, the relative entropy is called the Kullback-Leibler divergence (KLD)

• The KLD can be extended to f-divergences: $D_f(p:q) = \int p(x)\, f\!\left(\tfrac{q(x)}{p(x)}\right) \mathrm{d}x$

• Plug f(u) = -log(u) and the f-divergence is the KLD


Classes of distances: Csiszar’s f-divergence
• Function f convex, strictly convex at 1, with f(1)=0

On the chi square and higher-order chi distances for approximating f-divergences, IEEE SPL 2013
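A minimal sketch of a discrete Csiszár f-divergence, checking that the generator f(u) = -log(u) recovers the Kullback-Leibler divergence as stated above (discrete distributions chosen for illustration).

```python
# D_f(p:q) = sum_x p(x) f(q(x)/p(x)) for a convex generator f with f(1) = 0.
import numpy as np

def f_divergence(p, q, f):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * f(q / p)))

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.1, 0.6, 0.3])
kl = f_divergence(p, q, lambda u: -np.log(u))
print(kl, float(np.sum(p * np.log(p / q))))        # both equal KL(p:q)
```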
The Fisher information matrix and f-divergences

• Fisher Information Metric (FIM): Fisher Riemannian metric

• Infinitesimally, the Kullback-Leibler divergence (and any f-divergence) is related to the FIM: $D_f(p_\theta : p_{\theta + \mathrm{d}\theta}) \approx \tfrac{f''(1)}{2}\, \mathrm{d}\theta^\top I(\theta)\, \mathrm{d}\theta$


Invariant divergence = f-divergences
• Lump or coarse-bin a separable distance, and ask for
information monotonicity

Theorem: the only monotone separable divergences are f-divergences
(except for the curious case of binary alphabets).
f-divergences are invariant under diffeomorphisms of the sample space
Dual connections from any divergence! $(M, {}^D g, {}^D\nabla, {}^D\nabla^{*})$

Any smooth parametric distance, called a (parameter) divergence D (not necessarily symmetric), induces:

• a metric tensor g:
$${}^D g_{ij}(\xi) = -\left.\frac{\partial}{\partial \xi_1^i} \frac{\partial}{\partial \xi_2^j}\, D(p_{\xi_1} : p_{\xi_2})\right|_{\xi_1=\xi_2=\xi}$$

• a torsion-less affine connection ∇:
$${}^D\Gamma_{ij,k}(\xi) = -\left.\frac{\partial}{\partial \xi_1^i} \frac{\partial}{\partial \xi_1^j} \frac{\partial}{\partial \xi_2^k}\, D(p_{\xi_1} : p_{\xi_2})\right|_{\xi_1=\xi_2=\xi}$$
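A numerical illustration (sketch) of this construction: applying the mixed-derivative formula to the KL divergence between Bernoulli distributions recovers the Fisher information 1/(θ(1-θ)).

```python
# Induced metric g(theta) = -d^2/(d theta1 d theta2) D(theta1 : theta2) at theta1=theta2,
# evaluated here by central finite differences on the Bernoulli KL divergence.
import numpy as np

def kl_bernoulli(a, b):
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

def induced_metric(D, theta, h=1e-4):
    return -(D(theta + h, theta + h) - D(theta + h, theta - h)
             - D(theta - h, theta + h) + D(theta - h, theta - h)) / (4 * h**2)

theta = 0.3
print(induced_metric(kl_bernoulli, theta))         # ~ 4.76
print(1 / (theta * (1 - theta)))                   # Fisher information of Bernoulli(theta)
```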
Dual divergences $D^{*}(p_{\xi_1} : p_{\xi_2}) = D(p_{\xi_2} : p_{\xi_1})$ induce dual connections.
Symmetric divergences yield the same connection: the Levi-Civita connection.
Which geometry is best suited for
clustering normalized histograms?
Bag of words
Fisher-Rao geometry of the
categorical distribution (standard simplex)
• Trinomial (trinoulli)
Fisher information metric:

Square-root embedding onto the positive orthant of the sphere.

(Hotelling-)Fisher-Rao distance (see the sketch below):
$$\rho(p, q) = 2 \arccos\!\Big(\sum_i \sqrt{p_i q_i}\Big)$$

Pattern Learning and Recognition on Statistical Manifolds: An Information-Geometric Review, SIMBAD 2013
Clustering in Hilbert simplex geometry, arXiv:1704.00454
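A minimal sketch of the square-root embedding and the resulting Fisher-Rao distance on the categorical simplex, as recalled above.

```python
# Fisher-Rao (Hotelling) distance between categorical distributions p and q:
# map to the sphere via sqrt and take twice the great-circle arc length.
import numpy as np

def fisher_rao_categorical(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    cos_angle = np.clip(np.sqrt(p * q).sum(), -1.0, 1.0)  # <sqrt(p), sqrt(q)>
    return 2.0 * np.arccos(cos_angle)

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.1, 0.6, 0.3])
print(fisher_rao_categorical(p, q))
```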
Hilbert log cross-ratio metric

Geodesics are straight lines but not unique

Finsler
geometry!

Clustering in Hilbert simplex geometry. CoRR abs/1704.00454 (2017)
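A minimal sketch of Hilbert's log cross-ratio (projective) distance restricted to the open probability simplex, the metric used in the clustering experiments cited above.

```python
# Hilbert simplex distance: log of the ratio between the largest and smallest
# coordinate-wise ratios p_i / q_i; it vanishes iff p == q and is symmetric.
import numpy as np

def hilbert_simplex_distance(p, q):
    ratio = np.asarray(p, float) / np.asarray(q, float)
    return float(np.log(ratio.max() / ratio.min()))

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.1, 0.6, 0.3])
print(hilbert_simplex_distance(p, q), hilbert_simplex_distance(q, p))
```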


Hilbert log cross-ratio metric for the standard simplex:
experiments with k-means clustering

Clustering in Hilbert simplex geometry. CoRR abs/1704.00454 (2017)


Sailing on a sea of distances:
• Which distance is suitable?
• Which loss function to minimize?
• Which “metric” to evaluate?
•…
• Are there first principles?
Classes of distances: Bregman divergence
• Bregman divergence between parameters, for a strictly convex and differentiable generator F (geometric design):
$$B_F(\theta_1 : \theta_2) = F(\theta_1) - F(\theta_2) - (\theta_1 - \theta_2)^\top \nabla F(\theta_2)$$

• Unifies squared Euclidean geometry and the geometry of information theory

• The canonical divergence of dually flat spaces


Mining matrix data with Bregman matrix divergences for portfolio selection. Matrix Information Geometry, Springer, Berlin, Heidelberg, 2013, pp. 373-402.
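A minimal sketch of the Bregman divergence above, instantiated with two classical generators: F(x) = ½‖x‖² gives half the squared Euclidean distance, and the negative Shannon entropy (on the simplex) gives the Kullback-Leibler divergence.

```python
# Generic Bregman divergence plus two standard generators.
import numpy as np

def bregman(F, gradF, x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(F(x) - F(y) - (x - y) @ gradF(y))

sq = bregman(lambda v: 0.5 * v @ v, lambda v: v, [1.0, 2.0], [0.0, 0.0])   # 0.5 * ||x - y||^2

p, q = np.array([0.2, 0.5, 0.3]), np.array([0.1, 0.6, 0.3])
kl = bregman(lambda v: np.sum(v * np.log(v)), lambda v: np.log(v) + 1, p, q)
print(sq, kl, float(np.sum(p * np.log(p / q))))    # kl matches KL(p:q) on the simplex
```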
Jensen difference / Jensen divergence (Burbea-Rao) (geometric design)
• Introduced by Burbea and Rao
• Vertical gap induced by Jensen's inequality

Asymptotically, the scaled (skewed) Jensen divergence amounts to a Bregman or reverse Bregman divergence

The Burbea-Rao and Bhattacharyya centroids. IEEE Transactions on Information Theory 57.8 (2011): 5455-5466.
Bregman chord divergence: https://arxiv.org/abs/1810.09113
A family of statistical symmetric divergences based on Jensen's inequality, arXiv:1009.4004
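A minimal sketch of the skewed Jensen (Burbea-Rao) divergence J_F^α(p:q) = αF(p) + (1-α)F(q) - F(αp + (1-α)q), with a numerical check that J_F^α/α tends to a Bregman divergence as α → 0 (the negative Shannon entropy is chosen as the generator for illustration).

```python
# Skewed Jensen divergence and its Bregman limit for a small skew parameter.
import numpy as np

F = lambda v: np.sum(v * np.log(v))                # negative Shannon entropy
gradF = lambda v: np.log(v) + 1

def jensen(p, q, a=0.5):
    return a * F(p) + (1 - a) * F(q) - F(a * p + (1 - a) * q)

def bregman(p, q):
    return F(p) - F(q) - (p - q) @ gradF(q)

p, q = np.array([0.2, 0.5, 0.3]), np.array([0.1, 0.6, 0.3])
for a in (0.5, 0.1, 0.01, 0.001):
    print(a, jensen(p, q, a) / a)                  # approaches the Bregman divergence
print(bregman(p, q))                               # equals KL(p:q) for this generator
```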
Classical wisdom of machine learning:
The bias-variance tradeoff…

We used to think: do not overfit, for better generalization!
(Hold out a test sample for evaluating the generalization error.)
Modern view of machine learning:
The age of Interpolation machines!
• OK, let us aim for zero training error! (Gaussian processes or neural networks)

Neural networks perform well even when overparameterized

• But to get good models, let us apply Occam's razor and choose the smoothest interpolating function

Deep Neural Networks as Gaussian Processes, https://arxiv.org/abs/1711.00165


Modern view of machine learning:
The double descent view/regime of models

Semi-Riemannian geometry of the neuromanifold and learning trajectories

Reconciling modern machine learning practice and the bias-variance trade-off, https://arxiv.org/abs/1812.11118

Lightlike Neuromanifolds, Occam's Razor and Deep Learning, arXiv:1905.11027


Concluding remarks: computational information geometry for ML+AI!

• From the very beginning, computational geometry played


a major role in machine learning!

• Geometry: Design guiding principle promoting insightful intuition


and science of invariance. Meaning of distances.

• Dualistic structure of information geometry + information projection:


Role of Fisher information matrix/metric in ML (Fisher kernel, etc.)
Theory of communication between data/(sub)models and models
(but may seem at first counterintuitive)
Geometric Science of Information (GSI)
Co-organized every 2 years since 2013

180 participants, August, Toulouse, France, 2019

Joint Structures and Common Foundations of


Statistical Physics, Information Geometry and
Inference for Learning
26th July to 31st July 2020
Ecole de Physique des Houches