A Geometric Modeling of Occam's Razor in Deep Learning
Ke Sun
CSIRO’s Data61, Australia
[email protected], [email protected]
Frank Nielsen
Sony Computer Science Laboratories Inc. (Sony CSL)
Tokyo, Japan
[email protected]
Version: June 2025
Abstract
Why do deep neural networks (DNNs) benefit from very high dimensional parameter
spaces? The contrast between their huge parameter complexity and their stunning performance in practice is all the more intriguing, and cannot be explained by the standard theory of model selection for regular
models. In this work, we propose a geometrically flavored information-theoretic approach to
study this phenomenon. With the belief that simplicity is linked to better generalization, as
grounded in the theory of minimum description length, the objective of our analysis is to
examine and bound the complexity of DNNs. We introduce the locally varying dimensionality
of the parameter space of neural network models by considering the number of significant
dimensions of the Fisher information matrix, and model the parameter space as a manifold
using the framework of singular semi-Riemannian geometry. We derive model complexity
measures which yield short description lengths for deep neural network models based on
their singularity analysis thus explaining the good performance of DNNs despite their large
number of parameters.
Keywords: Information geometry, Deep learning, Minimum Description Length, Fisher informa-
tion, Stochastic complexity
1 Introduction
Deep neural networks (DNNs) are usually large models in terms of storage costs. In the classical
model selection theory, such models are not favored as compared to simple models with the same
training performance. For example, if one applies the Bayesian information criterion (BIC) [70]
to DNN, a shallow neural network (NN) will be preferred over a deep NN due to the penalty
term with respect to (w.r.t.) the complexity. A basic principle in science is Occam's razor¹,
which favors simple models over complex ones that accomplish the same task. This raises the
fundamental question of how to measure the simplicity or the complexity of a model.
Formally, the preference of simple models has been studied in the area of minimum description
length (MDL) [25, 66, 67], also known in another thread of research as the minimum message
length (MML) [77]. By the theory of MDL [25], statistical models that can most concisely
communicate the observed data are favored and expected to generalize better [9, 11, 24, 26, 83].
This is intuitive, as complex models often lead to overfitting.
Consider a parametric family of distributions M = {p(x | θ)} with θ ∈ Θ ⊂ RD . The
distributions are mutually absolutely continuous, which guarantees all densities to have the same
support. Otherwise, many problems of non-regularity will arise as described by [28, 61]. The
Fisher information matrix (FIM) I(θ) is a D × D positive semi-definite (psd) matrix: I(θ) ⪰ 0.
The model is called regular if it is (i) identifiable [13] with (ii) a non-degenerate and finite Fisher
information matrix (i.e., I(θ) ≻ 0).
In a Bayesian setting, the description length of a set of N i.i.d. observations X = {x_i}_{i=1}^N ⊂ 𝒳
w.r.t. M can be defined as the number of nats with the coding scheme of a parametric model
p(x | θ) and a prior p(θ). The code length of any x_i is given by the cross entropy between the
empirical distribution δ_i(x) = δ(x − x_i), where δ(·) denotes the Dirac's delta function, and
p(x) = ∫ p(x | θ)p(θ) dθ. Therefore, the description length of X is
−log p(X) = Σ_{i=1}^N h^×(δ_i : p) = −Σ_{i=1}^N log ∫ p(x_i | θ) p(θ) dθ,   (1)
where h^×(p : q) := −∫ p(x) log q(x) dx denotes the cross entropy between p(x) and q(x), and log
denotes natural logarithm throughout the paper. The code length means the cumulative loss of
the Bayesian mixture model p(x) w.r.t. the observations X. Equation (1) corresponds to the
Bayesian universal code. In MDL, the optimal code in terms of the minimax strategy [71] is given
by the normalized maximum likelihood (NML) code. With a suitable choice of the prior, the
Bayesian universal code and the NML code asymptotically coincide [25] with O(1) difference.
By using Jeffreys'² non-informative prior [3] as p(θ), the MDL in eq. (1) can be approximated
(see [7, 66, 67]) as
χ = −log p(X | θ̂) [fitness] + (D/2) log(N/2π) [penalize high dof] + log ∫ √|I(θ)| dθ [model capacity],   (2)
where θ̂ ∈ Θ is the maximum likelihood estimation (MLE), or the projection [3] of X onto the
model, D = dim(Θ) is the model size, N is the number of observations, and | · | denotes the
matrix determinant. In this paper, the symbols χ and O and the term “razor” all refer to the
same concept, that is the description length of the data X by the model M. The smaller those
quantities, the better.
The first term in eq. (2) is the fitness of the model to the observed data. The second and the
third terms measure the geometric complexity [50] and make χ favor simple models. The second
O(log N ) term only depends on the number of parameters D and the number of observations N .
It penalizes large models with a high degree of freedom (dof). The third O(1) term is independent of the observed data and measures the model capacity, or the total “number” of distinguishable distributions [50] in the model.
¹ William of Ockham (ca. 1287 – ca. 1347), a monk (friar) and philosopher.
² Sir Harold Jeffreys (1891–1989), a British statistician.
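To make the quantities in eq. (2) concrete, the following minimal numerical sketch (our illustrative Bernoulli example, not taken from the paper) evaluates the three terms of the razor χ for a regular one-parameter model, for which the Fisher volume ∫√I(θ) dθ equals π.

import numpy as np

rng = np.random.default_rng(0)
N = 100
X = rng.binomial(1, 0.3, size=N)            # observations from a Bernoulli model
theta_hat = X.mean()                         # maximum likelihood estimate

# Term 1: fitness, -log p(X | theta_hat).
fitness = -np.sum(X * np.log(theta_hat) + (1 - X) * np.log(1 - theta_hat))

# Term 2: (D/2) log(N / (2 pi)), with D = 1 parameter.
D = 1
dof_penalty = 0.5 * D * np.log(N / (2 * np.pi))

# Term 3: log of the Fisher volume, log int sqrt(I(theta)) d theta, where
# I(theta) = 1 / (theta (1 - theta)); the integral over (0, 1) equals pi.
edges = np.linspace(0.0, 1.0, 200_001)
mid = 0.5 * (edges[:-1] + edges[1:])
fisher_volume = np.sum(np.sqrt(1.0 / (mid * (1.0 - mid)))) * (edges[1] - edges[0])
capacity = np.log(fisher_volume)             # approximately log(pi)

chi = fitness + dof_penalty + capacity
print(f"fitness={fitness:.2f}  dof penalty={dof_penalty:.2f}  "
      f"capacity={capacity:.2f}  razor chi={chi:.2f}")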
Unfortunately, this razor χ in eq. (2) does not fit straightforwardly into DNNs, which are
high-dimensional singular models. The FIM I(θ) is a large singular matrix (not full rank) and
the last term may be difficult to evaluate. Based on the second term on the right-hand-side
(RHS), a DNN can have very high complexity and therefore is less favored against a shallow
network. This contradicts the good generalization of DNNs as compared to shallow NNs. These
issues call for a new analysis of the MDL in the DNN setting.
Towards this direction, we make the following contributions in this paper:
– New concepts and methodologies from singular semi-Riemannian geometry [41] to analyze
the space of neural networks;
– A definition of the local dimensionality in this space, that is the amount of non-singularity,
with bounding analysis;
– A connection between f -mean and DNN model complexity with related bounds;
– A new MDL formulation, which explains how the singularities contribute to the “negative
complexity” of DNNs: that is, the model becomes simpler as the number of parameters grows.
The rest of this paper is organized as follows. Section 2 reviews singularities in information
geometry. In the setting of a DNN, section 3 introduces its singular parameter manifold. Section 4
bounds the number of singular dimensions of the parameter manifold of the DNN. Sections 5
to 8 derive our MDL criterion based on two different priors, and discuss how model complexity is
affected by the singular geometry. We discuss related work in section 9 and conclude in section 10.
Proofs and related derivations of our main results are provided in the appendix.
where Ep denotes the expectation w.r.t. p(x | θ). The corresponding infinitesimal squared length
element ds² = tr(I(θ) dθ dθ⊺) = ⟨dθ, dθ⟩_{I(θ)} = dθ⊺ I(θ) dθ, where tr(·) means the matrix trace⁴,
is independent of the underlying parameterization of the population space.
Amari further developed this approach by revealing the dualistic structure of statistical
manifolds which extends the Riemannian framework [3, 54]. The MDL criterion arising from
3 To be more precise, a statistical manifold [42] is a structure (∇, g, C) on a smooth manifold M, where g is a
metric tensor, ∇ a torsion-free affine connection, and C is a symmetric covariant tensor of order 3.
4 Using the cyclic property of the matrix trace, we have ds2 = tr(I(θ)dθdθ ⊺ ) = dθ ⊺ I(θ)dθ.
[Figure 1: a null curve (equivalent models) connecting θ and θ′, with tangent directions ∂θ_i at θ and ∂θ_{i′} at θ′, illustrating the radical distribution.]
the geometry of Bayesian inference with Jeffreys’ prior for regular models is detailed in [7]. In
information geometry, the regular assumption is (1) an open connected parameter space in some
Euclidean space; and (2) the FIM exists and is non-singular. However, in general, the FIM is
only positive semi-definite and thus for non-regular models like neuromanifolds [3] or Gaussian
mixture models [78], the manifold is not Riemannian but singular semi-Riemannian [17, 41].
In the machine learning community, singularities have often been dealt with as a minor issue:
For example, the natural gradient has been generalized based on the Moore-Penrose inverse of
I(θ) [75] to avoid potential non-invertible FIMs. Watanabe [78] addressed the fact that most
usual learning machines are singular in his singular learning theory which relies on algebraic
geometry. Nakajima and Ohmoto [52] discussed dually flat structures for singular models.
Recently, preliminary efforts [6, 33] tackle singularity at the core, mostly from a mathematical
standpoint. For example, Jain et al. [33] studied the Ricci curvature tensor of such manifolds.
These mathematical notions are used in the community of differential geometry or general
relativity but have not yet been ported to the machine learning community.
Following these efforts, we first introduce informally some basic concepts from a machine
learning perspective to define the differential geometry of non-regular statistical manifolds. The
tangent space Tθ (M) is a D-dimensional (D = dim(M)) real vector space, that is the local
linear approximation of the manifold M at the point θ ∈ M, equipped with the inner product
induced by I(θ). The tangent bundle T M := {(θ, v), θ ∈ M, v ∈ Tθ } is the 2D-dimensional
manifold obtained by combining all tangent spaces for all θ ∈ M. A vector field is a smooth
mapping from M to T M such that each point θ ∈ M is attached a tangent vector originating
from itself. Vector fields are cross-sections of the tangent bundle. In a local coordinate chart θ,
the vector fields along the frame are denoted as ∂θi . A distribution (not to be confused with
probability distributions which are points on M) means a vector subspace of the tangent bundle
spanned by several independent vector fields, such that each point θ ∈ M is associated with a
subspace of Tθ (M) and those subspaces vary smoothly with θ. Its dimensionality is defined by
the dimensionality of the subspace, i.e., the number of vector fields that span the distribution.
In a lightlike manifold [17, 41] M, I(θ) can be degenerate. The tangent space Tθ (M) is
a vector space with a kernel subspace, i.e., a nullspace. A null vector field is formed by null
vectors, whose lengths measured according to the Fisher metric tensor are all zero. The radical 5
distribution Rad(T M) is the distribution spanned by the null vector fields. Locally at θ ∈ M,
the tangent vectors in Tθ (M) which span the kernel of I(θ) are denoted as Radθ (T M). In a
local coordinate chart, Rad(T M) is well defined if these Radθ (T M) form a valid distribution.
We write T M = Rad(T M) ⊕ S(T M), where “⊕” is the direct sum, and the screen distribution
S(T M) is complementary to the radical distribution Rad(T M) and has a non-degenerate induced
metric. See fig. 1 for an illustration of the concept of radical distribution.
We can find a local coordinate frame (a frame is an ordered basis) (θ_1, ···, θ_d, θ_{d+1}, ···, θ_D),
where the first d dimensions θ^s = (θ_1, ···, θ_d) correspond to the screen distribution, and the
remaining d̄ := D − d dimensions θ^r = (θ_{d+1}, ···, θ_D) correspond to the radical distribution. The
local inner product ⟨·, ·⟩_I satisfies
⟨∂θ_i, ∂θ_j⟩_I = δ_ij if 1 ≤ i, j ≤ d,   and   ⟨∂θ_i, ∂θ_j⟩_I = 0 if i > d or j > d,
where δ_ij = 1 if and only if (iff) i = j and δ_ij = 0, otherwise. Unfortunately, this frame is
not unique [16]. We will abuse I to denote both the FIM of θ and the FIM of θ s . One has to
remember that I(θ) ⪰ 0, while I(θ s ) ≻ 0 is a proper Riemannian metric. Hence, both I −1 (θ s )
and log |I(θ s )| are well-defined.
Remark 1. Notice that the Fisher information matrix is covariant under reparameterization.
That is, let θ(λ) be an invertible smooth reparameterization of λ. Then the FIM rewrites in the
θ-parameterization as:
I(θ) = J_{θ→λ}⊺ I(λ(θ)) J_{θ→λ},   (4)
where Jθ→λ is the full rank Jacobian matrix.
The natural gradient flows (vector fields on M) with respect to λ and θ coincide but not the
natural gradient descent methods (learning paths that consist of sequences of points on M) because
of the non-zero learning step sizes.
Furthermore, the ranks of I(θ) and I(λ) as well as the dimensions of the screen and radical
distributions coincide. Hence, the notion of singularities is intrinsic and independent of the
smooth reparameterization.
3 Lightlike Neuromanifold
This section instantiates the concepts in the previous section 2 in terms of a simple DNN predictive
model. The random variable x = (z, y) of interest consists of two components: z, referred to as
the “input”, and y, referred to as the “target”. By assumption, their joint probability distribution
is specified by
log p(x | ψ, θ) = log p(z | ψ) + log p(y | z, θ),
where p(z | ψ) is a generative model of z which is parameterized by ψ, p(y | z, θ) is a predictive
DNN, and θ consists of all neural network parameters.
Our main subject is the latter predictive model p(y | z, θ) and its parameter manifold Mθ .
Here, we need the generative model p(z | ψ) for the purpose of discussing how the geometry of
5 Radical stems from Latin and means root.
Mθ is affected by the choice of p(z | ψ) and can be studied independent of the parameter space
of p(z | ψ), which we denote as Mψ . In the end, our results do not depend on the specific form
of p(z) or whether it is parametric.
For p(y | z, θ), we consider a deep feed-forward network with L layers, uniform width M
except the last layer which has m output units (m < M ), input z ∈ Z with dim(Z) = M ,
pre-activations h^l of size M (except that in the last layer, h^L has m elements), post-activations
z^l of size M, weight matrices W^l and bias vectors b^l (1 ≤ l ≤ L). The layers are given by
z^l = ϕ(h^l),   h^l = W^l z^{l−1} + b^l,   z^0 = z,   (5)
where ϕ is an element-wise nonlinear activation function such as ReLU [22].
Without loss of generality, we assume multinomial output units, so that the DNN output [23] is
y ∼ Multinomial(SoftMax(h^L)), where SoftMax(·) denotes the softmax function. SoftMax(h^L) is a
random point in Δ^m, the (m − 1)-dimensional probability simplex.
As we are interested in the predictive model corresponding to the diagonal block I(θ), we
further have (see e.g. [57, 72] for derivations)
I(θ) = E_{p(z)}[(∂h^L(z)/∂θ)⊺ C(z) (∂h^L(z)/∂θ)],   (7)
where the expectation is taken w.r.t. p(z) := p(z | ψ), an underlying true distribution in the
input space depending on the parameter ψ. ∂h^L(z)/∂θ is the m × D parameter-output Jacobian
matrix, based on a given input z, C(z) := diag (o(z)) − o(z)o(z)⊺ ⪰ 0, diag (·) means the
diagonal matrix with the given diagonal entries, and o(z) := SoftMax(hL (z)) is the predicted
class probabilities of z. By the definition of SoftMax, each dimension of o(z) represents a positive
probability, although o(z) can be arbitrarily close to a one-hot vector. As a result, the kernel of
the psd matrix C(z) is given by {λ1 : λ ∈ R}, where 1 is the vector of all 1’s.
In eq. (7), I(θ) is the single-observation FIM. It is obvious that the FIM w.r.t. the joint
distribution p(X | θ) of multiple observations is N I(θ) (Fisher information is additive), so that
I(θ) does not scale with N . In theory, computing I(θ) requires assuming p(z), which depends on
the parameter ψ. This makes sense as (ψ1 , θ) and (ψ2 , θ) with ψ1 ̸= ψ2 are different points on
the product manifold Mψ × Mθ and thus their I(θ) should be different. In practice, one only gets
access to a set of N i.i.d. samples drawn from an unknown p(z | ψ). In this case, it is reasonable to
take p(z) in eq. (7) to be the empirical distribution p̂(z), so that p(z) = p̂(z) := (1/N) Σ_{i=1}^N δ(z − z_i),
then
I(θ) = Î(θ) := (1/N) Σ_{i=1}^N (∂h^L(z_i)/∂θ)⊺ C(z_i) (∂h^L(z_i)/∂θ).   (8)
The FIM computed in this way does not rely on the assumption of a parametric generative model
p(z | ψ) and the choice of a ψ. Î(θ) can be directly computed from the observed zi ’s and does
not depend on the observed yi ’s. Although denoted differently than I(θ) in the current paper,
this Î(θ) is a standard version of the definition of the FIM for neural networks [40, 47, 72, 73].
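The following self-contained sketch (our toy example, not the authors' code; all sizes and names are illustrative assumptions) assembles Î(θ) as in eq. (8) for a small two-layer ReLU network with softmax output, using a finite-difference estimate of the parameter-output Jacobian ∂h^L(z)/∂θ.

import numpy as np

rng = np.random.default_rng(1)
M_in, M_hid, m = 3, 4, 3                     # input width, hidden width, output units
shapes = [(M_hid, M_in), (M_hid,), (m, M_hid), (m,)]   # W1, b1, W2, b2
sizes = [int(np.prod(s)) for s in shapes]
D = sum(sizes)                               # total number of parameters (here 31)

def unpack(theta):
    parts, i = [], 0
    for s, n in zip(shapes, sizes):
        parts.append(theta[i:i + n].reshape(s))
        i += n
    return parts

def h_L(theta, z):
    """Last-layer pre-activation h^L(z) of eq. (5) with L = 2 and ReLU."""
    W1, b1, W2, b2 = unpack(theta)
    return W2 @ np.maximum(W1 @ z + b1, 0.0) + b2

def jacobian(theta, z, eps=1e-6):
    """Numerical m x D Jacobian of theta -> h^L(z), by central differences."""
    J = np.zeros((m, D))
    for k in range(D):
        e = np.zeros(D)
        e[k] = eps
        J[:, k] = (h_L(theta + e, z) - h_L(theta - e, z)) / (2 * eps)
    return J

theta = 0.5 * rng.normal(size=D)
Z = rng.normal(size=(10, M_in))              # N = 10 observed inputs z_i

I_hat = np.zeros((D, D))
for z in Z:                                  # eq. (8): average J^T C J over the samples
    h = h_L(theta, z)
    o = np.exp(h - h.max())
    o /= o.sum()                             # o(z) = SoftMax(h^L(z))
    C = np.diag(o) - np.outer(o, o)          # C(z), psd with rank m - 1
    Jz = jacobian(theta, z)
    I_hat += Jz.T @ C @ Jz
I_hat /= len(Z)

print("shape of I_hat:", I_hat.shape)
print("rank of I_hat :", np.linalg.matrix_rank(I_hat, tol=1e-6))

The printed rank is strictly smaller than D, anticipating the local dimensionality discussed in section 4.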
By considering the neural network weights and biases as random variables satisfying a
prescribed prior distribution [35, 60], this I(θ) can be regarded as a random matrix [48] depending
on the structure of the DNN and the prior. The empirical density of I(θ) is the empirical
distribution of its eigenvalues {λ_i}_{i=1}^D, that is, ρ_D(λ) = (1/D) Σ_{i=1}^D δ(λ − λ_i). If, in the limit D → ∞,
the empirical density converges to a probability density function (pdf), then this limit, denoted ρ_I(λ), is the spectral density of the FIM. The sample average of the negative Hessians of the log-likelihood,
J(θ) := −(1/N) Σ_{i=1}^N ∂²log p(y_i | z_i, θ)/∂θ∂θ⊺,
is called the observed FIM (sample-based FIM), which is also known as the “empirical Fisher” in
machine learning literature [40, 47]. In our notations explained in table 1, the FIM I depends
on the true distribution p(z) and does not depend on the observed samples. In the expression
of the FIM in eq. (7), if p(z) = p̂(z), then I becomes Î, which depends on the observed input
zi ’s. The observed FIM J depends on both the observed input zi ’s and the observed target yi ’s.
If p(z) = p̂(z), the observed FIM coincides with the FIM at the MLE θ̂ and J(θ̂) = Î(θ̂). For
general statistical models, there is a residual term in between these two matrices which scales
with the training error (see e.g. Eq. 6.19 in section 6 of [4], or eq. (24) in the appendix). How
these different metric tensors are called is just a matter of terminology. One should distinguish
them by examining whether/how they depend (partially) on the observed information.
Table 1: The FIM and the observed FIM. The last three columns explain whether the tensor
depends on the observed z_i's, whether it depends on the observed y_i's, and whether it can be
computed in practice based on empirical observations.

    Tensor                 Depends on z_i's   Depends on y_i's   Computable in practice
    FIM I(θ)               no                 no                 no
    Î(θ)                   yes                no                 yes
    Observed FIM J(θ)      yes                yes                yes
4 Local Dimensionality
This section quantitatively measures the singularity of the neuromanifold. Our main definitions
and results do not depend on the settings introduced in the previous section and can be generalized
to similar models including stochastic neural networks [12]. For example, if the output units or
the network structure is changed, the expression of the FIM and related results can be adapted
straightforwardly. Our derivations depend on the facts that (1) DNNs have a large amount of singularity
corresponding to zero eigenvalues of the FIM; and (2) the spectrum of the (observed) FIM has many
eigenvalues close to zero [36]. That being said, our results also apply to singular models [78] with
similar properties.
Definition 1 (Local dimensionality). The local dimensionality d(θ) := rank(I(θ)) of the
neuromanifold M at θ ∈ M refers to the rank of the FIM I(θ). If p(z) = p̂(z), then
d(θ) = d̂(θ) := rank(Î(θ)).
The local dimensionality d(θ) is the number of degrees of freedom at θ ∈ M which can change
the probabilistic model p(y | z, θ) in terms of information theory. One can find a reparameterized
DNN with d(θ) parameters, which is locally equivalent to the original DNN with D parameters.
Recall the dimensionality of the tangent bundle is two times the dimensionality of the manifold.
Remark 2. The dimensionality of the screen distribution S(T M) at θ is 2 d(θ).
By definition, the FIM as the singular semi-Riemannian metric of M must be psd. Therefore
it only has positive and zero eigenvalues, and the number of positive eigenvalues d(θ) is not
constant as θ varies in general.
Remark 3. The local metric signature (number of positive, negative, zero eigenvalues of the
FIM) of the neuromanifold M is (d(θ), 0, D − d(θ)), where d(θ) is the local dimensionality.
The local dimensionality d(θ) depends on the specific choice of p(z). If p(z) = p̂(z), then
d(θ) = d̂(θ) = rank(Î(θ)). On the other hand, one can use the rank of the negative Hessian
J(θ) (i.e., observed rank) to get an approximation of the local dimensionality d(θ) ≈ rank(J(θ)).
At the MLE θ̂, this approximation becomes accurate. We simply denote d and d̂, instead of d(θ)
and d̂(θ), if θ is clear from the context.
We first show that the lightlike dimensions of M do not affect the neural network model in
eq. (5).
Lemma 1. If (θ, Σ_j α_j ∂θ_j) ∈ Rad(T M), i.e. ⟨Σ_j α_j ∂θ_j, Σ_j α_j ∂θ_j⟩_{I(θ)} = 0, then almost surely
we have (∂h^L(z)/∂θ) α = λ(z)1, where λ(z) ∈ R.

By lemma 1, the Jacobian ∂h^L(z)/∂θ is the local linear approximation of the map θ → h^L. The
dynamic α (coordinates of a tangent vector) on M causes a uniform increment on the output
h^L, which, after the SoftMax function, does not change the neural network map z → y.
Then, we can upper-bound the local dimensionality using the rank of the parameter-output
Jacobian ∂h^L(z)/∂θ.

Proposition 2. ∀θ ∈ M, d̂(θ) ≤ Σ_{i=1}^N min{rank(∂h^L(z_i)/∂θ), m − 1}.
Remark 4. While the total number D of free parameters is unbounded in DNNs, the local
dimensionality estimated by d̂(θ) grows at most linearly w.r.t. the sample size N, given fixed m
(size of the last layer). If both N and m are fixed, then d̂(θ) is bounded even when the network
width M → ∞ and/or depth L → ∞.

The above bound is based on the inequality rank(Σ_i A_i) ≤ Σ_i rank(A_i) for any matrices
A_i, which could lead to loose bounds. Alternatively, an upper bound can be established directly
based on the definition of the matrix rank.
Proposition 3. For all θ ∈ M, we have d(θ) ≤ dim span(∪_{z∈supp(p)} Row(∂h^L(z)/∂θ)), where
supp(p) is the support of p(z), and Row(A) denotes the row vectors of the matrix A. Similarly,
d̂(θ) has an upper bound obtained by replacing supp(p) with supp(p̂) on the RHS, i.e. the union
is over the observed z_i's.
In summary, the lower the rank of ∂h^L(z)/∂θ, the more potential singularities in the neuromanifold.
Note the Jacobian ∂h^L(z)/∂θ can be further written as
∂h^L(z)/∂θ = (∂h^L(z)/∂w^1, ∂h^L(z)/∂w^2, ···, ∂h^L(z)/∂w^L),   (10)
where w^l contains all parameters in the l'th layer (l = 1, ···, L) and is obtained by stacking the
columns of W^l and b^l into a long vector. We can bound ∂h^L(z)/∂w^l individually as below.
Proposition 4. It holds that
rank(∂h^L(z)/∂w^l) ≤ rank(W^L Φ^{L−1} W^{L−1} ··· W^{l+1} Φ^l) ≤ min_{s=l}^{L−1} rank(Φ^s),   (11)
where Φ^l = diag(ϕ′(h^l_1), ···, ϕ′(h^l_M)) is the Jacobian of the l'th activation layer.

Observe that the upper bounds in proposition 4 are monotonically decreasing with respect to
L. For example, we have
rank(W^L Φ^{L−1} ··· W^{l+1} Φ^l) ≤ rank(W^L Φ^{L−1} ··· W^{l+2} Φ^{l+1}).   (12)
Based on these upper bounds, ∂h^L(z)/∂w^l is potentially more singular for layers that are close to the
input.
Remark 5. If ϕ is ReLU, then the diagonal entries of Φ^s form a binary vector, and rank(Φ^s) is
the number of activated neurons in the s'th layer. In this case, the upper bound min_{s=l}^{L−1} rank(Φ^s)
means the smallest number of activated neurons across all layers.
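As a quick numerical illustration (our example, with arbitrary sizes), the snippet below builds the matrix chain of proposition 4 from random weights and random ReLU activation patterns and compares its rank with the bound of remark 5.

import numpy as np

rng = np.random.default_rng(2)
M, m, L = 8, 3, 4                                  # width, outputs, depth
l = 1                                              # bound for the first layer's weights

# Random weight matrices W^2 ... W^L (W^L maps to the m outputs).
W = {s: rng.normal(size=(M, M)) for s in range(2, L)}
W[L] = rng.normal(size=(m, M))

# Random ReLU activation patterns: Phi^s = diag of a binary vector whose
# ones mark the activated neurons in layer s (remark 5).
Phi = {s: np.diag(rng.integers(0, 2, size=M).astype(float)) for s in range(l, L)}

chain = W[L]
for s in range(L - 1, l - 1, -1):                  # W^L Phi^{L-1} W^{L-1} ... Phi^l
    chain = chain @ Phi[s]
    if s > l:
        chain = chain @ W[s]

bound = min(int(np.trace(Phi[s])) for s in range(l, L))   # min_s rank(Phi^s)
print("rank of the chain      :", np.linalg.matrix_rank(chain))
print("min #activated neurons :", bound)
print("also bounded by m      :", m)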
To understand d(θ), one can parameterize the DNN, locally, with only d(θ) free parameters
while maintaining the same predictive model. The log-likelihood is a function of these d(θ)
parameters, and therefore its Hessian has at most rank d(θ). In theory, one can only reparameterize
M so that at one single point θ̂, the screen and radical distributions are separated based on
the coordinate chart. Such a chart may neither exist locally (in a neighborhood around θ̂) nor
globally.
The local dimensionality is not constant and may vary with θ. The global topology of the
neuromanifold is therefore like a stratifold [5, 18]. As θ has a large dimensionality in DNNs,
singularities are more likely to occur in M. Compared to the notion of intrinsic dimensionality [43],
our d(θ) is well-defined mathematically rather than based on empirical evaluations. One can
regard our local dimensionality as an upper bound of the intrinsic dimensionality, because a
very small singular value of I still counts towards the local dimensionality. Notice that random
matrices have full rank with probability 1 [19].
We can regard small singular values (below a prescribed threshold ε > 0) as ε-singular
dimensions, and use ε-rank defined below to estimate the local dimensionality.
Definition 2. The ε-rank of the FIM I(θ) is the number of eigenvalues of I(θ) which are not less
than some given ε > 0.
By definition, the ε-rank is a lower bound of the rank of the FIM, which depends on the θ-
parameterization — different parameterizations of the DNN may yield different ε-ranks of the
corresponding FIM. If ε → 0, the ε-rank of I(θ) becomes the true rank of I(θ) given by d(θ). The
spectral density ρI (probability distribution of the eigenvalues of I(θ)) affects the ε-rank of I(θ)
and the expected local dimensionality of M. On the support of ρI , the higher the probability of
the region [0, ε), the more likely M is singular. By the Cramér-Rao lower bound, the variance of
an unbiased 1D estimator θ̂ must satisfy
var(θ̂) ≥ I(θ)^{−1} ≥ 1/ε.
Therefore the ε-singular dimensions lead to a large variance of the estimator θ̂: a single observation
xi carries little or no information regarding θ, and it requires a large number of observations to
achieve the same precision. The notion of thresholding eigenvalues close to zero may depend on
the parameterization but the intrinsic ranks given by the local dimensionality are invariant.
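The ε-rank of definition 2 is straightforward to compute from the spectrum; the sketch below (our example, with a synthetic spectrum mimicking the concentration near zero reported in [36]) shows how it varies with the threshold ε.

import numpy as np

def eps_rank(fim, eps):
    """Number of eigenvalues of the psd matrix `fim` that are >= eps."""
    eigvals = np.linalg.eigvalsh(fim)
    return int(np.sum(eigvals >= eps))

rng = np.random.default_rng(3)
# A synthetic psd "FIM" with a pathological spectrum: a few large eigenvalues
# and many eigenvalues close to zero.
D = 200
spectrum = np.concatenate([rng.uniform(1.0, 10.0, size=5),
                           rng.exponential(1e-4, size=D - 5)])
Q, _ = np.linalg.qr(rng.normal(size=(D, D)))
I_theta = (Q * spectrum) @ Q.T               # Q diag(spectrum) Q^T

for eps in (1e-8, 1e-4, 1e-2):
    print(f"eps={eps:.0e}  eps-rank={eps_rank(I_theta, eps)}  (full size D={D})")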
In a DNN, there are several typical sources of singularities:
• First, if a neuron is saturated and gives constant output regardless of the input sample zi ,
then all dynamics of its input and output connections are in Rad(T M).
• Second, two neurons in the same layer can have linearly dependent output, e.g. when they
share the same weight vector and bias. They can be merged into one single neuron, as there
exists redundancy in the original parameterization.
• Third, if the activation function ϕ(·) is homogeneous, e.g. ReLU, then any neuron in the
DNN induces a reparametrization by multiplying the input links by α and output links by
1/α^k (k is the degree of homogeneity). This reparametrization corresponds to a null curve
in the neuromanifold parameterized by α.
• Fourth, certain structures such as recurrent neural networks (RNNs) suffer from vanishing
gradient [23]. As the FIM is the variance of the gradient of the log-likelihood (known as
variance of the score in statistics), its scale goes to zero along the dimensions associated
with such structures.
It is meaningful to formally define the notion of “lightlike neuromanifold”. Using geometric
tools, related studies can be invariant w.r.t. neural network reparametrization. Moreover, the
connection between neuromanifold and singular semi-Riemannian geometry, which is used in
general relativity, is not yet widely adopted in machine learning. For example, the textbook [78]
in singular statistics mainly used tools from algebraic geometry which is a different field.
Notice that the Fisher-Rao distance along a null curve is undefined because there the FIM is
degenerate and there is no arc-length reparameterization along null curves [37].
Notice that the first order term vanishes because θ̂ is a local optimum of log p(X | θ), and in the
second order term, −N J(θ̂) is the Hessian matrix of the likelihood function log p(X | θ) evaluated
at θ̂. At the MLE, J(θ̂) ⪰ 0, while in general the Hessian of the loss of a DNN evaluated at θ ̸= θ̂
can have a negative spectrum [2, 68].
Through a change of variable ϕ := √N(θ − θ̂), the density of ϕ is p(ϕ) = (1/√N) p(ϕ/√N + θ̂) so
that ∫_M p(ϕ) dϕ = 1. In the integration in eq. (13), the term −(N/2)(θ − θ̂)⊺J(θ̂)(θ − θ̂) has an
order of O(∥ϕ∥²). The cubic remainder term has an order of O((1/√N)∥ϕ∥³). If N is sufficiently
large, this remainder can be ignored. Therefore we can write
−log p(X) ≈ −log p(X | θ̂) − log E_p exp(−(N/2)(θ − θ̂)⊺J(θ̂)(θ − θ̂)).   (14)
On the RHS, the first term measures the error of the model w.r.t. the observed data X. The
second term measures the model complexity. We have the following bound.
Proposition 5. We have ∀θ ∈ M,
0 ≤ −log E_p exp(−(N/2)(θ − θ̂)⊺J(θ̂)(θ − θ̂)) ≤ (N/2) tr(J(θ̂)[(µ(θ) − θ̂)(µ(θ) − θ̂)⊺ + cov(θ)]),
where µ(θ) and cov(θ) denote the mean and covariance matrix of the prior p(θ), respectively.
Therefore the complexity is always non-negative and its scale is bounded by the prior p(θ).
The model has low complexity when θ̂ is close to the mean of p(θ) and/or when the variance of
p(θ) is small.
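A Monte Carlo sketch (our example, with an arbitrary low-rank J(θ̂) and a Gaussian prior) of the two-sided bound in proposition 5:

import numpy as np

rng = np.random.default_rng(4)
D, N = 5, 50
theta_hat = rng.normal(size=D)

# A low-rank psd "observed FIM" J(theta_hat), mimicking a singular model.
A = rng.normal(size=(D, 2))
J = A @ A.T                                   # rank 2 <= D

# A Gaussian prior p(theta) with mean mu and covariance Sigma.
mu = theta_hat + 0.1 * rng.normal(size=D)
Sigma = 0.05 * np.eye(D)
samples = rng.multivariate_normal(mu, Sigma, size=200_000)

diff = samples - theta_hat
quad = 0.5 * N * np.einsum('ij,jk,ik->i', diff, J, diff)
complexity = -np.log(np.mean(np.exp(-quad)))            # f-mean with f(t)=exp(-t)

upper = 0.5 * N * np.trace(J @ (np.outer(mu - theta_hat, mu - theta_hat) + Sigma))
print(f"0 <= {complexity:.3f} <= {upper:.3f}")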
Consider the prior p(θ) = κ(θ)/∫_M κ(θ) dθ, where κ(θ) > 0 is a positive measure on M so
that 0 < ∫_M κ(θ) dθ < ∞. Based on the above approximation of −log p(X), we arrive at a
general formula
O := −log p(X | θ̂) + log ∫_M κ(θ) dθ − log ∫_M κ(θ) exp(−(N/2)(θ − θ̂)⊺J(θ̂)(θ − θ̂)) dθ,   (15)
where “O” stands for Occam’s razor. Compared with previous formulations of MDL [7, 66, 67],
eq. (15) relies on a quadratic approximation of the log-likelihood function and can be instantiated
based on different assumptions of κ(θ). The non-normalized κ(θ) in Bayesian coding serves a
similar role to the luckiness function in NML coding [25], as they both incorporate prior knowledge
to favor certain parameters in the parameter space M.
Informally, the term ∫_M κ(θ) dθ gives the total capacity of models in M specified by the
improper prior κ(θ), up to constant scaling. For example, if κ(θ) is uniform on a subregion in
M, then ∫_M κ(θ) dθ corresponds to the size of this region w.r.t. the base measure dθ. The term
∫_M κ(θ) exp(−(N/2)(θ − θ̂)⊺J(θ̂)(θ − θ̂)) dθ gives the model capacity specified by the posterior
p(θ | X) ∝ p(θ)p(X | θ) ∝ κ(θ) exp(−(N/2)(θ − θ̂)⊺J(θ̂)(θ − θ̂)). It shrinks to zero when the
number N of observations increases. The last two terms in eq. (15) are the log-ratio between the
model capacity w.r.t. the prior and the capacity w.r.t. the posterior. A large log-ratio means
there are many distributions on M which have a relatively large value of κ(θ) but a small
value of κ(θ) exp(−(N/2)(θ − θ̂)⊺J(θ̂)(θ − θ̂)). The associated model is considered to have a high
complexity, meaning that only a small “percentage” of the models are helpful to describe the
given data.
DNNs have a large amount of symmetry: the parameter space consists of many pieces that
look exactly the same. This can be caused e.g. by permuting the neurons in the same layer.
Symmetry is a non-local property, unlike singularity, which is a local differential property. Our
O is not affected by the model size caused by symmetry, because these symmetric models are
counted in both the prior and the posterior, and the log-ratio in eq. (15) cancels out symmetric
models. Formally, suppose M has ζ symmetric pieces denoted by M_1, ···, M_ζ. Note any MLE on M_i is
mirrored on those ζ pieces. Then both integrations on the RHS of eq. (15) are multiplied by a
factor of ζ. Therefore O is invariant to symmetry.
Definition 3 (f-mean). Given a set T = {t_i}_{i=1}^n ⊂ R and a continuous and strictly monotonic
function f : R → R, the f-mean of T is
M_f(T) := f^{−1}((1/n) Σ_{i=1}^n f(t_i)).
The f-mean, also known as the quasi-arithmetic mean, was studied in [38, 51]; thus f-means are
also called Kolmogorov-Nagumo means [39]. By definition, the image of M_f(T) under f is the
arithmetic mean of the image of T under the same mapping. Therefore, M_f(T) is in between the
smallest and largest elements of T. If f(x) = x, then M_f becomes the arithmetic mean, which we
denote as T̄. We have the following bound.
Lemma 6. Given a real matrix T = (t_ij)_{n×m}, we use t_i to denote the i'th row of T, and t_{:,j} to
denote the j'th column of T. If f(t) = exp(−t), then
M_f(T) ≤ (1/m) Σ_{j=1}^m M_f(t_{:,j}) ≤ M_f({t̄_1, ···, t̄_n}) ≤ T̄,
where M_f(T) denotes the f-mean of all entries of T, t̄_i is the arithmetic mean of the row t_i, and T̄ is the arithmetic mean of all entries of T.
Particular attention should be given to the second “≤”. If the arithmetic mean of each row is
first evaluated, and then their f -mean is evaluated, we get an upper bound of the arithmetic mean
of the f -mean of the columns. In simple terms, for f (t) = exp(−t), the f -mean of arithmetic
mean is lower bounded by the arithmetic mean of the f-mean. The proof is straightforward from
Jensen's inequality, and by noting that −log Σ_i exp(−t_i) is a concave function of t. The last “≤”
leads to a proof of the upper bound in proposition 5. Lemma 6 pertains to the mean of a matrix.
It leads to bounds of the mean value of a bi-variable function w.r.t. a probability mass function
or a probability density function, or a mix of both. This is straightforward and omitted.
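A small numerical check (our example) of the inequalities discussed above for f(t) = exp(−t), namely that the arithmetic mean of the column-wise f-means is upper bounded by the f-mean of the row-wise arithmetic means, which in turn is upper bounded by the overall arithmetic mean:

import numpy as np

def f_mean(t):
    """Kolmogorov-Nagumo mean with f(t) = exp(-t): f^{-1}(mean(f(t)))."""
    return -np.log(np.mean(np.exp(-np.asarray(t))))

rng = np.random.default_rng(5)
T = rng.uniform(0.0, 3.0, size=(6, 4))          # an n x m matrix of reals

col_fmeans_avg = np.mean([f_mean(T[:, j]) for j in range(T.shape[1])])
fmean_of_row_means = f_mean(T.mean(axis=1))
overall_mean = T.mean()

assert col_fmeans_avg <= fmean_of_row_means <= overall_mean + 1e-12
print(f"{col_fmeans_avg:.4f} <= {fmean_of_row_means:.4f} <= {overall_mean:.4f}")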
Remark 6. All instances of “≤” in lemma 6 are derived from Jensen’s inequality. Consequently,
the gaps of these bounds shrink as the variance of the matrix elements decreases, where the specific
way of measuring the variance depends on the particular “≤”. For instance, the gap associated
with the second “≤” becomes smaller as the variance across each row ti reduces.
Remark 7. The second complexity term on the RHS of eq. (14) is the f-mean of the quadratic
term (N/2)(θ − θ̂)⊺J(θ̂)(θ − θ̂) w.r.t. the prior p(θ), where f(t) = exp(−t).

Based on the spectrum decomposition J(θ̂) = Σ_{j=1}^{rank(J(θ̂))} λ_j^+ v_j v_j⊺, where the positive eigenvalues
λ_j^+ := λ_j^+(J(θ̂)) and the eigenvectors v_j := v_j(θ̂) depend on the MLE θ̂, we further write
this term as
(N/2)(θ − θ̂)⊺J(θ̂)(θ − θ̂) = Σ_{j=1}^{rank(J(θ̂))} [λ_j^+/tr(J(θ̂))] · (N/2) tr(J(θ̂)) ⟨θ − θ̂, v_j⟩².
By lemma 6, we have
−log E_p exp(−(N/2)(θ − θ̂)⊺J(θ̂)(θ − θ̂)) ≥ −Σ_{j=1}^{rank(J(θ̂))} [λ_j^+/tr(J(θ̂))] log E_p exp(−(N/2) tr(J(θ̂)) ⟨θ − θ̂, v_j⟩²),
where the f-mean and the mean w.r.t. {λ_j^+/tr(J(θ̂))} are swapped on the RHS. Denote φ_j = ⟨θ − θ̂, v_j⟩,
which in matrix form is written as φ = V⊺(θ − θ̂), where V has orthonormal columns and the j'th
column of V is v_j. Each summand on the RHS can then be written as
−log E_{p(φ_j)} exp(−(N/2) tr(J(θ̂)) φ_j²),
where p(φ_j) is the marginal distribution of φ_j under the prior. This quantity is a variance-like
measure of p(θ) up to scaling, as it is the f-mean of (N/2) tr(J(θ̂)) φ_j² evaluated at point θ̂ along
the direction v_j. In summary, we get a lower bound of the model
complexity, which is tighter than the lower bound in proposition 5, given by
−log E_p exp(−(N/2)(θ − θ̂)⊺J(θ̂)(θ − θ̂)) ≥ −Σ_{j=1}^{rank(J(θ̂))} [λ_j^+/tr(J(θ̂))] log E_{p(φ_j)} exp(−(N/2) tr(J(θ̂)) φ_j²).   (16)
The RHS is determined by the quantity (N/2) tr(J(θ̂)) φ_j² after evaluating the f-mean and some
weighted mean, where φ_j is an orthogonal transformation of the local coordinates θ_i based on the
spectrum of J(θ̂). Recall that the trace of the observed FIM J(θ̂) means the overall amount of
information a random observation
contains w.r.t. the underlying model. Given the same sample
size N , a larger tr J(θ̂) indicates that the samples are more informative and the likelihood is
more sensitive to the choice of the parameters on M. Consequently, it is reasonable to regard the
model as more complex, because small changes of model parameters more easily lead to different
representations.
The bound in eq. (16) is tight when the variance of (N/2) tr(J(θ̂)) φ_j² w.r.t. the discrete distribution
{λ_j^+/tr(J(θ̂))} is small. In the case when J(θ̂) is rank-one, “≥” becomes “=”. In practice, the FIM
of DNNs exhibits a pathological spectrum [36], where most eigenvalues of J(θ̂) are near zero,
with a small fraction taking large values. This means that the ϵ-rank of J(θ̂) is limited, and
the distribution {λ_j^+/tr(J(θ̂))} has lower variance as compared to a uniform spectrum. This distinctive
property of DNNs offers some basis for considering the lower bound in eq. (16) as a proxy of the
model complexity.
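The sketch below (our example, with an arbitrary rank-3 matrix J(θ̂) and a Gaussian prior) compares, by Monte Carlo, the exact complexity term with the spectral lower bound of eq. (16):

import numpy as np

rng = np.random.default_rng(6)
D, N = 6, 30
theta_hat = rng.normal(size=D)

B = rng.normal(size=(D, 3))
J = B @ B.T                                        # psd, rank 3
eigval, eigvec = np.linalg.eigh(J)
pos = eigval > 1e-10
lam, V = eigval[pos], eigvec[:, pos]               # positive eigenvalues and eigenvectors
trJ = np.sum(lam)

# Gaussian prior p(theta) centred near theta_hat.
samples = theta_hat + 0.05 * rng.normal(size=(200_000, D))
diff = samples - theta_hat

quad = 0.5 * N * np.einsum('ij,jk,ik->i', diff, J, diff)
lhs = -np.log(np.mean(np.exp(-quad)))              # exact complexity term

phi = diff @ V                                     # phi_j = <theta - theta_hat, v_j>
per_dir = -np.log(np.mean(np.exp(-0.5 * N * trJ * phi**2), axis=0))
rhs = np.sum(lam / trJ * per_dir)                  # RHS of eq. (16)

print(f"complexity {lhs:.4f} >= spectral lower bound {rhs:.4f}")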
As θ̂ is the MLE, we have J(θ̂) = Î(θ̂). Recall from eq. (7) that the FIM Î(θ̂) is a numerical
average over all observed samples. We can have alternative lower bounds of the model complexity
based on lemma 6:
−log E_p exp(−(N/2)(θ − θ̂)⊺J(θ̂)(θ − θ̂))
≥ −(1/N) Σ_{i=1}^N log E_p exp(−(N/2)(θ − θ̂)⊺ (∂h^L(z_i)/∂θ)⊺ C_i (∂h^L(z_i)/∂θ) (θ − θ̂))
≥ −(1/N) Σ_{i=1}^N E_{p(y|z_i)} log E_p exp(−(N/2) (∂log p(y | z_i)/∂θ⊺ (θ − θ̂))²).   (17)
The bounds are obtained by swapping the f -mean with the numerical average of the samples, and
by swapping the f-mean with the expectation w.r.t. p(y | z_i). Therefore the model complexity can be
bounded by the average scale of the vector (∂h^L(z_i)/∂θ)(θ − θ̂), where θ ∼ p(θ). Note that ∂h^L(z_i)/∂θ is
the parameter-output Jacobian matrix, or a linear approximation of the neural network mapping
θ → h^L. The complexity lower bounds in eq. (17) mean how the local parameter change (θ − θ̂)
w.r.t. the prior p(θ) affects the output. If the output is sensitive to these parameter variations,
then the model is considered to have high complexity. In summary, the f -mean offers a powerful
tool to analyze our model complexity and obtain its approximations.
In eq. (15), we take κ(θ) = exp(−(1/2) θ⊺ diag(1/σ) θ), where diag(·) means a diagonal matrix
constructed with the given entries, and σ > 0 (elementwise). Equivalently, the associated prior is
p_G(θ) = G(θ | 0, diag(σ)), meaning a Gaussian distribution with mean 0 and covariance matrix
diag(σ). We further assume
(A4) M has a global coordinate chart and M is homeomorphic to RD .
Under these assumptions (see appendix I for the derivation), the razor becomes
O_G = −log p(X | θ̂) + (rank(J(θ̂))/2) log N + (1/2) Σ_{i=1}^{rank(J(θ̂))} log(λ_i^+(J(θ̂)diag(σ)) + 1/N) + O(1),   (18)
where λ_i^+(J(θ̂)diag(σ)) denotes the i'th positive eigenvalue of J(θ̂)diag(σ). Notice that
J(θ̂)diag(σ) and diag(√σ) J(θ̂) diag(√σ) share the same set of non-zero eigenvalues, and
the latter is psd with rank(J(θ̂)) positive eigenvalues.
In our razor expressions, all terms that do not scale with the sample size N or the number
of parameters D are discarded. The first two terms on the RHS are similar to BIC [70] up to
scaling. The complexity terms (second and third terms on the RHS of eq. (18)) do not scale with
D but are bounded by the rank of the Hessian, or the observed FIM. In other words, the radical
distribution associated with zero-eigenvalues of J(θ̂) does not affect the model complexity. This
is different from previous formulations of MDL [7, 66, 67] and BIC [70]. For example, the 2nd
term on the RHS of eq. (2) increases
linearly with D, while the 2nd term on the RHS of eq. (18)
increases linearly with rank(J(θ̂)) ≤ D.
Interestingly, if λ_i^+(J(θ̂)) < (1/σ_max)(1 − 1/N), the corresponding summand of the third term on
the RHS of eq. (18) becomes negative. In the extreme case when λ_i^+(J(θ̂)) tends to zero,
(1/2) log(σ_max λ_i^+(J(θ̂)) + 1/N) → −(1/2) log N, which cancels out the model complexity penalty in
the term (rank(J(θ̂))/2) log N. In other words, the corresponding parameter is added for free (without
increasing the model complexity). Informally, we call similar terms that are helpful in decreasing
the complexity while contributing to model flexibility the negative complexity.
We have
Σ_{i=1}^{rank(J(θ̂))} log(σ_min λ_i^+(J(θ̂)) + 1/N) ≤ Σ_{i=1}^{rank(J(θ̂))} log(λ_i^+(J(θ̂)diag(σ)) + 1/N) ≤ Σ_{i=1}^{rank(J(θ̂))} log(σ_max λ_i^+(J(θ̂)) + 1/N),
where σmax and σmin denote the largest and smallest elements of σ, respectively. Therefore the
term can be bounded based on the spectrum of J(θ̂). If σ = σ1, where σ > 0, then both of the
above “≤”s become equalities. In this case, we let D → ∞ and rewrite the razor in terms of the
spectrum density ρI (λ) of J(θ̂):
O_G = −log p(X | θ̂) + (rank(J(θ̂))/2) E_{ρ_I(λ)} log(Nσλ + 1) + O(1).   (19)
Note rank(J(θ̂)) = d̂(θ̂) is the local dimensionality at θ̂, which could have a smaller order than
D, especially when N is finite. If ρI (λ) is highly concentrated around 0 as shown in [36], then
the expectation of log (N σλ + 1) can be roughly approximated as zero. This approximation is
also linked to the low intrinsic complexity of DNNs.
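The sketch below (our example, with a synthetic exponential spectrum standing in for ρ_I) contrasts the complexity term of eq. (19) with the BIC-like (D/2) log N penalty of eq. (2):

import numpy as np

rng = np.random.default_rng(7)
D, N, sigma = 10_000, 1_000, 1.0

# A pathological spectrum: most eigenvalues of J are close to zero.
lam = rng.exponential(1e-3, size=D)
rank_J = D                                          # all eigenvalues are positive here

penalty_eq19 = 0.5 * rank_J * np.mean(np.log(N * sigma * lam + 1.0))
penalty_bic = 0.5 * D * np.log(N)

print(f"eq. (19) penalty ~ {penalty_eq19:.1f} nats")
print(f"BIC-like penalty ~ {penalty_bic:.1f} nats")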
The Gaussian prior pG is helpful to give simple and intuitive expressions of OG . However, the
problem in choosing p_G is twofold. First, it is not invariant. Under a reparametrization (e.g.
normalization or centering techniques), the Gaussian prior in the new parameter system does
not correspond to the original prior. Second, it double counts equivalent models. Because of
the many singularities of the neuromanifold, a small dynamic in the parameter system may not
change the prediction model. However, the Gaussian prior is defined in a real vector space and
may not fit in this singular semi-Riemannian structure. Gaussian distributions can be defined on
Riemannian manifolds [69], which leads to potential extensions of the discussed prior p_G(θ).
8 The Razor based on Jeffreys’ Non-informative Prior
Jeffreys' prior is specified by p_J(θ) ∝ √|I(θ)|. It is non-informative in the sense that no neural
network model θ_1 is prioritized over any other model θ_2. It is invariant to the choice of the
coordinate system. Under a reparameterization θ → η,
√|I(η)| dη = √|(∂θ/∂η)⊺ I(θ) (∂θ/∂η)| dη = √|I(θ)| · |∂θ/∂η| dη = √|I(θ)| dθ,
showing that the Riemannian volume element is the same in different coordinate systems.
Unfortunately, Jeffreys' prior is not well defined on the lightlike neuromanifold M, where
the metric I(θ) is degenerate and √|I(θ)| becomes zero. The stratifold structure of M, where
d(θ) varies with θ ∈ M, makes it difficult to properly define the base measure dθ and integrate
distribution S(T M), which has a Riemannian structure. We refer the reader to [34, 74] for other
extensions of Jeffreys’ prior.
In this paper, we take a simple approach by examining a submanifold of M denoted as M̃
and parameterized by ξ, which has a Riemannian metric I(ξ) ≻ 0 that is induced by the FIM
I(θ) ⪰ 0 and the mapping ξ → θ. The dimensionality of M̃ is upper-bounded by the local
dimensionality d(θ). Intuitively, any infinitesimal dynamic on M̃ means such a change of neural
network parameters that leads to a non-zero change of the global predictive model z → y. For
example, M̃ can be defined based on a subset of sensitive parameters. In theory, we would like
to construct M̃ so that it is representative of M, meaning that dim(M̃) is close to the local
dimensionality d(θ), and at the same time M̃ remains Riemannian. The following results are
constrained to the choice of the submanifold M̃.
In eq. (15), let κ(ξ) = √|I(ξ)|. We further assume
(A6) 0 < ∫_M̃ √|I(ξ)| dξ < ∞;
meaning that the Riemannian volume of M̃ is bounded. After straightforward derivations, we
arrive at
O_J(ξ) = −log p(X | ξ̂) + log ∫_M̃ √|I(ξ)| dξ − log ∫_M̃ exp(−(N/2)(ξ − ξ̂)⊺J(ξ̂)(ξ − ξ̂)) √|I(ξ)| dξ
       = −log p(X | ξ̂) + log ∫_M̃ √|I(ξ)| dξ − log ∫_M̃ ω(ξ) √|I(ξ)| dξ,   (20)
where ω(ξ) := exp(−(N/2)(ξ − ξ̂)⊺J(ξ̂)(ξ − ξ̂)) is a shorthand. Let us examine the meaning of
O_J(ξ). As I(ξ) is the Riemannian metric of M̃ based on information geometry, √|I(ξ)| dξ
is a Riemannian volume element (volume form). In the second term on the RHS of eq. (20),
the integral ∫_M̃ √|I(ξ)| dξ is the information volume, or the total “number” of different DNN
models [50] on M̃. In the last (third) term, because 0 < ω(ξ) ≤ 1, the integral on the LHS of
∫_M̃ ω(ξ) √|I(ξ)| dξ ≤ ∫_M̃ √|I(ξ)| dξ
gives a smaller, posterior-weighted volume.
Assume the spectrum decomposition J(ξ̂) = Q diag(λ_i^+(J(ξ̂))) Q⊺, where Q has orthonormal
columns, and λ_i^+(J(ξ̂)) are the positive eigenvalues of J(ξ̂). Equation (20) becomes
O_J(ζ) = −log p(X | ζ̂) + log ∫_M̃ √|I(ζ)| dζ − log ∫_M̃ exp(−(N/2) Σ_{i=1}^{rank(J(ξ̂))} λ_i^+(J(ξ̂)) (ζ_i − ζ̂_i)²) √|I(ζ)| dζ.   (21)
Rewriting the exponential in terms of a Gaussian density G(ξ | ξ̂, (1/N) J^{−1}(ξ̂)), eq. (20) can also be put in the form
O_J(ξ) = −log p(X | ξ̂) + (dim(M̃)/2) log(N/2π) + log ∫_M̃ √|I(ξ)| dξ − log ∫_M̃ (|I(ξ)|^{1/2}/|J(ξ̂)|^{1/2}) G(ξ | ξ̂, (1/N) J^{−1}(ξ̂)) dξ.   (22)
By assumption (A6), the RHS of eq. (20) is well defined, while the RHS of eq. (22) is only
meaningful for a full rank J(ξ̂). If J(ξ̂) is not invertible, one can consider the limit case when the
zero eigenvalues of J(ξ̂) are replaced by a small ϵ > 0 and still apply the expression in eq. (22).
One has to note that
∫_M̃ G(ξ | ξ̂, (1/N) J^{−1}(ξ̂)) dξ ≤ 1,
and that, as N grows, the Gaussian density concentrates around ξ̂, so that
−log ∫_M̃ (|I(ξ)|^{1/2}/|J(ξ̂)|^{1/2}) G(ξ | ξ̂, (1/N) J^{−1}(ξ̂)) dξ ≈ (1/2) log(|J(ξ̂)|/|I(ξ̂)|).   (23)
Under this approximation, eq. (22) gives the MDL criterion discussed in [7, 50], where the term
on the RHS of eq. (23) is interpreted as a penalty to models that lack robustness and are sensitive
to the choice of parameters. We therefore consider the spectrum of both matrices I(ξ) and J(ξ),
noting that in the large sample limit N → ∞, they become identical. Because of the finite N ,
the observed FIM J(ξ̂) is singular in potentially many directions. The true FIM I(ξ̂) can be
regarded as the sum of the observed FIM J(ξ̂) and the FIM w.r.t. unobserved samples, up to
a scaling factor. Based on how M̃ is constructed, I(ξ̂) ≻ 0 is positive definite and suffers less
from singularities. In the directions where J(ξ̂) is nearly singular, the log-ratio log |J(ξ̂)|/|I(ξ̂)|
contributes significantly and negatively to the model complexity. As a result, eq. (23) serves as a
negative complexity term and explains how singularities of J(ξ̂) correspond to the simplicity of
DNNs.
Compared with O_G, O_J is based on a more accurate geometric modeling. However, it is hard
to compute numerically, as it depends on how M̃ is constructed, and on I(ξ) and p(z), which
are unknown due to limited observations. Although O_G and O_J have different expressions,
their preference for model dimensions with small Fisher information (as in DNNs) is similar.
Hence, we can conclude that the intrinsic complexity of a DNN is affected by the singularity
and spectral properties of the Fisher information matrix.
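A generic illustration (our example, not the DNN setting) of the negative complexity term in eq. (23): when the observed FIM J is estimated from a few rank-one contributions whose expectation is the true FIM I, the log-ratio (1/2) log(|J|/|I|) is typically strongly negative.

import numpy as np

rng = np.random.default_rng(8)
d, N = 8, 12                                       # dimension of the submanifold, sample size

# True FIM I(xi): here taken to be the identity for simplicity.
I_true = np.eye(d)

# Observed FIM J(xi): average of N rank-one terms g g^T with E[g g^T] = I_true.
G = rng.normal(size=(N, d))
J_obs = (G.T @ G) / N                              # nearly singular when N is small

# Small jitter only to keep slogdet finite if J_obs happens to be singular.
logdet_J = np.linalg.slogdet(J_obs + 1e-12 * np.eye(d))[1]
logdet_I = np.linalg.slogdet(I_true)[1]
print("0.5 * log(|J| / |I|) =", 0.5 * (logdet_J - logdet_I))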
9 Related Work
The dynamics of supervised learning of a DNN describes a trajectory on the parameter space of
the DNN geometrically modeled as a manifold when endowed with the FIM (e.g., ordinary/natural
gradient descent learning the parameters of a MLP). Singular regions of the neuromanifold [79]
correspond to non-identifiable parameters with rank-deficient FIM, and the learning trajectory
typically exhibits chaotic patterns [4] with the singularities which translate into slowdown plateau
phenomena when plotting the loss function value against time. By building an elementary singular
DNN, [4] (and references therein) showed that stochastic gradient descent learning dynamics yields
a Milnor-type attractor with both attractor/repulser subregions where the learning trajectory is
attracted in the attractor region, then stays a long time there before escaping through the repulser
region. The natural gradient is shown to be free of critical slowdowns. Furthermore, although
DNNs have potentially many singular regions, the interaction of elementary units cancels out the
Milnor-type attractors. It was shown [55] that skip connections are helpful to reduce the effect
of singularities. However, a full understanding of the learning dynamics [81] for generic DNN
architectures with multiple output values or recurrent DNNs is yet to be investigated.
The MDL criterion has undergone several fundamental revisions, such as the original crude
MDL [66] and refined MDL with the introduction of stochastic complexity [8, 67], and the
NML [65, 71] as a modern refinement. We refer the reader to the book [25] for a comprehensive
introduction to this area and [24] for a recent review. We should also mention that the relationship
between MDL and generalization has been explored in the PAC-MDL framework [9, 11, 25, 26, 83].
See [24] (section 6.4) for related remarks.
The relationship between MDL and information geometry is well established [3, 7, 49, 50, 67].
For example, they both rely on fundamental concepts such as the Fisher information. The
geometric complexity of statistical models is commonly formulated using tools from information
geometry [1, 7, 49, 67]. The stochastic complexity in singular mixture models can be bounded [80]
and therefore is smaller than that of regular models. On this line of research, our derivations
based on a Taylor expansion of the log-likelihood are similar to [7]. This technique is also used
for deriving natural gradient optimization for deep learning [4, 45, 57].
Recently, MDL has been ported to deep learning [10] focusing on variational methods. Practical
techniques such as weight sharing [20], binarization [32], model compression [14], etc., follow
similar principles of MDL. In the same community, many efforts have been made to develop
a theory of deep learning, for example, based on PAC-Bayes theory [53], statistical learning
theory [82], algorithmic information theory [76], information geometry [44], geometry of the DNN
mapping [62], or through defining an intrinsic dimensionality [43] that is much smaller than the
network size. Our analysis depends on J(θ̂) and therefore is related to the flatness/sharpness of
the local minima [15, 30], which is known to affect generalization. Using advanced mathematical
tools such as random matrix theory, investigations are conducted on the spectrum of the input-
output Jacobian matrix [59], the Hessian matrix w.r.t. the neural network weights [58], and the
FIM [27, 35, 36, 56, 60].
10 Conclusion
We consider mathematical tools from singular semi-Riemannian geometry to study the locally
varying intrinsic dimensionality of a deep learning model. These models fall in the category of
non-identifiable parameterizations. We take a meaningful step to quantify geometric singularity
through the notion of local dimensionality d(θ) yielding a singular semi-Riemannian neuromanifold
with varying metric signature. We show that d(θ) grows at most linearly with the sample size
N . Recent findings show that the spectrum of the Fisher information matrix shifts towards
0+ with a large number of small eigenvalues. We show that these singular dimensions help to
reduce the model complexity. As a result, we contribute a simple and general MDL for deep
learning. It provides theoretical insights on the description length of DNNs. DNNs benefit from a
high-dimensional parameter space in that the singular dimensions impose a negative complexity
to describe the data, which can be seen in our derivations based on Gaussian and Jeffreys’ priors.
How the short description length is connected to the empirical performance of DNNs and to related
generalization bounds requires further examination. This is not addressed in the current work. A
more careful analysis of the FIM’s spectrum, e.g. through considering higher-order terms, could
give more practical formulations of the proposed criterion. We leave empirical studies as potential
future work.
where OneHot(y) is the binary vector with the same dimensionality as h^L(z_i), with the y'th bit
set to 1 and the rest of the bits set to 0. Therefore,
∂log p(y_i | z_i, θ)/∂θ = (∂h^L/∂θ)⊺ (OneHot(y_i) − SoftMax(h^L(z_i))).
Therefore,
∂²log p(y_i | z_i, θ)/∂θ∂θ⊺ = Σ_j (∂²h^L_j/∂θ∂θ⊺)(OneHot(y_i) − SoftMax(h^L(z_i)))_j − (∂h^L/∂θ)⊺ · C_i · (∂h^L/∂θ),   (24)
where
C_i = ∂SoftMax(h^L(z_i))/∂h^L(z_i) = diag(o_i) − o_i o_i⊺,   o_i = SoftMax(h^L(z_i)).
Therefore
∀i,   −∂²log p(y_i | z_i, θ)/∂θ∂θ⊺ = (∂h^L/∂θ)⊺ · C_i · (∂h^L/∂θ).
Taking the sample average on both sides, we get
J(θ̂) = Î(θ̂).
Therefore
E_p[α⊺ (∂h^L(z)/∂θ)⊺ C(z) (∂h^L(z)/∂θ) α] = 0.
Any eigenvector of C(z) associated with the zero eigenvalue must be a multiple of 1. Indeed,
v⊺C(z)v = v⊺(diag(o(z)) − o(z)o(z)⊺)v = Σ_j o_j(z)(v_j − o(z)⊺v)² = 0 ⇔ v ∝ 1,
where o_j(z) > 0 is the j'th element of o(z). Hence, almost surely
(∂h^L(z)/∂θ) α = λ(z)1.
Remark. α is associated with a tangent vector in Rad(T M), meaning a dynamic along the
lightlike dimensions. The Jacobian ∂h^L(z)/∂θ is the local linear approximation of the mapping
θ → h^L(z). By lemma 1, with probability 1 such a dynamic leads to uniform increments in
the output units, meaning h^L(z) → h^L(z) + λ(z)1, and therefore the output distribution
SoftMax(h^L(z)) is not affected. In summary, we have verified that the radical distribution does
not affect the predictive model.
Appendix C Proof of Proposition 2
Proof.
d̂(θ) = rank(Î(θ)) = rank(Σ_{i=1}^N (∂h^L(z_i)/∂θ)⊺ C_i (∂h^L(z_i)/∂θ))
≤ Σ_{i=1}^N rank((∂h^L(z_i)/∂θ)⊺ C_i (∂h^L(z_i)/∂θ)) ≤ Σ_{i=1}^N min{rank(∂h^L(z_i)/∂θ), rank(C_i)}.
Note the matrix ∂h^L(z_i)/∂θ has size m × D, and C_i has size m × m and rank (m − 1). We also
have d̂(θ) = rank(Î(θ)) ≤ D = dim(θ). Therefore
d̂(θ) ≤ Σ_{i=1}^N min{rank(∂h^L(z_i)/∂θ), m − 1}.
d(θ) = rank(I(θ)) = rank(E_{p(z)}[(∂h^L(z)/∂θ)⊺ C(z) (∂h^L(z)/∂θ)]).
That means d(θ) is the dimensionality of the image of E_{p(z)}[(∂h^L(z)/∂θ)⊺ C(z)(∂h^L(z)/∂θ)].
∀θ, we have
E_{p(z)}[(∂h^L(z)/∂θ)⊺ C(z)(∂h^L(z)/∂θ)] θ = E_{p(z)}[(∂h^L(z)/∂θ)⊺ C(z)(∂h^L(z)/∂θ) θ] = E_{p(z)}[(∂h^L(z)/∂θ)⊺ β(z)],
where β(z) = C(z)(∂h^L(z)/∂θ)θ is an m-dimensional vector. Therefore
E_{p(z)}[(∂h^L(z)/∂θ)⊺ C(z)(∂h^L(z)/∂θ)] θ ∈ span(∪_{z∈supp(p)} Row(∂h^L(z)/∂θ)).
Letting θ vary in R^D, and applying dim(·) on both sides, the statement follows immediately.
Therefore,
∀w ∈ R^{dim(w^l)}:   (∂h^L(z)/∂w^l) w = W^L Φ^{L−1} W^{L−1} ··· W^{l+1} Φ^l mat(w) (z^{l−1}; 1),
where mat(w) reshapes w into the matrix (W^l, b^l) and (z^{l−1}; 1) appends 1 to z^{l−1}. Hence
rank(∂h^L(z)/∂w^l) ≤ rank(W^L Φ^{L−1} W^{L−1} ··· W^{l+1} Φ^l) ≤ min_{s=l}^{L−1} rank(Φ^s).
where ℓ is the log-likelihood, and ℓ_i = log p(y_i | z_i, θ). We write the analytical form of the
elementwise Hessian
∂²ℓ_i/∂θ∂θ⊺ = Σ_{j=1}^m (∂²h^L_j(z_i)/∂θ∂θ⊺)(OneHot_j(y) − SoftMax_j(h^L)) − I(θ),
where OneHot(·) denotes the one-hot vector associated with the given target label y. Therefore
α⊺ (∂²ℓ_i/∂θ∂θ⊺) α = Σ_{j=1}^m (α⊺ (∂²h^L_j(z_i)/∂θ∂θ⊺) α)(OneHot_j(y) − SoftMax_j(h^L)) − α⊺ I(θ) α.
Because of the first term on the RHS, the kernels of the two matrices J(θ) and Î(θ) are different,
and thus their ranks are also different.
Appendix G Proof of Proposition 5
Proof. As θ̂ is the MLE, we have J(θ̂) ⪰ 0, and ∀θ ∈ M,
−(N/2)(θ − θ̂)⊺J(θ̂)(θ − θ̂) ≤ 0.
Hence,
E_p exp(−(N/2)(θ − θ̂)⊺J(θ̂)(θ − θ̂)) ≤ 1.
Hence,
−log E_p exp(−(N/2)(θ − θ̂)⊺J(θ̂)(θ − θ̂)) ≥ 0.
This proves the first “≤”.
As −log(x) is convex, by Jensen's inequality, we get
−log E_p exp(−(N/2)(θ − θ̂)⊺J(θ̂)(θ − θ̂))
≤ E_p[−log exp(−(N/2)(θ − θ̂)⊺J(θ̂)(θ − θ̂))]
= E_p[(N/2)(θ − θ̂)⊺J(θ̂)(θ − θ̂)]
= (N/2) tr(E_p[J(θ̂)(θ − θ̂)(θ − θ̂)⊺])
= (N/2) tr(J(θ̂)[(µ(θ) − θ̂)(µ(θ) − θ̂)⊺ + cov(θ)]).
This proves the second “≤”.
This proves the first “≤”. To prove the second “≤”, we note that −log((1/n) Σ_{i=1}^n exp(−t_i)) is a
concave function. Therefore
(1/m) Σ_{j=1}^m M_f(t_{:,j}) = (1/m) Σ_{j=1}^m [−log((1/n) Σ_{i=1}^n exp(−t_{ij}))]
≤ −log((1/n) Σ_{i=1}^n exp(−(1/m) Σ_{j=1}^m t_{ij})) = M_f({t̄_1, ···, t̄_n}).
The last “≤” is based on the convexity of −log t. Once again, by Jensen's inequality, we have
M_f({t̄_1, ···, t̄_n}) = −log((1/n) Σ_{i=1}^n exp(−t̄_i)) ≤ (1/n) Σ_{i=1}^n (1/m) Σ_{j=1}^m t_{ij} = T̄.
Appendix I Derivations of OG
We recall the general formulation in eq. (15):
O := −log p(X | θ̂) + log ∫_M κ(θ) dθ − log ∫_M κ(θ) exp(−(N/2)(θ − θ̂)⊺J(θ̂)(θ − θ̂)) dθ.
The second term on the RHS is
log ∫_M κ(θ) dθ = log ∫_M exp(−(1/2) θ⊺ diag(1/σ) θ) dθ
= (D/2) log 2π + (1/2) log|diag(σ)| + log ∫_M exp(−(D/2) log 2π − (1/2) log|diag(σ)| − (1/2) θ⊺ diag(1/σ) θ) dθ
= (D/2) log 2π + (1/2) log|diag(σ)| + log 1 = (D/2) log 2π + (1/2) log|diag(σ)|.
The third (last) term on the RHS is
−log ∫_M κ(θ) exp(−(N/2)(θ − θ̂)⊺J(θ̂)(θ − θ̂)) dθ
= −log ∫_M exp(−(1/2) θ⊺ diag(1/σ) θ − (N/2)(θ − θ̂)⊺J(θ̂)(θ − θ̂)) dθ
= −log ∫_M exp(−(1/2) θ⊺ A θ + b⊺θ + c) dθ,
where
A = N J(θ̂) + diag(1/σ) ≻ 0,   b = N J(θ̂) θ̂,   c = −(N/2) θ̂⊺ J(θ̂) θ̂.
Then,
−log ∫_M κ(θ) exp(−(N/2)(θ − θ̂)⊺J(θ̂)(θ − θ̂)) dθ
= −log ∫_M exp(−(1/2)(θ − θ̄)⊺A(θ − θ̄) + c + (1/2) θ̄⊺Aθ̄) dθ
= −(D/2) log 2π + (1/2) log|A| − c − (1/2) θ̄⊺Aθ̄ − log ∫_M exp(−(D/2) log 2π + (1/2) log|A| − (1/2)(θ − θ̄)⊺A(θ − θ̄)) dθ
= −(D/2) log 2π + (1/2) log|A| − c − (1/2) θ̄⊺Aθ̄,
where Aθ̄ = b. To sum up,
O_G = −log p(X | θ̂) + (D/2) log 2π + (1/2) log|diag(σ)| − (D/2) log 2π + (1/2) log|A| − c − (1/2) θ̄⊺Aθ̄
= −log p(X | θ̂) + (1/2) log|diag(σ)| + (1/2) log|A| − c − (1/2) θ̄⊺Aθ̄
= −log p(X | θ̂) + (1/2) log|diag(σ)| + (1/2) log|N J(θ̂) + diag(1/σ)| + (N/2) θ̂⊺J(θ̂)θ̂ − (1/2) (N J(θ̂)θ̂)⊺ (N J(θ̂) + diag(1/σ))^{−1} (N J(θ̂)θ̂)
= −log p(X | θ̂) + (1/2) log|N J(θ̂)diag(σ) + I| + (1/2) θ̂⊺ J(θ̂) (J(θ̂) + (1/N) diag(1/σ))^{−1} diag(1/σ) θ̂
= −log p(X | θ̂) + (1/2) log|N J(θ̂)diag(σ) + I| + (1/2) θ̂⊺ J(θ̂) (diag(σ) J(θ̂) + (1/N) I)^{−1} θ̂.
The last term does not scale with N and has a smaller order as compared to other terms. Indeed,
lim_{N→∞} J(θ̂) (J(θ̂) + (1/N) diag(1/σ))^{−1} = J(θ̂) J(θ̂)^+.
By assumption (A5), the RHS is O(1). This term is therefore dropped. We get
O_G = −log p(X | θ̂) + (1/2) log|N J(θ̂)diag(σ) + I| + O(1).
Note that rank(J(θ̂)) ≤ D, and the matrix J(θ̂)diag(σ) has the same rank as J(θ̂). We can
write J(θ̂) = L(θ̂)L(θ̂)⊺, where L(θ̂) has shape D × rank(J(θ̂)). We abuse I to denote both
the identity matrix of shape D × D and the identity matrix of shape rank(J(θ̂)) × rank(J(θ̂)).
By the Weinstein–Aronszajn identity,
O_G = −log p(X | θ̂) + (1/2) log|N L(θ̂)L(θ̂)⊺ diag(σ) + I| + O(1)
= −log p(X | θ̂) + (1/2) log|N L(θ̂)⊺ diag(σ) L(θ̂) + I| + O(1)
= −log p(X | θ̂) + (rank(J(θ̂))/2) log N + (1/2) log|L(θ̂)⊺ diag(σ) L(θ̂) + (1/N) I| + O(1).
Denote the largest and smallest elements of σ as σ_max and σ_min, respectively. Then,
σ_min L(θ̂)⊺L(θ̂) ⪯ L(θ̂)⊺ diag(σ) L(θ̂) ⪯ σ_max L(θ̂)⊺L(θ̂).
Hence,
(1/2) log|L(θ̂)⊺ diag(σ) L(θ̂) + (1/N) I| ≤ (1/2) log|σ_max L(θ̂)⊺L(θ̂) + (1/N) I| = (1/2) Σ_{i=1}^{rank(J(θ̂))} log(σ_max λ_i^+(J(θ̂)) + 1/N).
Similarly,
(1/2) log|L(θ̂)⊺ diag(σ) L(θ̂) + (1/N) I| ≥ (1/2) Σ_{i=1}^{rank(J(θ̂))} log(σ_min λ_i^+(J(θ̂)) + 1/N).
If σ = σ1, then σ_max = σ_min = σ. Both “≤” and “≥” in the above inequalities become tight.
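The determinant identity and the σ_min/σ_max bounds used above can be checked numerically; the sketch below (our example, with arbitrary small sizes) does so for a random low-rank J = LL⊺:

import numpy as np

rng = np.random.default_rng(9)
D, r, N = 7, 3, 100
L = rng.normal(size=(D, r))
J = L @ L.T                                        # psd with rank r
sigma = rng.uniform(0.5, 2.0, size=D)

lhs = np.linalg.slogdet(N * J @ np.diag(sigma) + np.eye(D))[1]
rhs = np.linalg.slogdet(N * L.T @ np.diag(sigma) @ L + np.eye(r))[1]
print("Weinstein-Aronszajn identity holds:", np.isclose(lhs, rhs))

# Bounds via the positive spectrum of J and the extreme elements of sigma.
lam = np.linalg.eigvalsh(J)[-r:]                   # the r positive eigenvalues of J
lower = np.sum(np.log(sigma.min() * lam + 1.0 / N)) + r * np.log(N)
upper = np.sum(np.log(sigma.max() * lam + 1.0 / N)) + r * np.log(N)
print("lower <= logdet <= upper:", lower <= rhs <= upper)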
is the sub-manifold associated with the frame θ^s = (θ_1, ···, θ_d), so that T M^s = S(T M), and
the induced Riemannian volume element is
dθ^s = √|I(θ^s)| dθ_1 ∧ dθ_2 ∧ ··· ∧ dθ_d = √|I(θ^s)| d_E θ^s,   (25)
where d_E θ is the Euclidean volume element. We artificially shift I(θ) to be positive definite and
define the volume element as
dθ := √|I(θ) + ε_1 I| dθ_1 ∧ dθ_2 ∧ ··· ∧ dθ_D = √|I(θ) + ε_1 I| d_E θ,   (26)
where ε_1 > 0 is a very small value as compared to the scale of I(θ) given by (1/D) tr(I(θ)), i.e. the
average of its eigenvalues.
average of its eigenvalues. Notice this element will vary with θ: different coordinate systems will
yield different volumes. Therefore it depends on how θ can be uniquely specified. This is roughly
guaranteed by our assumption that the θ-coordinates correspond to the input coordinates (weights
and biases) up to an orthogonal transformation. Despite that eq. (26) is a loose mathematical
definition, it makes intuitive sense and is convenient for making derivations. Then, we can
integrate functions Z Z
(27)
p
f (θ)dθ = f (θ) |I(θ) + ε1 I| dE θ,
M
"razor" of the model $\mathcal{M}_s$. However, we will instead use a Gaussian-like prior, because Jeffreys'
prior is not well defined on $\mathcal{M}$. Moreover, the integral $\int_{\mathcal{M}_s}\sqrt{|I(\theta_s)|}\,\mathrm{d}_E\theta_s$ is likely to diverge
based on our revised volume element in eq. (26). If the parameter space is real-valued, one can
easily check that the volume based on eq. (26) along the lightlike dimensions will diverge. The
zero-centered Gaussian prior corresponds to a better code, because it is commonly acknowledged
that one can achieve the same training error and generalization without using large weights. For
example, regularizing the norm of the weights is widely used in deep learning. By using such an
informative prior, one can have the same training error in the first term in eq. (2), while having a
smaller "complexity" in the rest of the terms, because we only encode such models with constrained
weights. Given the DNN, we define an informative prior on the lightlike neuromanifold
$$p(\theta) = \frac{1}{V}\exp\left(-\frac{1}{2\varepsilon_2^2}\|\theta\|^2\right)\sqrt{|I(\theta)+\varepsilon_1 I|}. \qquad (29)$$
Here, the base measure is the Euclidean volume element $\mathrm{d}_E\theta$, as $\sqrt{|I(\theta)+\varepsilon_1 I|}$ already appears
in $p(\theta)$. Keep in mind, again, that this $p(\theta)$ is defined in a special coordinate system and is not
invariant to re-parametrization. This distribution is also isotropic in the input coordinate system,
which agrees with initialization techniques$^7$.
7 Different layers, or weights and biases, may use different variance in their initialization. This minor issue can
This bi-parametric prior connects Jeffreys’ prior (that is widely used in MDL) and a Gaussian
prior (that is widely used in deep learning). If ε2 → ∞, ε1 → 0, it coincides with Jeffreys’ prior (if
it is well defined and I(θ) has full rank); if ε1 is large, the metric (I(θ) + ε1 I) becomes spherical,
and eq. (29) becomes a Gaussian prior. We refer the reader to [34, 74] for other extensions of
Jeffreys’ prior.
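A minimal numerical illustration of the prior in eq. (29), assuming a toy two-parameter model with a constant singular FIM (all values below are hypothetical, and the normalizer $V$ is omitted): the $\sqrt{|I(\theta)+\varepsilon_1 I|}$ factor stays small but positive along the lightlike direction, while the Gaussian envelope with radius $\varepsilon_2$ keeps the density integrable.

```python
import numpy as np

eps1, eps2 = 1e-3, 2.0

def fim(theta):
    """Toy constant FIM with one lightlike direction (the second coordinate)."""
    return np.array([[2.0, 0.0],
                     [0.0, 0.0]])

def unnormalized_prior(theta):
    # exp(-||theta||^2 / (2 eps2^2)) * sqrt(det(I(theta) + eps1 * Id)), cf. eq. (29).
    reg = fim(theta) + eps1 * np.eye(2)
    return np.exp(-theta @ theta / (2 * eps2**2)) * np.sqrt(np.linalg.det(reg))

print(unnormalized_prior(np.array([0.0, 0.0])))   # peak of the Gaussian envelope
print(unnormalized_prior(np.array([0.0, 5.0])))   # decays along the lightlike direction
```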
The normalizing constant of eq. (29) is an information volume measure of $\mathcal{M}$, given by
$$V := \int_{\mathcal{M}}\exp\left(-\frac{1}{2\varepsilon_2^2}\|\theta\|^2\right)\mathrm{d}\theta. \qquad (30)$$
Unlike Jeffreys' prior, whose information volume (the 3rd term on the RHS of eq. (2)) can be
unbounded, this volume can be bounded as stated in the following theorem.

Theorem 7.
$$\left(\sqrt{2\pi\varepsilon_1}\,\varepsilon_2\right)^D \le V \le \left(\sqrt{2\pi(\varepsilon_1+\lambda_m)}\,\varepsilon_2\right)^D, \qquad (31)$$
where $\lambda_m$ is the largest eigenvalue of the FIM $I(\theta)$.
Notice that $\lambda_m$ may not exist, as the integration is taken over $\theta\in\mathcal{M}$. Intuitively, $V$ is a weighted
volume w.r.t. a Gaussian-like prior distribution on $\mathcal{M}$, while the 3rd term on the RHS of eq. (2) is
an unweighted volume. The larger the radius $\varepsilon_2$, the more possible DNNs are
included; the larger the parameter $\varepsilon_1$, the larger the local volume element in eq. (26),
and therefore the larger the measured total volume. $\log V$ is an $O(D)$ term, meaning that the volume
grows with the number of dimensions.
By the AM–GM inequality applied to the eigenvalues $\lambda_i$ of $I(\theta)$,
$$\sqrt{|I(\theta)+\varepsilon_1 I|} = \left(\left(\prod_{i=1}^{D}(\lambda_i+\varepsilon_1)\right)^{\frac{1}{D}}\right)^{\frac{D}{2}} \le \left(\frac{1}{D}\mathrm{tr}(I(\theta)) + \varepsilon_1\right)^{\frac{D}{2}}.$$
Therefore
$$V \le \left(\sqrt{2\pi}\,\varepsilon_2\right)^D\left(\frac{1}{D}\mathrm{tr}(I(\theta)) + \varepsilon_1\right)^{\frac{D}{2}}.$$
If one applies $\frac{1}{D}\mathrm{tr}(I(\theta)) \le \lambda_m$ to the RHS, the upper bound is further relaxed as
$$V \le \left(\sqrt{2\pi}\,\varepsilon_2\right)^D(\lambda_m+\varepsilon_1)^{\frac{D}{2}} = \left(\sqrt{2\pi(\varepsilon_1+\lambda_m)}\,\varepsilon_2\right)^D.$$
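The pointwise bounds used in this proof can be checked on a random positive semi-definite matrix (a sketch with arbitrary sizes, unrelated to any particular network):

```python
import numpy as np

rng = np.random.default_rng(6)
D, eps1 = 5, 1e-2
B = rng.normal(size=(D, 3))
I_theta = B @ B.T                              # a psd, rank-deficient toy FIM
lam = np.linalg.eigvalsh(I_theta)

vol = np.sqrt(np.prod(lam + eps1))             # sqrt|I(theta) + eps1 I|
amgm = (lam.mean() + eps1) ** (D / 2)          # (tr(I)/D + eps1)^(D/2), by AM-GM
top = (lam.max() + eps1) ** (D / 2)            # (lambda_m + eps1)^(D/2)
assert vol <= amgm + 1e-12 <= top + 1e-12
```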
In the last term on the RHS, the expression inside the parentheses is a quadratic function w.r.t. $\theta$. However, the
integration is w.r.t. the non-Euclidean volume element $\mathrm{d}\theta$ and therefore does not have a closed
form. We need to assume that
$$\frac{1}{V},\qquad \sqrt{|I(\theta)+\varepsilon_1 I|}\approx\sqrt{|I(\hat{\theta})+\varepsilon_1 I|},\qquad \exp\left(\log p(X\mid\hat{\theta})\right) = p(X\mid\hat{\theta})$$
can all be taken out of the integration as constant scalars, as they do not (or approximately do not) depend on $\theta$. The main
difficulty is to perform the integration
\begin{align*}
&\int\exp\left(-\frac{\|\theta\|^2}{2\varepsilon_2^2} - \frac{N}{2}(\theta-\hat{\theta})^\top J(\hat{\theta})(\theta-\hat{\theta})\right)\mathrm{d}_E\theta\\
&= \int\exp\left(-\frac{1}{2}\theta^\top A\theta + b^\top\theta + c\right)\mathrm{d}_E\theta\\
&= \int\exp\left(-\frac{1}{2}(\theta-A^{-1}b)^\top A(\theta-A^{-1}b) + \frac{1}{2}b^\top A^{-1}b + c\right)\mathrm{d}_E\theta\\
&= \exp\left(\frac{1}{2}b^\top A^{-1}b + c\right)\int\exp\left(-\frac{1}{2}(\theta-A^{-1}b)^\top A(\theta-A^{-1}b)\right)\mathrm{d}_E\theta\\
&= \exp\left(\frac{1}{2}b^\top A^{-1}b + c\right)\exp\left(\frac{D}{2}\log 2\pi - \frac{1}{2}\log|A|\right)\\
&= \exp\left(\frac{1}{2}b^\top A^{-1}b + c + \frac{D}{2}\log 2\pi - \frac{1}{2}\log|A|\right),
\end{align*}
where
$$A = N J(\hat{\theta}) + \frac{1}{\varepsilon_2^2}I,\qquad b = N J(\hat{\theta})\hat{\theta},\qquad c = -\frac{1}{2}\hat{\theta}^\top N J(\hat{\theta})\hat{\theta}.$$
The rest of the derivations are straightforward. Note that $R = -c - \frac{1}{2}b^\top A^{-1}b$.
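The Gaussian-integral identity above can be cross-checked against brute-force numerical integration in a low-dimensional case; the sketch below uses $D=2$ and arbitrary values of $N$, $\varepsilon_2$, $J(\hat{\theta})$ and $\hat{\theta}$:

```python
import numpy as np

rng = np.random.default_rng(7)
D, N, eps2 = 2, 20, 1.5
theta_hat = rng.normal(size=D)
Lr = rng.normal(size=(D, 1))
J = Lr @ Lr.T                                   # rank-1 J(theta_hat)

A = N * J + np.eye(D) / eps2**2
b = N * J @ theta_hat
c = -0.5 * theta_hat @ (N * J) @ theta_hat

# Closed form: exp(1/2 b^T A^{-1} b + c + D/2 log(2 pi) - 1/2 log|A|).
closed = np.exp(0.5 * b @ np.linalg.solve(A, b) + c
                + 0.5 * D * np.log(2 * np.pi) - 0.5 * np.linalg.slogdet(A)[1])

# Brute-force Riemann sum over a wide grid.
grid = np.linspace(-12.0, 12.0, 1201)
dx = grid[1] - grid[0]
X, Y = np.meshgrid(grid, grid, indexing="ij")
P = np.stack([X.ravel(), Y.ravel()], axis=1)
expo = -0.5 * np.einsum('ij,jk,ik->i', P, A, P) + P @ b + c
brute = np.sum(np.exp(expo)) * dx * dx
assert np.isclose(closed, brute, rtol=1e-3)
```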
After derivations and simplifications, we get
$$-\log p(X) \approx -\log p(X\mid\hat{\theta}) + \frac{D}{2}\log\frac{N}{2\pi} + \log V
+ \frac{1}{2}\log\left|J(\hat{\theta}) + \frac{1}{N\varepsilon_2^2}I\right| - \frac{1}{2}\log\left|I(\hat{\theta}) + \varepsilon_1 I\right| + R. \qquad (32)$$
We need to analyze the order of this $R$ term. Assume the largest eigenvalue of $J(\hat{\theta})$ is $\lambda_m$; then
$$|R| \le \frac{N\lambda_m}{\varepsilon_2^2 N\lambda_m + 1}\|\hat{\theta}\|^2. \qquad (34)$$
We assume

(A8) The ratio of the scale of each dimension of the MLE $\hat{\theta}$ to $\varepsilon_2$, i.e. $\hat{\theta}_i/\varepsilon_2$ ($i = 1,\cdots,D$), is of order $O(1)$.

Intuitively, the scale parameter $\varepsilon_2$ in our prior $p(\theta)$ in eq. (29) is chosen to "cover" the good
models. Therefore, the order of $R$ is $O(D)$. As $N$ becomes large, $R$ will be dominated by the 2nd
$O(D\log N)$ term. We will therefore discard $R$ for simplicity. It could be useful for a more delicate
analysis. In conclusion, we arrive at the following expression:
$$O := -\log p(X\mid\hat{\theta}) + \frac{D}{2}\log\frac{N}{2\pi} + \log V
+ \frac{1}{2}\log\frac{\left|J(\hat{\theta}) + \frac{1}{N\varepsilon_2^2}I\right|}{\left|I(\hat{\theta}) + \varepsilon_1 I\right|}. \qquad (35)$$
Notice the similarity with eq. (2), where the first two terms on the RHS are exactly the same.
The 3rd term is an $O(D)$ term, similar to the 3rd term in eq. (2); it is bounded according to
theorem 7, while the 3rd term in eq. (2) could be unbounded. Our last term has a similar form to
the last term in eq. (2), except that it is well defined on the lightlike manifold. If we let $\varepsilon_2\to\infty$ and $\varepsilon_1\to 0$,
we get exactly eq. (2), and in this case $O = \chi$. As the number of parameters $D$ becomes large, both
the 2nd and 3rd terms grow linearly w.r.t. $D$, meaning that they contribute positively to
the model complexity. Interestingly, the fourth term is a "negative complexity". Regard $\frac{1}{N\varepsilon_2^2}$ and
$\varepsilon_1$ as small positive values. The fourth term is essentially a log-ratio of the observed FIM to
the true FIM. For small models, they coincide, because the sample size $N$ is large relative to the
model size; in this case, the effect of this term is minor. For DNNs, the sample size $N$ is very
limited relative to the huge model size $D$. Along a dimension $\theta_i$, $J(\theta)$ is likely to be singular as
stated in proposition 2, even if $I(\theta)$ has a very small positive value along that dimension. In this case, their log-ratio will
be negative. Therefore, the razor $O$ favors DNNs with their Fisher spectrum clustered around 0.
In fig. 2, model C displays the concept of a DNN, where there are many good local optima. The
performance is not sensitive to the specific values of the model parameters. On the lightlike neuromanifold
$\mathcal{M}$, there are many directions that are very close to being lightlike. When a DNN model varies
along these directions, the model slightly changes in terms of $I(\theta)$, but its predictions on the
samples, as measured by $J(\theta)$, are invariant. These directions count negatively towards the complexity,
because these extra degrees of freedom (dimensions of $\theta$) occupy almost zero volume in the geometric sense,
and they help give a shorter code to future unseen samples.
Figure 2: A: a model far from the truth (underlying distribution of observed data); B: close to
the truth but sensitive to parameters; C (deep learning): close to the truth with many good local
optima.
To obtain a simpler expression, we consider the case that $I(\theta)\equiv I(\hat{\theta})$ is both constant and
diagonal in the region of interest defined by eq. (29). In this case,
$$\log V \approx \frac{D}{2}\log 2\pi + D\log\varepsilon_2 + \frac{1}{2}\log|I(\hat{\theta}) + \varepsilon_1 I|. \qquad (36)$$
On the other hand, as $D\to\infty$, the spectrum of the FIM $I(\theta)$ will follow the density $\rho_I(\lambda)$. We
plug these expressions into eq. (35), discard all lower-order terms, and get a simplified version of
the razor
$$O \approx -\log p(X\mid\hat{\theta}) + \frac{D}{2}\log N + \frac{D}{2}\int_0^\infty \rho_I(\lambda)\log\left(\lambda + \frac{1}{N\varepsilon_2^2}\right)\mathrm{d}\lambda, \qquad (37)$$
where $\rho_I$ denotes the spectral density of the Fisher information matrix.
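The spectral term in eq. (37) can be estimated from an empirical Fisher spectrum. The sketch below (a hypothetical spectrum and constants, for illustration only) shows that a spectrum clustered near zero makes this term strongly negative, i.e. a "negative complexity" that shortens the description length:

```python
import numpy as np

rng = np.random.default_rng(8)
D, N, eps2 = 1000, 10_000, 1.0

# Hypothetical Fisher spectrum clustered near zero: most eigenvalues tiny, a few large.
eigs = np.concatenate([rng.uniform(0.0, 1e-4, size=950),
                       rng.uniform(1.0, 10.0, size=50)])

# Monte Carlo estimate of (D/2) * integral rho_I(lambda) log(lambda + 1/(N eps2^2)) d lambda.
spectral_term = 0.5 * D * np.mean(np.log(eigs + 1.0 / (N * eps2**2)))
print(spectral_term)   # strongly negative for this spectrum, reducing the razor O
```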
References
[1] Hirotugu Akaike. A new look at the statistical model identification. IEEE Trans. Automat.
Contr., 19(6):716–723, 1974.
[2] Guillaume Alain, Nicolas Le Roux, and Pierre-Antoine Manzagol. Negative eigenvalues of
the Hessian in deep neural networks. In ICLR’18 workshop, 2018. arXiv:1902.02366 [cs.LG].
[3] Shun-ichi Amari. Information Geometry and Its Applications, volume 194 of Applied Mathe-
matical Sciences. Springer, Japan, 2016.
[4] Shun-ichi Amari, Tomoko Ozeki, Ryo Karakida, Yuki Yoshida, and Masato Okada. Dynamics
of learning in MLP: Natural gradient and singularity revisited. Neural Computation, 30(1):1–
33, 2018.
[5] Toshiki Aoki and Katsuhiko Kuribayashi. On the category of stratifolds. Cahiers de
Topologie et Géométrie Différentielle Catégoriques, LVIII(2):131–160, 2017. arXiv:1605.04142
[math.CT].
[6] Oguzhan Bahadir and Mukut Mani Tripathi. Geometry of lightlike hypersurfaces of a
statistical manifold, 2019. arXiv:1901.09251 [math.DG].
[7] Vijay Balasubramanian. MDL, Bayesian inference and the geometry of the space of probability
distributions. In Advances in Minimum Description Length: Theory and Applications, pages
81–98. MIT Press, Cambridge, Massachusetts, 2005.
[8] A. Barron, J. Rissanen, and Bin Yu. The minimum description length principle in coding
and modeling. IEEE Transactions on Information Theory, 44(6):2743–2760, 1998.
[9] A.R. Barron and T.M. Cover. Minimum complexity density estimation. IEEE Transactions
on Information Theory, 37(4):1034–1054, 1991.
[10] Léonard Blier and Yann Ollivier. The description length of deep learning models. In Advances
in Neural Information Processing Systems 31, pages 2216–2226. Curran Associates, Inc., NY
12571, USA, 2018.
[11] A. Blum and J. Langford. PAC-MDL bounds. In Proc. Sixteenth Conf. Learning Theory
(COLT’ 03), pages 344–357, 2003.
[16] Krishan Duggal. A review on unique existence theorems in lightlike geometry. Geometry,
2014, 2014. Article ID 835394.
[17] Krishan Duggal and Aurel Bejancu. Lightlike Submanifolds of Semi-Riemannian Manifolds
and Applications, volume 364 of Mathematics and Its Applications. Springer, Netherlands,
1996.
[18] Pascal Mattia Esser and Frank Nielsen. Towards modeling and resolving singular parameter
spaces using stratifolds. arXiv preprint arXiv:2112.03734, 2021.
[19] Xinlong Feng and Zhinan Zhang. The rank of a random matrix. Applied Mathematics and
Computation, 185(1):689–694, 2007.
[20] Adam Gaier and David Ha. Weight agnostic neural networks. In Advances in Neural
Information Processing Systems 32, pages 5365–5379. Curran Associates, Inc., NY 12571,
USA, 2019.
[21] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward
neural networks. In Proceedings of the 13th International Conference on Artificial Intelligence
and Statistics (AISTATS), pages 249–256, 2010.
[22] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In
International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings
of Machine Learning Research, pages 315–323, 2011.
[23] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, Cambridge,
Massachusetts, 2016.
[24] Peter Grünwald and Teemu Roos. Minimum description length revisited. International
Journal of Mathematics for Industry, 11(01):1930001, 2019.
[25] Peter D. Grünwald. The Minimum Description Length Principle. Adaptive Computation
and Machine Learning series. The MIT Press, Cambridge, Massachusetts, 2007.
[26] Peter D. Grünwald and Nishant A. Mehta. A tight excess risk bound via a unified PAC-
Bayesian–Rademacher–Shtarkov–MDL complexity. In Aurélien Garivier and Satyen Kale,
editors, Proceedings of the 30th International Conference on Algorithmic Learning Theory,
volume 98 of Proceedings of Machine Learning Research, pages 433–465, 2019.
[27] Tomohiro Hayase and Ryo Karakida. The spectrum of Fisher information of deep networks
achieving dynamical isometry. In International Conference on Artificial Intelligence and
Statistics, pages 334–342, 2021.
[28] Masahito Hayashi. Large deviation theory for non-regular location shift family. Annals of
the Institute of Statistical Mathematics, 63(4):689–716, 2011.
[29] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers:
Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE
International Conference on Computer Vision (ICCV), pages 1026–1034, 2015.
[30] Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42,
1997.
[31] Harold Hotelling. Spaces of statistical parameters. Bull. Amer. Math. Soc, 36:191, 1930.
[32] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio.
Binarized neural networks. In Advances in Neural Information Processing Systems 29, pages
4107–4115. Curran Associates, Inc., NY 12571, USA, 2016.
[33] Varun Jain, Amrinder Pal Singh, and Rakesh Kumar. On the geometry of lightlike submani-
folds of indefinite statistical manifolds, 2019. arXiv:1903.07387 [math.DG].
[34] Ruichao Jiang, Javad Tavakoli, and Yiqiang Zhao. Weyl prior and Bayesian statistics.
Entropy, 22(4), 2020.
[35] Ryo Karakida, Shotaro Akaho, and Shun-ichi Amari. Universal statistics of Fisher information
in deep neural networks: Mean field approach. In International Conference on Artificial
Intelligence and Statistics, volume 89 of Proceedings of Machine Learning Research, pages
1032–1041, 2019.
[36] Ryo Karakida, Shotaro Akaho, and Shun-ichi Amari. Pathological Spectra of the Fisher In-
formation Metric and Its Variants in Deep Neural Networks. Neural Computation, 33(8):2274–
2307, 2021.
[37] David C Kay. Schaum’s outline of theory and problems of tensor calculus. McGraw-Hill,
New York, 1988.
[38] Andreı̆ Nikolaevich Kolmogorov. Sur la notion de la moyenne. G. Bardi, tip. della R. Accad.
dei Lincei, Rome, Italy, 1930.
[39] Osamu Komori and Shinto Eguchi. A unified formulation of k-Means, fuzzy c-Means and
Gaussian mixture model by the Kolmogorov–Nagumo average. Entropy, 23(5):518, 2021.
[40] Frederik Kunstner, Philipp Hennig, and Lukas Balles. Limitations of the empirical Fisher
approximation for natural gradient descent. In Advances in Neural Information Processing
Systems 32, pages 4158–4169. Curran Associates, Inc., NY 12571, USA, 2019.
[41] D.N. Kupeli. Singular Semi-Riemannian Geometry, volume 366 of Mathematics and Its
Applications. Springer, Netherlands, 1996.
[45] Wu Lin, Valentin Duruisseaux, Melvin Leok, Frank Nielsen, Mohammad Emtiyaz Khan, and
Mark Schmidt. Simplifying momentum-based positive-definite submanifold optimization
with applications to deep learning. In International Conference on Machine Learning, pages
21026–21050. PMLR, 2023.
[46] David J.C. MacKay. Bayesian methods for adaptive models. PhD thesis, California Institute
of Technology, 1992.
[47] James Martens. New insights and perspectives on the natural gradient method. Journal of
Machine Learning Research, 21(146):1–76, 2020.
[48] James A. Mingo and Roland Speicher. Free Probability and Random Matrices, volume 35 of
Fields Institute Monographs. Springer, New York, 2017.
[49] Noboru Murata, Shuji Yoshizawa, and Shun-ichi Amari. Network information criterion-
determining the number of hidden units for an artificial neural network model. IEEE
transactions on neural networks, 5(6):865–872, 1994.
[50] In Jae Myung, Vijay Balasubramanian, and Mark A. Pitt. Counting probability distributions:
Differential geometry and model selection. Proceedings of the National Academy of Sciences,
97(21):11170–11175, 2000.
[51] Mitio Nagumo. Über eine Klasse der Mittelwerte. In Japanese journal of mathematics:
transactions and abstracts, volume 7, pages 71–79. The Mathematical Society of Japan, 1930.
[52] Naomichi Nakajima and Toru Ohmoto. The dually flat structure for singular models.
Information Geometry, 4(1):31–64, 2021.
[53] Behnam Neyshabur, Srinadh Bhojanapalli, David Mcallester, and Nati Srebro. Exploring
generalization in deep learning. In Advances in Neural Information Processing Systems 30,
pages 5947–5956. Curran Associates, Inc., NY 12571, USA, 2017.
[54] Katsumi Nomizu, Nomizu Katsumi, and Takeshi Sasaki. Affine differential geometry:
geometry of affine immersions. Cambridge Tracts in Mathematics. Cambridge university
press, Cambridge, United Kingdom, 1994.
[55] A Emin Orhan and Xaq Pitkow. Skip connections eliminate singularities. In International
Conference on Learning Representations (ICLR), 2018.
[56] Vardan Papyan. Traces of class/cross-class structure pervade deep learning spectra. Journal
of Machine Learning Research, 21(252):1–64, 2020.
[57] Razvan Pascanu and Yoshua Bengio. Revisiting natural gradient for deep networks. In
International Conference on Learning Representations (ICLR), 2014.
[58] Jeffrey Pennington and Yasaman Bahri. Geometry of neural network loss surfaces via random
matrix theory. In International Conference on Machine Learning, volume 70 of Proceedings
of Machine Learning Research, pages 2798–2806, 2017.
[59] Jeffrey Pennington, Samuel Schoenholz, and Surya Ganguli. The emergence of spectral
universality in deep networks. In International Conference on Artificial Intelligence and
Statistics, volume 84 of Proceedings of Machine Learning Research, pages 1924–1932, 2018.
[60] Jeffrey Pennington and Pratik Worah. The spectrum of the Fisher information matrix of a
single-hidden-layer neural network. In Advances in Neural Information Processing Systems
31, pages 5410–5419. Curran Associates, Inc., NY 12571, USA, 2018.
[61] David Pollard. A note on insufficiency and the preservation of Fisher information. In From
Probability to Statistics and Back: High-Dimensional Models and Processes–A Festschrift in
Honor of Jon A. Wellner, pages 266–275. Institute of Mathematical Statistics, Beachwood,
Ohio, 2013.
[62] Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, and Jascha Sohl-Dickstein. On
the expressive power of deep neural networks. In International Conference on Machine
Learning, volume 70 of Proceedings of Machine Learning Research, pages 2847–2854, 2017.
[63] Calyampudi Radhakrishna Rao. Information and the accuracy attainable in the estimation
of statistical parameters. Bulletin of Cal. Math. Soc., 37(3):81–91, 1945.
[64] Calyampudi Radhakrishna Rao. Information and the accuracy attainable in the estimation
of statistical parameters. In Breakthroughs in statistics, pages 235–247. Springer, New York,
NY, 1992.
[65] J. Rissanen. Strong optimality of the normalized ML models as universal codes and information
in data. IEEE Transactions on Information Theory, 47(5):1712–1717, 2001.
[66] Jorma Rissanen. Modeling by shortest data description. Automatica, 14(5):465–471, 1978.
[67] Jorma Rissanen. Fisher information and stochastic complexity. IEEE Trans. Inf. Theory,
42(1):40–47, 1996.
[68] Levent Sagun, Utku Evci, V. Ugur Guney, Yann Dauphin, and Leon Bottou. Empirical
analysis of the Hessian of over-parametrized neural networks. In ICLR’18 workshop, 2018.
arXiv:1706.04454 [cs.LG].
[69] Salem Said, Hatem Hajri, Lionel Bombrun, and Baba C Vemuri. Gaussian distributions
on Riemannian symmetric spaces: statistical learning with structured covariance matrices.
IEEE Transactions on Information Theory, 64(2):752–772, 2017.
[70] Gideon Schwarz. Estimating the dimension of a model. Ann. Stat., 6(2):461–464, 1978.
[71] Y. M. Shtarkov. Universal sequential coding of single messages. Problems of Information
Transmission, 23(3):3–17, 1987.
[72] Alexander Soen and Ke Sun. On the variance of the Fisher information for deep learning. In
Advances in Neural Information Processing Systems 34, pages 5708–5719, NY 12571, USA,
2021. Curran Associates, Inc.
[73] Ke Sun and Frank Nielsen. Relative Fisher information and natural gradient for learning
large modular models. In International Conference on Machine Learning, volume 70 of
Proceedings of Machine Learning Research, pages 3289–3298, 2017.
[74] Junnichi Takeuchi and Shun-ichi Amari. α-parallel prior and its properties. IEEE Transactions on
Information Theory, 51(3):1011–1023, 2005.
[75] Philip Thomas. GeNGA: A generalization of natural gradient ascent with positive and negative
convergence results. In International Conference on Machine Learning, volume 32 (2) of
Proceedings of Machine Learning Research, pages 1575–1583, 2014.
[76] Guillermo Valle-Pérez, Chico Q. Camargo, and Ard A. Louis. Deep learning generalizes
because the parameter-function map is biased towards simple functions. In International
Conference on Learning Representations (ICLR), 2019.
[77] Christopher Stewart Wallace and D. M. Boulton. An information measure for classification.
Computer Journal, 11(2):185–194, 1968.
[78] Sumio Watanabe. Algebraic Geometry and Statistical Learning Theory, volume 25 of Cam-
bridge Monographs on Applied and Computational Mathematics. Cambridge University Press,
Cambridge, United Kingdom, 2009.
[79] Haikun Wei, Jun Zhang, Florent Cousseau, Tomoko Ozeki, and Shun-ichi Amari. Dynamics
of learning near singularities in layered networks. Neural computation, 20(3):813–843, 2008.
[80] Keisuke Yamazaki and Sumio Watanabe. Singularities in mixture models and upper bounds
of stochastic complexity. Neural networks, 16(7):1029–1038, 2003.
[81] Yuki Yoshida, Ryo Karakida, Masato Okada, and Shun-ichi Amari. Statistical mechanical
analysis of learning dynamics of two-layer perceptron with multiple output units. Journal of
Physics A: Mathematical and Theoretical, 2019.
[82] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Under-
standing deep learning requires rethinking generalization. In International Conference on
Learning Representations (ICLR), 2017.
[83] Tong Zhang. Information-theoretic upper and lower bounds for statistical estimation. IEEE
Transactions on Information Theory, 52(4):1307–1321, 2006.