A Survey of Dimension Estimation Methods
Abstract
It is a standard assumption that datasets in high dimension have an internal structure which means
that they in fact lie on, or near, subsets of a lower dimension. In many instances it is important to
understand the real dimension of the data, hence the complexity of the dataset at hand. A great variety
of dimension estimators have been developed to find the intrinsic dimension of the data but there is little
guidance on how to reliably use these estimators.
This survey reviews a wide range of dimension estimation methods, categorising them by the geo-
metric information they exploit: tangential estimators which detect a local affine structure; parametric
estimators which rely on dimension-dependent probability distributions; and estimators which use topo-
logical or metric invariants.
The paper evaluates the performance of these methods, as well as investigating varying responses
to curvature and noise. Key issues addressed include robustness to hyperparameter selection, sample
size requirements, accuracy in high dimensions, precision, and performance on non-linear geometries.
In identifying the best hyperparameters for benchmark datasets, overfitting is frequent, indicating that
many estimators may not generalise well beyond the datasets on which they have been tested.
1 Introduction
Data analysis in high-dimensional space poses significant challenges due to the “curse of dimensionality”.
However, the data are often believed to lie in a latent space with lower intrinsic dimension than that of the
ambient feature space. Techniques that estimate the intrinsic dimension of data inform how we visualise
data in biological settings [88,110,115], build efficient classifiers [38], forecast time series [83], analyse crystal
structures [80] and even assess the generalisation capabilities of neural networks [5,10,29,96]. Moreover they
have been proposed as being able to distinguish human written text from AI written text [105]. Accurately
identifying the intrinsic dimension of the dataset is also key to understanding the accuracy, efficiency and
limitations of dimensionality reduction methods [7, 75, 107].
We study how estimators respond to practical challenges such as choosing appropriate hyperparameters,
and the presence of noise and curvature in data. We do so by assessing their performance on a standard set of
benchmark manifolds described in [18, 44, 89]. The datasets encompass a large range of intrinsic dimensions
(1 to 70), codimensions (0 to 72), and ambient dimensions of the embedding space (3 to 96) as well as
different geometries (flat, constant curvature, variable curvature).
Dimension estimators have been regularly surveyed, reflecting both their importance and the productivity
of the field [15, 16, 18]. As it is now nine years since the last survey, a renewed picture of the field is
appropriate. Key developments include the growth of topological data analysis, which adds new types of
estimators. We pay particular attention to the geometric underpinnings of the different estimators, focussing on the geometric information used in the estimation process and the impact this has, rather than on the local, global, pointwise division. In carrying out this survey, we have also significantly extended the capability of the scikit-dimension package [6] by adding a wide variety of estimators.

Figure 1: Example of a cover and a refinement. There are points on the circle which are contained in three elements of the initial cover. After refining, every point is contained in at most two elements. It would not be possible to refine until each point was contained in only one element of the cover, so that dim = 1.
Algebraic (Hamel) Dimension Perhaps the simplest notion of dimension which can be defined is the
algebraic dimension of a vector space V , which is the cardinality of a basis B of V over the base field F .
Hausdorff Dimension We now specialise further to metric spaces. Let X be a metric space and fix d ≥ 0.
For each δ > 0, consider any countable cover {Ui } of X with diam(Ui ) < δ for all i, where diam(Ui ) denotes
the diameter of Ui . Define
$$H_\delta^d(X) = \inf\left\{ \sum_i \operatorname{diam}(U_i)^d \;:\; X \subset \bigcup_i U_i,\; \operatorname{diam}(U_i) < \delta \right\},$$
and then
$$H^d(X) = \lim_{\delta \to 0} H_\delta^d(X).$$
If d is too large, one finds $H^d(X) = 0$; if d is too small, $H^d(X) = +\infty$. The Hausdorff dimension of X is
equivalently the unique d0 for which
$$H^d(X) = \begin{cases} +\infty, & d < d_0, \\ 0, & d > d_0, \end{cases} \qquad H^{d_0}(X) \in [0, \infty].$$
This definition captures “fractional” scaling behaviour: sets like the Cantor set or Julia sets often have non-integer Hausdorff dimension $\dim_H$ [12].
Box-Counting (Minkowski-Bouligand) Dimension Let X ⊂ Rn . For each ϵ > 0, let N (ϵ) be the
minimum number of n-dimensional boxes (closed cubes) of side length ϵ required to cover X. Since the
covering elements here are restricted to equal-sized boxes, one records how N (ϵ) grows as ϵ → 0. The
box-counting dimension dB of X is defined as [33]:
$$d_B = \lim_{\epsilon \to 0} \frac{\log N(\epsilon)}{\log(1/\epsilon)}.$$
In other words, if N (ϵ) ∼ ϵ−d as ϵ → 0, then dB = d. Note that this is a special (and generally coarser)
case of the Hausdorff dimension, corresponding to restricting all coverings in the Hausdorff construction to
equal-sized boxes.
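In practice the limit is approximated over a finite range of scales. The following minimal numpy sketch (ours, not taken from any of the packages discussed later) counts occupied boxes at a handful of scales and regresses log N(ϵ) against log(1/ϵ); the chosen scales are illustrative and strongly affect the estimate.

```python
import numpy as np

def box_counting_dimension(X, scales):
    """Estimate the box-counting dimension of a point cloud X (n x D array):
    count occupied boxes at each scale and regress log N(eps) on log(1/eps)."""
    X = np.asarray(X, dtype=float)
    X = X - X.min(axis=0)                      # shift into the positive orthant
    counts = []
    for eps in scales:
        boxes = np.floor(X / eps).astype(int)  # box index of each point
        counts.append(len({tuple(b) for b in boxes}))
    slope, _ = np.polyfit(np.log(1.0 / np.asarray(scales)), np.log(counts), 1)
    return slope

# Points on a unit circle in R^3 should give an estimate close to 1.
theta = np.random.default_rng(0).uniform(0, 2 * np.pi, 2000)
circle = np.c_[np.cos(theta), np.sin(theta), np.zeros_like(theta)]
print(box_counting_dimension(circle, scales=[0.2, 0.1, 0.05, 0.025]))
```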
Correlation Dimension The correlation dimension $d_C$ is a way to measure the dimensionality of a space with a probability measure, particularly in applications like dynamical systems and data science. The
correlation integral captures the spatial correlation between points by finding the probability that a pair of
points (x, y) is separated by a distance less than ε. Let µ be a probability measure on M .
$$C(\epsilon) = \int_{M \times M} 1_{\lVert x - y \rVert < \epsilon} \, d\mu^{\otimes 2}, \qquad d_C = \lim_{\epsilon \to 0} \frac{\log C(\epsilon)}{\log \epsilon}.$$
This dimension estimates the scaling behaviour of point-pair correlations and is especially relevant for
analysing the structure of attractors in dynamical systems. It is bounded above by the information di-
mension, another probability-based dimension, which is in turn bounded above by the Hausdorff dimension
of M [42]. Since it is dependent on the measure µ, it takes account of the varying density of attractors.
Equivalence of dimensions for manifolds Consider the specific case of a smooth submanifold M d ⊂
RD . The tangent space at a point p ∈ M d , Tp M , is isomorphic to Rd and so will have algebraic dimension
d. The Lebesgue covering dimension of a smooth manifold of dimension d is d [108]. Endowing the manifold
with the Riemannian metric inherited from RD , we can also calculate its dimension as a metric space. The
Hausdorff dimension of a Riemannian manifold is the same as its topological dimension [74], in this case d.
We also have that M d ⊂ RD has box-counting dimension d [33]. If µ is a probability measure on M which
is absolutely continuous with respect to the volume measure, then the correlation dimension is the growth
rate of tubes around the diagonal submanifold in M × M , which can again be seen to be d (by a result of
Gray and Vanhecke reported in [43]). Hence we have that for a smooth submanifold of RD of dimension d
all notions of dimension described above coincide.
For a given dataset X ⊂ RD , the intrinsic dimension has been described as being the “minimum number
of parameters necessary to describe the data”, and presented as equivalent to the intrinsic dimensionality
of the data generating process itself [37]. However, regardless of what X is, it is always possible to embed
a 1-dimensional curve into RD in such a way that every data point in X lies on the curve (or, indeed, a
disconnected 0-manifold comprised of precisely the points of X).
When a very low-dimensional manifold of this type is used to model the data, the neighbourhood structure
of the data is not preserved. As a result, this representation of the data involves significant information loss.
Since the dataset does not come with a specification of the true latent neighbourhood structure, though, it
is not possible to determine precisely when a representation does or does not involve information loss.
One might attempt to define intrinsic dimensionality of a dataset instead as the smallest d such that
there is a d-dimensional space M which can be mapped into RD in a geometrically controlled way so that
every point of X lies in the image of M . The nature of the geometric control depends on what types of
generating process we might consider and is related to the question of what the neighbourhood structure of
X is. The weaker the geometric control is, the more embeddings are considered, until eventually every point
on X is permitted to lie on the image of a curve of possibly very high curvature.
If M is understood to be an embedded submanifold, then control on the reach of M is appropriate [59].
Indeed, as manifolds of lower reach are considered, the probability of error in estimating the dimension grows.
It is therefore not possible to assign an intrinsic dimension to a dataset. Only the underlying support of the
distribution has an intrinsic dimension and aspects of the geometry of this support and of the distribution
itself influence how challenging it is to estimate this dimension.
In this paper, we address the following core problem: construct a statistical procedure, an estimator
$$\hat{d}_N : \left(\mathbb{R}^D\right)^N \to \mathbb{R},$$
such that if the finite point cloud $X = \{x_1, x_2, \ldots, x_N\} \subset \mathbb{R}^D$ is independently identically distributed according to an unknown distribution supported on a subset M of true intrinsic dimension d then $\hat{d}_N$ satisfies the consistency requirement
$$\hat{d}_N(x_1, \ldots, x_N) \to d \quad \text{as } N \to \infty.$$
Equivalently, $\hat{d}_N$ must converge (in probability, or almost surely under stronger assumptions) to the true
dimension d as the sample size N grows. Beyond mere convergence, one typically seeks quantifiable rates of
convergence and robustness to perturbations—such as additive noise or non-zero curvature on M .
Many estimators which we will consider take a hyperparameter k which defines a local region in the
dataset. Consistency will sometimes be known under the additional assumptions that k → ∞ while k/N → 0.
The first condition requires that every local neighbourhood contains arbitrarily many points, while the second
requires the local neighbourhoods to be arbitrarily small in comparison to the global dataset.
This paper will consider, compare and contrast a variety of estimators, evaluating them with respect to
their computational feasibility, their consistency and convergence on a set of benchmark synthetic datasets
and their sensitivity to hyperparameter selection.
Uniform distributions of points in high-dimensional balls and cubes have been widely studied and demon-
strate a variety of interesting behaviours that create statistical challenges. This is discussed in more detail
in Section 2.6.
All of these phenomena both provide a compelling reason to carry out dimension reduction as well as
creating a challenge in estimating the dimension of higher-dimensional spaces.
1.4 Notation
In this paper, we assume that our finite sample lies on a smooth d-dimensional manifold M d ⊂ RD . Con-
cretely, let $M^d$ be a compact, smooth submanifold of $\mathbb{R}^D$, let $f : M^d \to \mathbb{R}$ be a probability density on $M^d$, and write
$$X = \{ x_1, x_2, \ldots, x_N \} \subset M^d$$
for a finite sample of size N, drawn i.i.d. from the distribution defined by f.
Many dimension estimation procedures are local in nature, so for each point p of interest we need to
describe a neighbourhood of p within X.
For any point p ∈ RD , let rk (p) denote the Euclidean distance from p to its kth nearest neighbour in X.
If p ∈ X, then by convention we do not count p itself as its own nearest neighbour, so that r1 (p) > 0. The
ratio between two nearest-neighbour distances will be denoted by
$$\rho_{i,j}(p) = \frac{r_i(p)}{r_j(p)}.$$
We write
$$\mathrm{knn}(p; k) = \{\, x \in X : \lVert x - p \rVert \le r_k(p) \,\}$$
for the set of k nearest neighbours of p in X. When the value of k is clear from context, we simply write
knn(p). For any ε > 0, define
$$B(p, \varepsilon) = \{\, x \in X : \lVert x - p \rVert < \varepsilon \,\}.$$
Thus B(p, ε) is the subset of X lying inside the open Euclidean ball of radius ε centered at p. Denote by
N (p, ε) the number of points in B(p, ε), not counting p itself.
Both knn(p) and B(p, ε) provide local neighbourhoods of p within X, which we will use for various local
dimension estimates.
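As a concrete illustration of this notation, the sketch below (assuming scikit-learn for the neighbour search; the function name is ours) computes $r_k(p)$, knn(p; k) and B(p, ε) for a sample point $p = x_i$, following the convention that p is not counted as its own neighbour.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def local_neighbourhoods(X, i, k, eps):
    """Return (r_k(p), knn(p; k), B(p, eps)) for the sample point p = X[i]."""
    p = X[i]
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)    # k+1: the query point is returned first
    dists, idx = nn.kneighbors(p.reshape(1, -1))
    dists, idx = dists[0, 1:], idx[0, 1:]              # drop p itself
    r_k = dists[-1]                                    # distance to the kth nearest neighbour
    knn_pts = X[idx]                                   # the set knn(p; k)
    ball_pts = X[np.linalg.norm(X - p, axis=1) < eps]  # B(p, eps); N(p, eps) = len(ball_pts) - 1
    return r_k, knn_pts, ball_pts
```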
2 Families of Estimators
The authors have systematically searched the literature to identify papers on dimension estimation pub-
lished after the most recent survey ( [16], which provides a good contextual overview of methods previous
to 2015/16), indexed by Google Scholar using the search terms “intrinsic dimension” and “dimension esti-
mation”. Another survey of previous techniques from the same time [54] focuses on fractal dimensions.
This search is complicated by the vast number of diverse disciplines which are interested in the problem.
The disparate nature of the work in the field means that new estimators are not always well contextualised
within the historical research. Practitioners often appear unaware of well-established techniques.
While we place a focus on papers published after 2015, we also seek to provide a description of the most
essential established techniques. This paper cannot describe every estimator in detail, and our reference list
may not give an entirely comprehensive description of recent developments. We encourage practitioners to
consider more recent estimators rather than just relying on those that are familiar.
The dimension estimators which have been described in the literature fall into several broad families
depending on which underlying geometric information is used to infer the intrinsic dimension.
2.1 Tangential Estimators
A small neighbourhood in a smooth manifold can always be viewed as the graph of a function in the following
way. Let p ∈ M d ⊂ RD . After translating and rotating, we may assume that p = 0 and Tp M = Rd . Then,
near p, the manifold is the graph of a function $g : \mathbb{R}^d \to \mathbb{R}^{D-d}$ with $dg(0) = 0$; equivalently, it is the image of the map $F : \mathbb{R}^d \to \mathbb{R}^D$, $F(x) = (x, g(x))$, whose differential at 0 is the standard embedding $\mathbb{R}^d \to \mathbb{R}^D$ along the first d co-ordinates.
The points in knn(p; k) or in B(p, ε), for sufficiently small k or ε, are therefore concentrated around an
affine subspace of RD which is the tangent space to the manifold. Many estimators seek to identify the
dimension of the manifold by finding this affine subspace. In this section we describe some of these, which
we refer to as tangential estimators.
Threshold parameters Some methods involve setting a parameter value which can be used to determine which eigenvalues of the local covariance matrix are “large”. As described below in Section 3.4, the output of the estimator can be very
sensitive to the selected value.
Fukunaga and Olsen [37] compare each eigenvalue to the largest, identifying an eigenvalue as “large” if
it exceeds a certain proportion of the largest one. This gives an estimated dimension of max{u : λu > αλ1 }
for some fixed parameter 0 < α < 1.
Fan et al. use two different thresholding methods and recommend using the lower of the two estimates [34]. One is to consider the “gap” $\lambda_u / \lambda_{u+1}$. If this number is large, there is little additional explained variance in adding a dimension, and a parameter β > 1 may be chosen so that the estimated dimension is $\min\{u : \lambda_u / \lambda_{u+1} > \beta\}$.
Another is based on the total proportion of the variance explained, so that the dimension is given by $\min\{u : \sum_{i=1}^{u} \lambda_i > \gamma \cdot \sum_{i=1}^{N} \lambda_i\}$ for some parameter 0 < γ < 1. This method has been validated by Lim et al. [68], who showed that with theoretical bounds on sampling the dimension is correctly estimated with high probability.
Kaiser takes a very simple approach to determining which eigenvalues are dominant, using only those
which are above the mean eigenvalue [52]. However, this can give very low estimates in cases of low codi-
mension. Jolliffe has suggested that this should be softened somewhat, to 70% of the mean [49]. Clearly any
other proportion could be used, so that this is a user-chosen parameter.
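For illustration, a minimal sketch of these threshold rules applied to the eigenvalues of a local covariance matrix is given below. The function names, rule labels and default parameter values are ours and do not reproduce the scikit-dimension implementations.

```python
import numpy as np

def lpca_eigenvalues(neighbourhood):
    """Eigenvalues of the local covariance matrix, in descending order."""
    eig = np.linalg.eigvalsh(np.cov(neighbourhood, rowvar=False))
    return np.sort(eig)[::-1]

def threshold_dimension(lam, rule="FO", alpha=0.05, beta=10.0, gamma=0.95):
    """Apply one of the eigenvalue-threshold rules discussed in the text."""
    lam = np.asarray(lam, dtype=float)
    if rule == "FO":        # Fukunaga-Olsen: eigenvalues above a fraction of the largest
        return int(np.sum(lam > alpha * lam[0]))
    if rule == "gap":       # first index u with a large gap lambda_u / lambda_{u+1}
        ratios = lam[:-1] / lam[1:]
        big = np.where(ratios > beta)[0]
        return int(big[0] + 1) if len(big) else len(lam)
    if rule == "variance":  # smallest u explaining a fraction gamma of the total variance
        return int(np.searchsorted(np.cumsum(lam) / lam.sum(), gamma) + 1)
    if rule == "Kaiser":    # eigenvalues above the mean eigenvalue
        return int(np.sum(lam > lam.mean()))
    raise ValueError(f"unknown rule {rule!r}")
```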
Probabilistic methods Other thresholding methods seek to label the dth eigenvalue as dominant if it is
sufficiently larger than it would be under a null hypothesis that the eigenvalues arise solely from random
variation. The challenge here is in selecting an appropriate distribution of eigenvalues to model the null
hypothesis.
Frontier uses the “broken stick” distribution [36]. This distribution is generated by drawing D−1 uniform
random variables on [0, 1] and arranging the lengths of the resulting D subintervals in descending order. The
assumption behind this method is that, if the data is unstructured, the eigenvalues will be distributed
according to the lengths of these subintervals. However, there does not appear to be any theoretical basis
for choosing this distribution.
A permutation test approach is given by Buja and Eyuboglu [14] in the context of PCA, but it can easily
be adapted for lPCA. If the local data is represented as a k × D matrix, each column can be reshuffled
independently to give an alternative dataset. Each feature has the same variance, but the internal structure
of the dataset has been removed. Carrying out PCA on these unstructured sets yields a null distribution for
the eigenvalues.
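A minimal sketch of this permutation test, adapted to a single local neighbourhood, might look as follows; the number of permutations, the quantile used as the null threshold, and the function name are illustrative choices rather than those of [14].

```python
import numpy as np

def permutation_dimension(neigh, n_perm=200, quantile=0.95, seed=0):
    """Estimate the local dimension by comparing the PCA spectrum of `neigh`
    (a k x D array) with a null distribution obtained by independently
    permuting each feature column."""
    rng = np.random.default_rng(seed)

    def spectrum(A):
        return np.sort(np.linalg.eigvalsh(np.cov(A, rowvar=False)))[::-1]

    observed = spectrum(neigh)
    null = np.empty((n_perm, neigh.shape[1]))
    for t in range(n_perm):
        null[t] = spectrum(np.column_stack([rng.permutation(col) for col in neigh.T]))
    thresholds = np.quantile(null, quantile, axis=0)
    exceeds = observed > thresholds
    # dimension = number of leading eigenvalues exceeding the null threshold
    return int(np.argmax(~exceeds)) if not exceeds.all() else neigh.shape[1]
```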
Minka [78] adopts a Bayesian model selection approach by maximising the Bayes evidence in fitting the
data to normal distributions of different dimensions, extending the maximum likelihood framework of [11].
Other methods Craddock and Flood [25] argue that eigenvalues corresponding to ‘noise’ should decay
in a geometric progression. Plotting log λi against i, these eigenvalues will therefore appear as a trailing
straight line, while dominant eigenvalues are those before the line. This gives a dimension estimate as the
value d such that λd+1 is the first point on the trailing straight line.
PCA based approaches have been augmented with an autoencoder for residual estimation [55]. For large
sample sizes, alternative methods of finding eigenvalues have been proposed [82]. Where missing features
pose a difficulty in using PCA, a method has been proposed in [88].
The conical dimension (CDim) [112] takes a slightly different approach – its value at p is the dimension
of the smallest affine subspace W through p so that the angle at p between each element of knn(p) and W is
less than π/4. This method is proven to be correct under conditions on the reach and the sampling density,
conditions which require an exponentially large sample size in each neighbourhood.
Comparing the number of sample points in balls of two radii gives the approximate relation
$$\frac{N(p, \varepsilon_1)}{N(p, \varepsilon_2)} \approx \left(\frac{\varepsilon_1}{\varepsilon_2}\right)^d$$
for $0 < \varepsilon_1 < \varepsilon_2$, which holds if $B(p, \varepsilon_2)$ is close enough to lying in an affine space and the density f can be assumed to be constant on $B(p, \varepsilon_2)$. This is the basis of the estimator in [35], which was later elaborated on in [9].
Volume-based estimators have also been developed for discrete spaces, such as lattices [73].
Correlation Integral The correlation integral CorrInt [42] is a global approach motivated by this geometry.
It is used in chaos theory to indicate the level of spatial correlation between points in the attractor. Let
$\Delta(\varepsilon) \subset \mathbb{R}^{2D}$ be the neighbourhood of the diagonal given by $\Delta(\varepsilon) = \{(x, y) : \lVert x - y \rVert < \varepsilon\}$. Then the function
$$C(\varepsilon) = \lim_{N \to \infty} \frac{1}{N^2} \sum_{i,j} 1_{\Delta(\varepsilon)}(x_i, x_j),$$
known as the correlation integral, counts the proportion of pairs of points lying within distance ε of each
other. The growth rate of this function for small ε, C(ε) ∼ εd , gives the correlation dimension of the
attractor. This depends on the measure on the attractor rather than on the geometry of its support.
The value can be estimated by assuming that the fixed finite sample size N is sufficiently large that
$$C(\varepsilon) \simeq \frac{1}{N^2} \sum_{i,j} 1_{\Delta(\varepsilon)}(x_i, x_j),$$
and then estimating the growth rate of this quantity, for example from the slope of $\log C(\varepsilon)$ against $\log \varepsilon$ over a suitable range of scales.
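For illustration, this finite-sample recipe can be written in a few lines (a sketch of ours, assuming scipy; the scales must be small relative to diam(X) but large enough that C(ε) > 0):

```python
import numpy as np
from scipy.spatial.distance import pdist

def correlation_dimension(X, eps_values):
    """Slope of log C(eps) against log eps for the empirical correlation integral.
    Using unordered pairs changes C(eps) only by a constant factor, which does
    not affect the slope."""
    d = pdist(X)                                         # all pairwise distances
    C = np.array([np.mean(d < eps) for eps in eps_values])
    slope, _ = np.polyfit(np.log(eps_values), np.log(C), 1)
    return slope
```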
The capacity dimension of a set S is defined as
$$\dim_{\mathrm{cap}}(S) = \lim_{r \to 0} \frac{\log N(r)}{\log(1/r)},$$
where N (r) is the r-covering number (the number of open balls of radius r required to cover set S). It is
worth noting that if both the topological dimension and capacity dimension exist for a given set, they are
equal to each other. For ease of computation, the r-packing number M (r) can be used instead of N (r).
Given a metric space X with distance metric d(·, ·), the set U ⊂ X is said to be r-separated if d(x, y) ≥ r for all distinct x, y ∈ U. The r-packing number M(r) of a set S ⊂ X is the maximum cardinality of an r-separated subset of S, and the capacity dimension can equivalently be computed as
$$\dim_{\mathrm{cap}}(S) = \lim_{r \to 0} \frac{\log M(r)}{\log(1/r)}.$$
As with CorrInt, this limit can only be approximated when using a finite sample. The dimension is estimated by a scale-dependent formula comparing packing numbers at two scales $r_1 < r_2$,
$$\hat{d} = -\frac{\log M(r_2) - \log M(r_1)}{\log r_2 - \log r_1}.$$
Because finding the exact value of M(r) is an NP-hard problem, a greedy estimate of this value is used.
Volumes through graphs The methods of Kleindessner and von Luxburg are designed to provide an
estimate without using distances; instead considering only the knn graph [60]. Since the ratio of the Lebesgue
measure of the ball of radius 1 centred at x ∈ X to the Lebesgue measure of the ball of radius 2 is an injective
function of dimension, a dimension estimate dˆDP (x) can be obtained from the ratio. Similarly, the volume
of the intersection of two balls of radius 1 with different centres can be compared to the unit ball, giving the
ratio
$$S(d) = I_{3/4}\!\left(\frac{d+1}{2}, \frac{1}{2}\right),$$
where I is the regularized incomplete beta function. Inverting this function gives another estimate. The idea for both methods can be seen in Figure 2.
To use these ratios to estimate dimension, construct the knn graph where points are connected by a
directed unweighted edge from i to j if and only if xj ∈ knn(xi ). Using the edge length metric de , the ball
of radius r is defined by Be (i, r) = {j ∈ V |de (i, j) < r} where de (i, j) is the distance from i to j in the knn
graph. They then consider the ratio of |Be (i, 1)| and |Be (i, 2)| and then average these ratios over a subset of
Figure 2: On the left the idea behind the doubling property estimator is represented. The ratio of the
number of points in the two balls grows with dimension. On the right, WODCap counts the number of points
in the intersection of the two balls. WODCap stands somewhat alone as being the only bi-local estimator
that we have found.
the points to recover a dimension estimate. A similar construction can be used for the intersection of two
unit balls with neighbouring centres, giving the WODCap estimator dˆCAP .
It has been shown that the sample version converges in probability to the theoretical construction as
n → ∞ in the special case where M d ⊂ Rd is a compact domain in Euclidean space such that the boundary
satisfies certain regularity assumptions and the density of M is bounded both above and below.
Theorem 2.1. [60] Let X = {x1 , . . . , xn } ⊂ M d ⊂ Rd be an i.i.d. sample from f and let G be the directed,
unweighted kNN-graph on X. Given G as input and a vertex i ∈ {1, . . . , n} chosen uniformly at random,
both dˆDP (xi ) and dˆCAP (xi ) converge to the true dimension d in probability as n → ∞ if k = k(n) satisfies
k ∈ o(n), log(n) ∈ o(k), and there exists k ′ = k ′ (n) with k ′ ∈ o(k) and log(n) ∈ o(k ′ ).
These estimators are perhaps the most sensitive to choice of neighbourhood size, as discussed later in
Section 3.4.2. The first method tends to vastly underestimate the dimension, especially with high dimensional
datasets. WODCap performs very closely to Levina and Bickel’s MLE [67] [60].
We refer the reader to [85, 86, 95] which further develops these approaches.
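To convey the idea, the simplified sketch below computes graph balls of one and two hops in the directed knn graph and converts the ratio of their sizes into a pointwise estimate via $|B(2r)|/|B(r)| \approx 2^d$. The estimator of [60] instead averages the ratios over points before inverting the exact finite-sample relationship, so this is only an approximation of their method.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def doubling_property_estimates(X, k=10):
    """Pointwise dimension estimates from the doubling property of the
    directed, unweighted knn graph (dense n x n hop matrix: illustration only)."""
    G = kneighbors_graph(X, n_neighbors=k, mode="connectivity")
    hops = shortest_path(G, method="D", unweighted=True)   # directed hop distances
    b1 = np.sum(hops <= 1, axis=1)    # points within one hop (includes the point itself)
    b2 = np.sum(hops <= 2, axis=1)    # points within two hops
    return np.log2(b2 / b1)           # pointwise estimates, e.g. to be averaged
```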
Modelling the number of sample points within distance r of a given $p \in X$ as a Poisson process whose rate depends on the dimension d, Levina and Bickel [67] derived a maximum likelihood estimate for d, which depends on how the neighbourhood of p is selected. Considering points in $B(p, \varepsilon)$, the maximum likelihood estimator for d is
$$\hat{d}_\varepsilon(p) = \left( \frac{1}{N(p, \varepsilon)} \sum_{j=1}^{N(p, \varepsilon)} \log \frac{\varepsilon}{r_j(p)} \right)^{-1},$$
where N (p, ε) is the number of points in B(p, ε). Similarly, considering points in knn(p; k), we obtain
$$\hat{d}_k(p) = \left( \frac{1}{k-1} \sum_{j=1}^{k-1} \log \rho_{k,j}(p) \right)^{-1}.$$
Normalising by k − 2 instead will make this estimator asymptotically unbiased (under the conditions $n, k \to \infty$ and $k/n \to 0$) and give variance $\frac{d^2}{k-3}$ [67].
To obtain a global estimate, Levina and Bickel used the arithmetic mean of these local estimates over
all samples. However, as noted by MacKay and Ghahramani [72], making the approximate assumption that
the rates at each point are independent allows one to write down a global likelihood function. The global
maximum likelihood estimator for the dimension is not the arithmetic mean of the local estimates, but the
harmonic mean. This adjustment provides significant improvements at small k. Levina and Bickel argue
that the spatial dependence does not prevent the variance of the global estimator from behaving as N −1 ,
since for fixed k this results in N/k roughly independent groups of points.
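A minimal sketch of the resulting global estimator, with the k − 2 normalisation and harmonic-mean aggregation, is given below (assuming scikit-learn for the neighbour search; boundary effects and duplicate points are ignored).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mle_dimension(X, k=20):
    """Levina-Bickel MLE with the (k-2) normalisation, aggregated over all
    points by the harmonic mean as suggested by MacKay and Ghahramani."""
    dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    r = dists[:, 1:]                                      # drop the zero self-distance
    # inverse local estimates: (1/(k-2)) * sum_j log(r_k / r_j), j = 1..k-1
    inv_local = np.sum(np.log(r[:, -1][:, None] / r[:, :-1]), axis=1) / (k - 2)
    return 1.0 / np.mean(inv_local)                       # harmonic mean of local estimates
```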
A Bayesian approach to this problem is described in [51].
Since the natural sizes of data clusters are typically unknown, it is preferable if the MLE approach can
operate on minimal neighbourhoods, so that the majority of data points within a neighbourhood genuinely
originate from the same local distribution. The TLE [4] method is an extension of the standard MLE estimator
specifically engineered for these tight neighbourhood scenarios.
Restricting neighbourhood size inherently limits the data available for estimation. To accurately estimate
local dimensionality, it is advantageous to utilize all pairwise distances among neighbours rather than relying
solely on distances from a central reference point. Denote the point at which the dimension estimation is
carried out by x ∈ X. The simplest application of this approach would involve computing the MLE for selected
points within the neighbourhood of x and averaging their results. However, simply aggregating estimators
presents two primary challenges: it can violate the locality condition, by adding additional neighbours, or it
can introduce a clipping bias, by restricting the data to the original neighbourhood. TLE seeks to address
these challenges.
In its basic variant, local dimension estimation is performed at a selected point x in a neighbourhood
V ⊂ X centred at q, as well as at reflections of these points x through q. At each of these points, a
dimension estimate is made using only elements from the neighbourhood of the point q, thereby preserving locality. To avoid clipping bias, the distances used are skewed to $d_{q,r}(x, v) = \frac{r\,(v - x)\cdot(v - x)}{2\,(q - x)\cdot(v - x)}$. Finally, these estimates
are aggregated by harmonic mean to yield an estimator of the following form:
$$\widehat{\dim}_{\mathrm{TLE}}(q) = -\left[ \frac{1}{2|V|(|V|-1)} \sum_{\substack{x, v \in V \\ x \neq v}} \left( \ln \frac{d_{q,r}(x, v)}{r} + \ln \frac{d_{x,r}(2q - x, v)}{r} \right) \right]^{-1}.$$
This approach relies on the fundamental assumption that local intrinsic dimensionality exhibits uniform
continuity across the data, given its use of dimension estimates for points within the neighbourhood.
The MiND ML [71,89] estimator, along with its derivatives MiND KL and IDEA, which are discussed later,
also builds on the work of Levina and Bickel while hewing closely to likelihood principles. The authors study
the ratio ρ1,k+1 (p) and calculate a maximum likelihood estimator as the root
$$\hat{d} = \left\{ d : \frac{N}{d} + \sum_{p \in X} \left( \log \rho(p) - (k-1)\, \frac{\rho^d(p) \log \rho(p)}{1 - \rho^d(p)} \right) = 0 \right\},$$
where we abbreviate $\rho(p) = \rho_{1,k+1}(p)$.
In case k = 1, this reduces to the harmonic mean of the Levina–Bickel estimator as recommended by MacKay
and Ghahramani.
Another maximum likelihood based estimator, GRIDE [26], uses only two neighbours. Unlike TwoNN,
which will be described later, these need not be the two nearest neighbours. The intention is that the
neighbours be sufficiently far away to overcome problems caused by noise. For two parameters n1 < n2 , the
ratio ρn2 ,n1 is studied. Note that the TwoNN method uses only ρ2,1 . The distribution of these ratios yields
a concave log-likelihood function which in general can be maximised by numerical optimization. However,
where n2 = n1 + 1 a closed form solution exists, which reduces to the standard MLE estimator for k = 2
in case n1 = 1. In these cases, the authors note that if the data are generated from a Poisson process then
the ratios follow the Pareto distribution. Furthermore, the ratios $\rho_{n_1+1, n_1}(p)$ for each $n_1$ are jointly
independent. This allows for uncertainty quantification and, if data over a range of n1 are considered, a
reduction in the variance of the estimator.
A method for adaptively choosing the neighbourhood size to carry out maximum likelihood estimation
is described in [27] while Amsaleg et al . demonstrate an approach using methods from extreme value theory
in [3].
Fitting distributions Building on the premises of Levina and Bickel’s work, Facco et al . [32] developed
the TwoNN method. This uses only distance information to the two nearest neighbours of each point. By
using only the two smallest distances, the neighbourhoods involved can be assumed to be closer to Euclidean
space and the probability distribution need only be approximately constant on the scale r2 (p). The ratio
ρ2,1 (p) is the local statistic of interest here. Assuming that ρ2,1 is independently identically distributed at
each p, and letting F be the cumulative distribution function (CDF) of the distribution, we have
$$-\frac{\log\!\left(1 - F(\rho_{2,1})\right)}{\log(\rho_{2,1})} = d.$$
The observed ratios provide an empirical CDF which is used to estimate the intrinsic dimension by linear
regression, fitting a line through the origin. The authors recommend truncating this data before fitting the
line, as the higher values of the CDF tend to be noisier.
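A hedged sketch of this fit is given below; the fraction of ratios discarded before the regression is a user choice (10% here), and the implementation is ours rather than that of [32].

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def twonn_dimension(X, discard=0.1):
    """TwoNN: regress -log(1 - F(rho)) on log(rho) through the origin,
    after discarding the largest `discard` fraction of the ratios."""
    dists, _ = NearestNeighbors(n_neighbors=3).fit(X).kneighbors(X)
    rho = np.sort(dists[:, 2] / dists[:, 1])        # r_2(p) / r_1(p) at each point
    n = len(rho)
    F = np.arange(1, n + 1) / n                     # empirical CDF
    keep = int(np.floor(n * (1.0 - discard)))       # drop the noisy upper tail
    x, y = np.log(rho[:keep]), -np.log(1.0 - F[:keep])
    return float(np.sum(x * y) / np.sum(x * x))     # least-squares line through the origin
```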
Rozza et al . claim that, for large d, the value of k needed for the data to closely match the theoretical
distribution grows exponentially with d [89]. The grounds for this claim appear weak, based on an assessment
of the probability of a point lying near the centre of its neighbourhood ball when, in fact, that event is already
conditioned on. It is nevertheless true that for many datasets a large value of n is needed to ensure that a
knn neighbourhood does not meet the boundary of the manifold, so that the methods of this paper are still
relevant.
To address the possibility that the underlying assumptions of maximum likelihood estimators are far from being true, one proposal, MiND KL, is to compare the distances to neighbours not to the theoretical distribution, but to an empirical distribution obtained from the distances to neighbours in a random sample of N points from a unit ball of each dimension. The dimension estimate is obtained by using the Kullback–Leibler divergence to select the dimension which most closely matches the data. This method is less well motivated when M is not geometrically similar to a unit ball.
Other distance-based approaches Another proposal from the same paper, IDEA, uses that $d = \frac{m}{1-m}$, where m is the expected norm of a vector sampled uniformly from the ball. Estimating m by
$$\hat{m} = \frac{1}{Nk} \sum_{i=1}^{N} \sum_{j=1}^{k} \rho_{j,k+1}(x_i),$$
they use
$$\hat{d} = \frac{\hat{m}}{1 - \hat{m}}$$
as a consistent estimator of d. The authors note an underestimation bias, which they attribute to small sample size, but which may also be due to the convexity of the function $\frac{x}{1-x}$ on (0, 1). The bias in the estimator is corrected using a jackknife technique: subsamples are generated where each point is included with probability p and the number of nearest neighbours is reduced to $k\sqrt{p}$, so that increasing p emulates a situation where $k, N \to \infty$ and $k/N \to 0$. The resulting dimension estimators are fitted to a curve
$$\hat{d} = a_0 - \frac{a_1}{\log_2\!\left(\frac{pN}{a_2} + a_3\right)}.$$
The horizontal asymptote a0 is used as the estimate (unless a1 < 0, indicating that dˆ in fact decreases with
p; in this case the original estimate dˆ for the entire dataset is preferred).
The earlier work of Pettis et al. [84] is also relevant here. These authors directly calculated the distribution of $r_k(x)$ and from this obtained that the expected value of the mean distance from x to knn(x) is $G_{k,d}\, k^{1/d}\, C_n$, where $G_{k,d} = \frac{k^{1/d}\,\Gamma(k)}{\Gamma(k + 1/d)}$ and $C_n$, while sample dependent, is independent of k. Taking logarithms yields a regression problem
$$\log \bar{r}_k - \log G_{k,d} = \frac{1}{d} \log k + \log C_n,$$
where the slope is $\frac{1}{d}$. On the left hand side of this equation, while $\bar{r}_k = \frac{1}{k} \sum_{j=1}^{k} r_j(x)$ is fixed, the value of G depends explicitly on d. An iterative regression method is used, where log G is initially taken to be 0,
allowing the regression to be carried out, and the resulting value of d is then used to recalculate log G, until
the scheme converges. This was later refined further by Verveer and Duin [109].
Other estimators in this general category include [46].
Let the vector $\hat{\theta}_i$ be given by the angles subtended at $x_i$ by each pair of points in $\mathrm{knn}(x_i)$. As each component of $\hat{\theta}_i$ follows a von Mises–Fisher distribution with parameters ν and τ, at each point $x_i$ the parameters can be estimated with a maximum likelihood approach. This yields vectors $\hat{\nu} = (\hat{\nu}_i)_{i=1}^{N}$ and $\hat{\tau} = (\hat{\tau}_i)_{i=1}^{N}$ with means $\bar{\nu}$ and $\bar{\tau}$.
The statistics $\hat{d}_{\mathrm{ML}}$, $\bar{\nu}$ and $\bar{\tau}$ are then compared with the statistics from a family of datasets with a known intrinsic dimension (in this case a uniform sample of N points from $S^n$ for $1 \le n \le D$) to provide an estimate; this is the approach taken by DANCo.
Another method based on angles is the Expected Simplex Skewness method, ESS. This method has two
variants: ESSa and ESSb. We describe only ESSa in detail. Consider knn(p) ⊂ X and a target dimension,
m. We may form (m + 1)-simplices with one vertex at the centroid of knn(p) and the others at m + 1 points
drawn from knn(p). A comparison simplex can be formed where at one vertex all edges meet orthogonally,
and the side lengths of those edges are the same as the side lengths incident to the centroid. For each
simplex, a statistic referred to as “simplex skewness” is obtained as the ratio of the volume of the simplex
to this comparison simplex.
For example, in the case of m = 1 the area of a triangle with one vertex at the centroid is compared to
the area of a right triangle with short side lengths equal to the lengths of the edges incident to the centroid.
The simplex skewness for this 2-simplex is simply sin(θ), where θ is the angle at the centroid.
For uniformly distributed points in a d-dimensional ball $B^d$ with volume measure µ, the expected simplex skewness in case m = 1 is
$$s_d^{(1)} = \frac{1}{V_d} \int_{B^d} |\sin(\theta(x))| \, d\mu(x),$$
where Vd is the volume of the unit ball and θ(x) is the angle at the centre of the ball between x and any
fixed co-ordinate axis.
Similarly, for m > 1, we obtain
$$s_d^{(m)} = \frac{1}{V_d^m} \int_{(B^d)^m} \frac{\lVert u \wedge v_1 \wedge \cdots \wedge v_m \rVert}{|v_1| \cdots |v_m|} \, d\mu(v_1)\, d\mu(v_2) \cdots d\mu(v_m) = \frac{\Gamma(\tfrac{d}{2})^{m+1}}{\Gamma(\tfrac{d+1}{2})^m\, \Gamma(\tfrac{d-m}{2})},$$
where u is a unit vector along a chosen coordinate axis.
When simplex edges are short, the precise position of the centroid has a major influence on the skewness.
To mitigate this, the empirical estimate $\hat{s}^{(m)}$ is given by a weighted mean of the skewnesses of each simplex, where the weights are the products of the edge lengths incident to the centroid. Since the $s_d^{(m)}$ increase
monotonically to 1 as d → ∞, by comparing the skewness of simplices in the data to the expected simplex
skewness for balls of varying dimensions d, a dimension estimate can be obtained by linear interpolation [48].
The alternative method ESSb takes the unit vector in each edge direction and estimates the dimension
from the projections of each edge direction onto the others, using a similar process. A simplex-based approach was
also used in [21].
Given the relationship between zeroth-dimension persistent homology and minimum spanning trees, it is
natural to generalise this notion to higher dimension homological features captured by X. We let PDi (X)
denote the finite part of the persistence diagram of the i-th Vietoris-Rips or Čech persistent homology of X.
We let the α-total persistence of the persistence diagram be
$$E_\alpha^i(X) = \sum_{(b,d) \in \mathrm{PD}_i(X)} (d - b)^\alpha \qquad (2)$$
In particular if we consider dimension i = 0, we have Eα (X) := Eα0 (X). Some relationships also hold when
i > 0; see [93, 94] for details.
In [1, 10, 23], it is proposed that the intrinsic dimension of data X be estimated by taking various sub-
samples X′ ⊂ X, and computing the slope m of the curve log(Eα0 (X′ )) as a function of log(|X′ |). The inferred
dimension is then given by
$$\hat{d} = \frac{\alpha}{1 - m}. \qquad (3)$$
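The sketch below implements this slope-inference recipe using the minimum spanning tree directly (for the Vietoris–Rips filtration, the finite H0 persistences are exactly the MST edge lengths). The subsample sizes, and taking a single subsample per size, are illustrative simplifications; it is assumed that X contains at least max(sizes) points.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def e_alpha_mst(X, alpha):
    """E^0_alpha(X): sum of minimum spanning tree edge lengths to the power alpha."""
    mst = minimum_spanning_tree(squareform(pdist(X)))
    return float(np.sum(mst.data ** alpha))

def ph0_dimension(X, alpha=1.0, sizes=(200, 400, 800, 1600), seed=0):
    """Regress log E_alpha on log n over subsamples; return alpha / (1 - slope), eq. (3)."""
    rng = np.random.default_rng(seed)
    logE = [np.log(e_alpha_mst(X[rng.choice(len(X), n, replace=False)], alpha))
            for n in sizes]
    slope, _ = np.polyfit(np.log(sizes), logE, 1)
    return alpha / (1.0 - slope)
```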
There is considerable literature on the study of minimum spanning trees that justifies this estimator. In the
study of minimum spanning trees on point sets in $[0, 1]^d$, it has been a folklore theorem that the growth rate of $E_\alpha$ with the number of points is at most $O(n^{\max(1 - \alpha/d,\, 0)})$ for α ∈ (0, d) [61]. In other words, for higher
dimensional cubes d > α, the α-weight Eα can grow sublinearly with n, while for lower dimensional cubes
d ≤ α, the α-weight Eα does not grow with n at all. This motivated the definition of the minimum spanning
tree dimension of a bounded metric space X in [61] as
$$\dim_{\mathrm{MST}}(X) := \inf\{\alpha > 0 : E_\alpha \text{ is uniformly bounded over the finite subsets of } X\}. \qquad (4)$$
The analogous quantity for degree-i persistent homology is
$$\dim^i_{\mathrm{PH}}(X) := \inf\{\alpha > 0 : E^i_\alpha \text{ is uniformly bounded over the finite subsets of } X\}. \qquad (5)$$
By definition it follows that $\dim_{\mathrm{MST}}(X) = \dim^0_{\mathrm{PH}}(X)$. In Theorem 2 of [61], it was shown that for any bounded metric space X, the minimum spanning tree dimension recovers the upper box counting dimension:
$$\dim_{\mathrm{MST}}(X) = \overline{\dim}_{\mathrm{box}}(X). \qquad (6)$$
Abstractly defined, neither eq. (4) nor eq. (5) is amenable to computation. Adams et al. [1] obtained a lower bound on $\dim_{\mathrm{MST}}(X)$ based on results of Schweinhart [93] by extending the $O(n^{1-\alpha/d})$ bound on the growth of $E_\alpha$ to any bounded metric space X. For $\alpha \in (0, \dim_{\mathrm{MST}}(X))$, and any $\{x_1, \ldots, x_n\} \subset X$, there is some constant $C_{\alpha,d}$ such that
$$\log E_\alpha(\{x_1, \ldots, x_n\}) \le \left(1 - \frac{\alpha}{\dim_{\mathrm{MST}}(X)}\right) \log n + C_{\alpha,d}. \qquad (7)$$
This implies for any sequence (xi ) of distinct elements of X, we can obtain a lower bound on the upper box
dimension and minimum spanning tree dimension by
$$\dim_{\mathrm{MST}}(X) \ge \frac{\gamma}{1 - \beta}, \quad \text{where } \beta = \limsup_{n \to \infty} \frac{\log E_\gamma(\{x_1, \ldots, x_n\})}{\log n}. \qquad (8)$$
The probabilistic properties of the lower bound γ/(1 − β) in eq. (8) have also been widely studied. Given a probability measure µ with bounded support on a metric space X and a parameter γ > 0, [93] defines the persistent homology dimension $\dim^0_{\mathrm{PH}}(\mu; \gamma)$ of the metric measure space (X, µ) via the analogous growth rate of $E_\gamma$ over i.i.d. samples drawn from µ. We recall the classic result for probability measures on Euclidean space by
Beardwood, Halton and Hammersley [8], and Steele [99], which implies dim0PH (µ; γ) = d for µ an absolutely
continuous measure on Rd , and γ ∈ (0, d).
Theorem 2.2. Let µ be a probability measure on Rd with compact support, and f be the density of its
absolutely continuous part. If x1 , . . . , xn are i.i.d. samples from µ, then with probability 1,
$$\lim_{n \to \infty} \frac{E_\alpha(\{x_1, \ldots, x_n\})}{n^{1-\alpha/d}} = c(\alpha, d) \int_{\mathbb{R}^d} f(x)^{1-\alpha/d} \, dx, \qquad (10)$$
for α ∈ (0, d). Here c(α, d) > 0 is a constant that only depends on α, d.
Refinements of Theorem 2.2 have been proved for special cases. If µ is the uniform distribution on the
d-dimensional unit cube, Kesten and Lee [58] proved a central limit theorem for Eα : for α > 0, we have a
convergence in distribution
$$\sqrt{n}\left( \frac{E_\alpha(\{x_1, \ldots, x_n\})}{n^{1-\alpha/d}} - \mu \right) \to N(0, \sigma^2_{\alpha,d}). \qquad (11)$$
Costa and Hero [24] generalise Theorem 2.2 from $\mathbb{R}^D$ to compact Riemannian manifolds of dimension d, with f being a bounded density relative to the volume measure of the manifold. In [23] they also observe that the right hand side of the limit in eq. (14) is in fact related to Rényi entropy.
Recently, [93] proved a generalisation of Theorem 2.2 for d-Ahlfors regular measures on a metric space. A measure µ is d-Ahlfors regular if there are real, positive constants c, δ0 > 0 such that for all δ ∈ (0, δ0) and x ∈ X, the measure of the open ball of radius δ at x is bounded by
$$\frac{1}{c}\,\delta^d \le \mu(B_\delta(x)) \le c\,\delta^d. \qquad (12)$$
This condition is satisfied, for example, for uniform measures on a d-dimensional manifold. Theorem 3
of [93] implies for γ ∈ (0, d), the persistent homology dimension of a d-Ahlfors regular measure recovers the
dimension parameter d:
dim0PH (µ; γ) = d.
Remark 2.3. We note that other aspects of the minimum spanning tree are related to dimension: for samples from a d-Ahlfors regular measure, [62] showed that the maximum distance along an edge of its minimum spanning tree scales as $((\log n)/n)^{1/d}$.
Remark 2.4. For data sampled from a submanifold, [23] proposed using an Isomap-based geodesic distance estimation for the construction of the minimum spanning tree.
In practice, since constructing a minimum spanning tree on n points with full metric data is computa-
tionally onerous, a simplification can be made by taking the k-nearest neighbour graph and computing the
minimum spanning tree of the knn graph instead to speed up the computation.
Let
$$L_1^k(\{x_1, \ldots, x_n\}) = \sum_{i=1}^{n} \sum_{x_j \in \mathrm{knn}(x_i; k)} \lVert x_i - x_j \rVert \qquad (13)$$
be the total edge length of the knn graph. Like the total weight $E_\alpha$ of its minimum spanning tree, the total edge length $L_1^k$ of the knn graph satisfies suitable additive properties such that the Euclidean functional theory as described in [114] applies. Thus, a similar asymptotic result can be derived [114, Theorem 8.3]:
Theorem 2.5. Let µ be a probability measure on Rd with compact support where d ≥ 2, and f be the density
of its absolutely continuous part. If x1 , . . . , xn are i.i.d. samples from µ, then
$$\frac{L_1^k(\{x_1, \ldots, x_n\})}{n^{1-1/d}} \xrightarrow{\;P\;} c(k, d) \int_{\mathbb{R}^d} f(x)^{1-1/d} \, dx \quad \text{as } n \to \infty. \qquad (14)$$
Given this theoretical characterisation, [22] proposes an estimator KNN using a similar strategy to the estimation of $\dim_{\mathrm{MST}}$ and $\dim^0_{\mathrm{PH}}$, as expressed in eq. (3). Given data X, we compute the total length $L_1^k(\mathrm{X}')$ of the k-nearest neighbour graphs of subsamples X′ ⊂ X, and infer the slope m of the curve $\log(L_1^k(\mathrm{X}'))$ as a function of log(|X′|):
$$\hat{d} = \frac{1}{1 - m}. \qquad (15)$$
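The same recipe with the knn-graph total edge length in place of the MST weight gives a sketch of the KNN estimator; as before, the subsampling scheme shown here is a simplification of that used in [22].

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_total_length(X, k=5):
    """L_1^k(X): total edge length of the directed k-nearest-neighbour graph."""
    dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    return float(dists[:, 1:].sum())                  # drop the zero self-distances

def knn_dimension(X, k=5, sizes=(200, 400, 800, 1600), seed=0):
    """Regress log L_1^k on log n over subsamples; return 1 / (1 - slope), eq. (15)."""
    rng = np.random.default_rng(seed)
    logL = [np.log(knn_total_length(X[rng.choice(len(X), n, replace=False)], k))
            for n in sizes]
    slope, _ = np.polyfit(np.log(sizes), logL, 1)
    return 1.0 / (1.0 - slope)
```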
Remark 2.6 (Error propagation). Both $\hat{d}_{\mathrm{knn}}$ and $\hat{d}_{\mathrm{PH}}$ are subject to instability when the intrinsic dimension is high. For high dimensions, the regressed slope m is just under one, $m = 1 - 1/d$; any small error ϵ in the slope can cause a large change in the estimated dimension:
$$m \mapsto m + \epsilon \implies \hat{d} \mapsto \frac{d}{1 - \epsilon d}.$$
The dimension estimate is severely impacted once |ϵ| is comparable to 1/d; for datasets with high intrinsic dimension the error tolerance in the inference of the slope becomes smaller. In particular, if the slope is over-estimated so that ϵd ≳ 1, the dimension estimate can even become very negative. This is reflected in the performance of KNN on M10d Cubic with intrinsic dimension 70 (see Appendix A).
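As a numerical illustration of this sensitivity (our own arithmetic): with d = 70 the true slope is m = 1 − 1/70 ≈ 0.986, so an error of ϵ = 0.005 in the fitted slope already gives $\hat{d} = 70/(1 - 0.35) \approx 108$, while ϵ = 0.02 gives $70/(1 - 1.4) = -175$.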
Remark 2.7 (Robustness to outliers). Given a probability measure on an embedded manifold, we can regard
the absolutely continuous part of the measure (w.r.t. the density of the underlying manifold) as the ‘signal’,
while the singular part can be regarded as a model for outliers. While the minimum spanning tree and
k-nearest neighbour graphs are individually sensitive to outliers in data, Theorems 2.2 and 2.5 suggest that the scaling of the total edge lengths of such objects with increasing sample size should be robust against outliers, as the scaling should only depend on the absolutely continuous part of the measure. Nonetheless, in practice, if the measure of the absolutely continuous part is small (i.e. the outliers dominate the sample), then it may take a large number of samples for the asymptotic scaling behaviour of $E_\alpha$ and $L_1^k$ to emerge.
2.3.4 Analysis
In our experiments, PH0 dimension performs comparably with other estimators, yet is susceptible to over-
estimation in the presence of noise (see discussion in Section 3.5). However, magnitude dimension requires
a lot of points for even spheres of moderate dimension (see Table B1), making it an unreliable estimator.
Estimators
Local: lPCA, MLE, WODCap, ESS
Global: PH0, mag, KNN, GRIDE, TwoNN, DANCo, MiND ML, CorrInt
Table 1: A classification of estimators considered in our experiments into global and local estimators.
An alternative categorisation of estimators divides them into local and global methods, with local esti-
mators providing local dimension estimates which are then combined, while global estimates provide a single
value.
In truth, any reliable estimator for the dimension of a manifold is likely to first obtain local information
of some sort, aggregate this information, and then convert it into a dimension estimate. The estimators
usually known as “local estimators” are, from this point of view, those for which the local information is
already a dimension estimate.
In this section, we review some of the estimators already discussed above which might be described as
“global estimators”.
The method of Fan et al . [34] identifies different eigenvalues for each neighbourhood from lPCA. These
local eigenvalues are then combined before the thresholding method is applied to the combined data. In this
way, despite the method being a variant of “local PCA”, it does not directly produce local dimensions, and
so is strictly speaking a global method. The architecture of the scikit-dimension package does not easily
allow for implementation of this method and so the results as presented do not reflect the original method as
designed by Fan et al . Instead the thresholding method is applied locally, converting it into a local method.
Note that any lPCA method could be adapted to function in the same way, with thresholding being applied
to these “global eigenvalues”.
CorrInt operates in a similar spirit. A simple local estimate of dimension at a point p ∈ X is given by $\frac{\log N(p, r_2) - \log N(p, r_1)}{\log r_2 - \log r_1}$ for two radii $r_1 < r_2$, under the assumption that N(p, r) is proportionate to the volume of the ball of radius r, and that this grows as $r^d$. However, for CorrInt, rather than combining these local
values of d, the volumes are combined over all balls, and a single estimate of d is then calculated.
TwoNN considers the entire distribution of local statistics to obtain a global value. At each point,
the ratios ρ2,1 are considered. A local dimension estimate could be obtained from this ratio, but instead
the empirical distribution of these ratios across all p ∈ X is considered and compared to the theoretical
distribution, which is dependent on d. This approach seems to result in a relatively high standard deviation
compared to other estimators.
The estimators described in Section 2.3 might appear to be global in that a global geometric object,
usually a graph of some kind, is constructed, and then a single metric invariant is extracted from it. How-
ever, even here, the analysis of the estimators to demonstrate their convergence turns out to rely on an
understanding of how they behave locally, which can then be used to infer the global behaviour via some
additivity condition.
Truly global estimators generally assume some global structure in the data. For example, a direct
application of PCA can be used to estimate dimension, with the assumption here being that the data lies in
an affine subspace. The MiND KL estimators are also truly global, making the global assumption that the
distribution of distances behaves similarly to that in a ball.
The WODCap method stands somewhat alone, in that it is essentially a bilocal estimator. Two points
are used as the centres of intersecting balls, and the resulting estimate of dimension is an estimate at this
pair of points.
2.6 Underestimation
Underestimation of dimension has been widely reported, especially for higher-dimensional datasets, and is
observed again in this survey. This is commonly attributed to two possible causes: the “boundary effect” or “edge effect”, with observations of this dating back at least to [84], and a shortage of samples.
However, caution is warranted when considering how well this observation generalises beyond the bench-
marking set. We find that, when using datasets sampled from SO(n), overestimation is common.
2.6.2 Sample size effects
Another claimed source of negative bias is the insufficiency of data. For example, it has been claimed
that the original version of the CorrInt estimator [42] can only provide estimates $\hat{d} < \frac{2 \log N}{\log \operatorname{diam}(X) - \log \varepsilon}$, where
ε ≪ diam(X) is the smallest radius considered [30]. However, this calculation is based on an assumption
that the volume of balls grows as rd for all values of r up to diam(X), which does not generally hold.
Assuming that a given estimator is asymptotically correct, it may be possible to use the given sample to
estimate the value it would take on an arbitrarily large sample. The IDEA estimator [90] attempts this by
using a jackknife subsampling method and fitting the estimates for subsamples to a curve with a horizontal
asymptote.
For small sample sizes, in high dimensions, most points will be linearly separable from the rest of the data.
This means that the underlying geometric hypotheses do not hold at most points, so that it is reasonable
to expect significant difficulties in estimating dimension. Sensitivity to sample size will be discussed in the
analysis of our experimental results in Section 3.3.
3.2 Datasets
We use a now standard collection of datasets for benchmarking purposes [18, 44, 89], with a small number of
additions. These datasets are readily generated in scikit-dimension. The datasets encompass a large range
of dimensions (1 to 70), codimensions (0 to 72) and geometries (flat, constant curvature, variable curvature).
We should note that not all of the datasets are drawn from uniform distributions on their manifolds. Each
underlying manifold is diffeomorphic to either a sphere or a cube. We give a brief description of each in
Table 2.
For some purposes we have also considered the standard matrix embedding of SO(n) in $\mathbb{R}^{n \times n}$, though we have not fully benchmarked this dataset. This produces a homogeneous manifold of dimension $\frac{n(n-1)}{2}$, not lying in any affine subspace, and with a topology different from the other benchmark datasets. We feel that these datasets would be good additions to the benchmark manifolds of [6], as including manifolds with known geometric and topological information will increase knowledge of where specific estimators work well.

Dataset | d | D | Description
M1 Sphere | 10 | 11 | Uniform distribution on a round sphere
M2 Affine 3to5 and M9 Affine | 3, 20 | 5, 20 | Affine subspaces
M3 Nonlinear 4to6 | 4 | 6 | Nonlinear manifold, could be mistaken to be 3d
M4, M6 and M8 Nonlinear | 4, 6, 12 | 8, 36, 72 | Nonlinear manifolds generated from the same function
M5a Helix1d | 1 | 3 | A 1d helix
M5b Helix2d | 2 | 3 | Helicoid
M7 Roll | 2 | 3 | Classic swiss roll
M10a,b,c,d Cubic | 10, 17, 24, 70 | 11, 18, 24, 72 | Hypercubes
M11 Moebius | 2 | 3 | The 10 times twisted Moebius band
M12 Norm | 20 | 20 | Isotropic multivariate Gaussian
M13a Scurve | 2 | 3 | Surface in the shape of an “S”
M13b Spiral | 1 | 13 | Helix curve in 13 dimensions
Mbeta | 10 | 40 | Generated with a smooth nonuniform pdf
Mn1 and Mn2 Nonlinear | 18, 24 | 72, 96 | Nonlinearly embedded manifolds
Mp1, Mp2 and Mp3 Paraboloid | 3, 6, 9 | 12, 21, 30 | Nonlinearly embedded paraboloids
Table 2: The benchmark datasets, with intrinsic dimension d and ambient dimension D.
For our experiments investigating the effects of noise and curvature, we consider a collection of datasets
where we have a good control and understanding of their dimensions and geometry. These include a torus
of revolution in R3 (which has no boundary) as well as families of paraboloids with varying curvature shown
in Figures 14, 12 and 13.
Dependency on a tailored choice of hyperparameters In our experiments, we varied the hyperpa-
rameters of estimators around those used in the original papers, or the defaults in the scikit-dimension
implementation, as reported in Appendix D. We compared the performance of the best possible estimate
an estimator can achieve with optimal hyperparameters within our range, and the estimate achieved by
choosing hyperparameters that ensure good performance on most of the datasets in the benchmark set (this hyperparameter choice is defined precisely in Appendix C). A large discrepancy in results between these two choices indicates that an estimator needs tailored choices of hyperparameters to achieve optimal results. This is an undesirable effect.
For the collection of “slope-inference” based global estimators considered here – KNN, PH0 , TwoNN,
and GRIDE – this discrepancy is often small, especially on low dimensional datasets without complicated
nonlinearities. This may be due to the fact that there are only one or two hyperparameters for these
estimators, far fewer compared to others on our list. There seems to be a locality bias for these estimators.
For KNN, the close to optimal performance (on low dimensional datasets without complicated nonlinearities)
can be reached by choosing the nearest neighbour parameter k to be close to one (see Table D6, and also
Figure 6 for an illustration). For PH0 , we can also guarantee reasonable performance with the choice
α = 0.5, which emphasises the contribution of small distances over large ones across edges of the minimum
spanning tree (Table D4). Choosing the hyperparameters that reduce GRIDE to MLE (with input from the distances to the two nearest neighbours) is often effective (Table D8). We also note that given such
hyperparameters, the performances of GRIDE and TwoNN are often similar in Tables C7 and C10.
We defer discussions about other estimators and their need to tune hyperparameters to specific data in
the captions of the benchmark results tables in Appendix C. For local estimators, the specific issue of tuning
the neighbourhood size and aggregation method is analysed in greater detail in Sections 3.4.1 and 3.4.2.
Remark 3.2. We emphasise that our empirical study is subject to our particular choices of hyperparame-
ter ranges, which cannot encompass all possible hyperparameter combinations due to finite computational
resources. We refer the reader to Appendix D for the range of hyperparameters chosen for each estimator,
which was guided by the literature.
Sample Economy We restrict our assessment here to how an estimator performs on low dimensional
datasets (dimension < 6), as most estimators face challenges with even moderately high dimensional datasets.
One surprising observation is the slow increase in most estimators’ accuracy with the number of samples,
at least in the regime of N ∈ {625, 1250, 2500, 5000} being tested. As summarised in Table 3, and detailed
in Appendix C, this is often true with global estimators such as PH0 , KNN, MiND ML, GRIDE, and TwoNN
where N ∈ {625, 1250} often results in accurate dimension estimation on low dimensional datasets. On the
other hand, while this is also observed on some local estimators such as ESS, TLE, other local estimators
such as lPCA, MLE only recover comparable performance in the N ∈ {2500, 5000} regime, suggesting that
they are more sample hungry. Given fewer samples, knn neighbourhoods have a larger effective radius and
so non-flat geometries in the dataset can further bias the estimation. Figure 3 demonstrates the behaviour
of all estimators on two datasets.
Accuracy on high dimensional datasets Overall, most estimators tend to underestimate, even on
their best hyperparameters. This is often especially acute on high dimensional datasets, and some nonlinear
datasets. One exception is lPCA, which appears to give the correct dimension every time. However, as we
will discuss below, this is mainly due to our ability to tune the hyperparameter to give the correct result;
with another hyperparameter the estimate could be substantially incorrect (see Figure 9).
As we would expect, most estimators are very good on low dimensional data, and struggle when the
dimension increases beyond 6. There are exceptions to this rule, with ESS and DANCo performing very
well on the high dimensional datasets. We note that, as the sample size increases, most estimators improve.
However, there are exceptions, with DANCo, ESS and FisherS not changing substantially on most datasets
as sample size increases.
Variance of dimension estimates Local estimators, such as lPCA, MLE, MiND ML, WODCap, and TLE,
tend to have a low variance. These estimators aggregate many local dimension estimates into a global
dimension estimate. The standard deviation on the mean, median, or harmonic mean of the local estimates
Figure 3: The best estimates from our benchmark tables for 625, 1250, 2500 and 5000 points for each
estimator on two different datasets, Mbeta and M8Nonlinear. As a general rule, increasing sample size
improves accuracy. However, to return a correct estimate would require a lot more than 5000 points from
these datasets for most estimators. Some estimators have a clear bias which does not reduce with increasing
sample sizes. Also seen is a differing level of responses to changing sample size, which appear broadly
consistent across the two datasets. For example, WODCap underestimates, becoming much better as sample
size increases, while on the other hand FisherS and lPCA do not change significantly from 625 points to 5000.
decreases with the number of local estimates. On the other hand, estimators such as PH0 and KNN suffer
from higher error sensitivity in high dimensions which can increase variance (see Remark 2.6).
In Figure 4, we visualise the effect of increasing sample size on the variance of the estimate on M1Sphere
(uniform samples on S 10 ⊂ R11 ). We note that the standard deviation of MiND ML, KNN and DANCo
decreases at a slower rate than the other estimators as the number of samples increases. Indeed, the standard
deviation of DANCo increases, which may be due to an unusual choice of hyperparameters yielding the best
mean estimate.
The variance might be expected to decay with increasing sample size at rate N −1 . However, we observe
that for N = 625 the variance is higher than this scaling would suggest. Under an N −1 scaling, the variances
at N = 625, 1250 and 2500 would be 8, 4 and 2 times greater than the final value at N = 5000.
In fact, taking the median of these ratios over the estimators (discounting lPCA, which has 0 variance, and
DANCo, due to the unusual behaviour at N = 5000), we observe 15.3, 5.1 and 1.8. This suggests that, for
many estimators, small sample sizes create additional variance because they result in larger neighbourhoods
that are less well approximated by flat spaces.
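As an illustration of this check, the following short Python sketch computes the expected variance ratios under the N −1 scaling and the observed ratios for a single estimator; the standard deviations used here are placeholder values, not figures from our tables.

```python
import numpy as np

# Expected variance ratios relative to N = 5000 under Var ~ 1/N scaling.
sample_sizes = np.array([625, 1250, 2500])
expected_ratios = 5000 / sample_sizes          # -> [8., 4., 2.]

# Placeholder standard deviations for one estimator (hypothetical values,
# not taken from the tables in Appendix C).
std_by_n = {625: 0.90, 1250: 0.48, 2500: 0.31, 5000: 0.22}

observed_ratios = np.array([std_by_n[n] ** 2 / std_by_n[5000] ** 2
                            for n in sample_sizes])
print("expected:", expected_ratios)
print("observed:", np.round(observed_ratios, 1))
# Taking the median of the observed ratios across estimators gives the
# 15.3, 5.1 and 1.8 quoted in the text.
```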
Figure 4: Standard deviation of dimension estimates of M1Sphere (uniform samples on S 10 ⊂ R11 ) over 20
runs, as shown in the tables in Appendix C, using the hyperparameters which provide the best estimate.
This plot has a logarithmic scale on both axes. lPCA has standard deviation zero, as does WODCap for
N = 2500, and so these points are not shown.
Nonlinear datasets Most estimators struggle considerably with these datasets. An exception is lPCA,
though, as discussed, this relies on hyperparameter selection to ensure good performance. We conjecture that,
since lPCA infers the dimension of the tangent plane, nonlinear features such as variable sampling density
generate a smaller bias than they do for parametric estimators such as MLE.
It is interesting to compare which estimators work well on which datasets. For example, MLE, PH0 , KNN,
GRIDE, TwoNN, MiND ML, and CorrInt tend to perform well on M3,4,8 Nonlinear and struggle with M12
Norm and Mn1,2 Nonlinear; yet it is the reverse with lPCA, DANCo, and ESS. One particularly difficult
dataset is Mbeta. As a 10-dimensional manifold with ambient dimension 40, the dimension and codimension
are both relatively high. Furthermore, density and curvature are variable.
Some estimators, such as PH0 , KNN, and CorrInt, are meant to capture a notion of dimension of a metric
measure space, which can be distinct from the dimension of the support of the measure. Hence, for datasets
such as samples from the normal distributions (M12 Norm), the estimated dimension of such estimators can
be different to the dimension of the support.
No tailoring of params Sample economy High dim (> 6) Low variance Nonlinear
lPCA ✓
MLE ✓
PH ✓ ✓
KNN ✓ ✓
WODCap ✓
GRIDE ✓ ✓
TwoNN ✓ ✓
DANCo ✓
MiND ML ✓ ✓
CorrInt ✓
ESS ✓ ✓ ✓ ✓
FisherS ✓
TLE ✓ ✓
Table 3: Qualitative assessment of performances of estimators on the benchmark dataset. None of the
estimators consistently perform well on the nonlinear datasets in the benchmark.
[Figure 5 plot: rows SO(4) and S 6; columns mean, harmonic mean and median aggregation; y-axis: estimated dimension; x-axis: k (nearest neighbours); legend: lPCA FO, KNN, MLE, TLE, CDim, WODCap, ESS, MiND MLi, Truth.]
Figure 5: Comparison of dimension estimates of SO(4) and S 6 for estimators with a k-nearest neighbour
parameter. The input data consists of 2500 points uniformly sampled on the manifolds, which have intrinsic
dimension 6. The k parameter is varied from 4 to 50 in steps of 2. For local methods – all on the figure, apart
from MiND ML and KNN – we vary the method of aggregation over local estimates. Note that WODCap does
not aggregate the local dimension estimates, but rather the estimated volume fraction of spherical caps. We
also remark that MiND ML, which is hard to distinguish on the plot, consistently returns an estimate of 6.
The estimators based on distributions of nearest neighbour distances (MLE and TLE; the integer-valued
MiND MLi is excluded) appear to converge to a common value as k grows, which for SO(4) is an
overestimate but for S 6 is an underestimate. For smaller values of k, both MLE and TLE tend to overestimate.
MLE, with harmonic mean aggregation, performs best for small values of k.
One clear potential source of bias for parametric estimators of this type is the failure of the underlying
hypotheses to hold. In general, these are, firstly, that a sample from a ball centred on p ∈ X yields points
whose distances from p are distributed as they would be for a sample with uniform density from a Euclidean
ball and, secondly, that the distances ri (p) for each p ∈ X are independently and identically distributed.
The failure of the first assumption can be caused by the failure of the embedding to be totally geodesic,
by the existence of non-manifold phenomena, including the presence of boundary points, and by variable
sampling density.
The second assumption is not true: for points p, q ∈ X which are close to each other the statistic ri is
positively correlated. Furthermore, since X arises from a binomial point process rather than a Poisson point
process, the existence of a densely sampled region with low values for ri necessarily implies that the remainder
of M is more sparsely sampled, so that for points p, q ∈ X which are far from each other, ri will be negatively
correlated. As argued in [67], it seems likely that the long-range effects are much weaker, and it is shown in [26]
that distance ratios are independent as long as they come from disjoint neighbourhoods. Trunk [104] found
spatial correlations were not significant by comparing empirical distributions to the theoretical distribution
which arises from the assumption of independence and using a Kolmogorov–Smirnov test. However, the
exact experimental procedure is unclear: from context it seems likely that at best d ≤ 4 is considered.
All of these errors will tend to become more evident as neighbourhood size increases, so that it is
reasonable to anticipate that parametric methods will experience a bias that increases with k, as MLE
appears to.
The remaining estimators, lPCA and CDim, which seek an approximating affine subspace, and ESS,
demonstrate a tendency to increase with k. For lPCA and CDim, the estimate relies on how well the
collection of k + 1 points approximates an affine subspace. The underlying hypotheses are less delicate,
with the failure to be totally geodesic and the existence of non-manifold points being the sources of error.
However, boundary points need not be an issue. As noted below, a good estimate requires k to be sufficiently
large. However, the data for SO(4) demonstrate how, in a high codimension setting, lPCA can significantly
overestimate if k is too high.
Throttling The number of points in a neighbourhood can “throttle” the dimension estimator, so that it
is impossible for it to return an estimate above a certain value.
For example, if knn neighbourhoods are used, then it is clear that lPCA can only give k non-zero eigenvalues
and so the dimension estimate will never exceed k. We can describe this as linear throttling: for
an accurate estimate the parameter k must grow linearly in d. This phenomenon is clear in Figure 7 and
Figure 8, where throttling occurs for 4 ≤ k ≤ 6.
However, there are also estimators with exponential throttling, where k must grow exponentially in d.
For example, the estimates of [60] suffer from their discrete nature. Volumes are approximated by numbers
of points in balls built from the knn graph. The bounded valence of the graph is what generates throttling
in this instance. The doubling property estimator, for example, can never exceed $\log_2\left(k + \frac{1}{k+1}\right)$. This is
because |Be (i, 1)| = k + 1 always and, since each knn of xi has at most k + 1 unique nearest neighbours, the
maximal value of |Be (i, 2)| is k(k + 1) + 1.
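To illustrate the severity of this cap, the short Python sketch below (standard library only, our own illustration) evaluates the bound for a few values of k.

```python
import math

def doubling_cap(k: int) -> float:
    """Upper bound log2(k + 1/(k+1)) on the doubling-property estimate
    when knn neighbourhoods with k neighbours are used."""
    return math.log2(k + 1 / (k + 1))

for k in (5, 10, 50, 100, 1000):
    print(k, round(doubling_cap(k), 2))
# 5 -> 2.37, 10 -> 3.33, 100 -> 6.64, 1000 -> 9.97: to certify dimension d,
# the neighbourhood size k must grow roughly like 2**d.
```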
The maximum value which can be returned by WODCap is $S^{-1}\!\left(\frac{2}{k+1}\right)$, where
$$S(\hat{d}) = I_{3/4}\!\left(\frac{\hat{d}+1}{2}, \frac{1}{2}\right).$$
The right term of the product on the right hand side goes to 0 as d → ∞; however, it is completely
dominated by the reciprocal of the integral. If we plug in d = 2, 20 and 200, we get k ≈ 19, 528 and
$2.6 \times 10^{14}$ respectively. From this it appears that, as d grows, k needs to grow exponentially. Since k is
intended to define a small neighbourhood, the number of points of the sample required for the method to
estimate d becomes completely unfeasible.
Probabilistic phenomena can cause throttling as well, in the sense that k must grow at a certain rate
for a correct estimate to be returned with a given probability. An example occurs with CDim, where
Figure 5 is suggestive of throttling. Note that the theoretical guarantees for CDim already require k to grow
exponentially with d. While this sample size is sufficient for the estimator to work, it need not be necessary.
The algorithm to compute the estimate finds the largest possible subset of directions to nearest neighbours
where all pairwise angles are at least π/2. In a high dimensional space, the angle between any two directions
is very likely to be close to π/2, so that for any given pair it is approximately equally likely that the angle
is greater than or less than π/2. For a subset of size $\hat{d}$, the probability that any given additional vector
can be added to it is approximately $2^{-\hat{d}}$. Since there are $k - \hat{d}$ neighbours to check, the probability that the
subset cannot be enlarged is
$$\left(\frac{2^{\hat{d}} - 1}{2^{\hat{d}}}\right)^{k - \hat{d}}.$$
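The following Python sketch (our own illustration of the heuristic above) evaluates this probability and the value of k needed before an enlargement is found with probability at least one half, showing the exponential growth in k.

```python
import math

def p_no_enlargement(d_hat: int, k: int) -> float:
    """Probability, under the heuristic model above, that none of the
    k - d_hat remaining neighbours extends a set of d_hat pairwise-obtuse
    directions."""
    return (1 - 2.0 ** (-d_hat)) ** (k - d_hat)

def k_needed(d_hat: int, success_prob: float = 0.5) -> int:
    """Smallest k for which an enlargement is found with probability at
    least success_prob."""
    return math.ceil(d_hat + math.log(1 - success_prob)
                     / math.log(1 - 2.0 ** (-d_hat)))

for d_hat in (4, 8, 12, 16):
    print(d_hat, k_needed(d_hat))
# Roughly d_hat + 0.7 * 2**d_hat neighbours are needed, i.e. k must grow
# exponentially in the dimension being certified.
```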
[Figure 6 plot: one panel per benchmark manifold; y-axis: estimated dimension; x-axis: k (number of neighbours); legend: median, IQR, Truth.]
Figure 6: Performance of KNN-estimator vs k on benchmark manifolds, for 5000 samples on each manifold.
We show the empirical median dimension estimate, and the interquartile range, over 20 sets of random
samples. Note the different scaling of the y-axis from dataset to dataset. The high variance for M10d Cubic
is due to the sensitivity to errors in high dimensions of the method; see Theorem 2.6.
[Figure 7 plot: lPCA dimension estimates of S 6 with 2500 samples; legend: Truth, maxgap, Kaiser, FO, ratio, broken stick, Fan, p. ratio, Minka; x-axis: k nearest neighbours.]
Figure 7: Dimension estimates from lPCA when neighbourhood sizes are varied from 4 to 100, for S 6 ⊂ R7 .
Colours indicate different threshold methods. Solid curves correspond to estimates using knn neighbour-
hoods, dashed curves correspond to using ϵ-neighbourhoods, where for each nearest neighbour value k, the
radius ϵ is chosen to be the median of the knn distances.
Once $\hat{d}$ is relatively large, it is therefore necessary to have a very large value of k in order to obtain a
new direction which is obtuse to the existing $\hat{d}$ directions with a given probability. In fact, using Markov’s
inequality and considering the matrix of inner product signs as in [100], we can see that the probability of
finding $\hat{d}$ such directions is at most $\binom{k}{\hat{d}} 2^{-\hat{d}(\hat{d}-1)/2}$, and so, by Stirling’s formula, for large d the probability of finding
$\hat{d}$ directions is bounded above by $C\,\frac{2^{k - \hat{d}(\hat{d}-1)/2}}{\sqrt{k}}$. This indicates that CDim suffers at least quadratic throttling.
All the tangential estimators described in this survey are vulnerable to at least linear throttling. This
appears inevitable, since the tangent space is the linear structure of the manifold, the affine space which
locally approximates the data. Unless k ≥ d we cannot hope to accurately recover the entire affine space.
This consideration makes clear that, for very high-dimensional datasets, the use of a tangential estimator
cannot be recommended. The presence of throttling can be detected in lPCA by comparing the estimated
dimension to the hyperparameter k and we recommend that implementations of lPCA warn users when
throttling is occurring.
Noise The use of smaller neighbourhoods is dangerous in the presence of noise, where it will tend to
produce overestimates, while in the presence of curvature larger neighbourhoods are more vulnerable. The
influence of noise is discussed in Section 3.5.
Figure 8: Dimension estimates from lPCA when neighbourhood sizes are varied from 4 to 100, for
SO(4) ⊂ R16 . Colours indicate different threshold methods. Solid curves correspond to estimates using
knn neighbourhoods, dashed curves correspond to using ϵ-neighbourhoods, where for each nearest neighbour
value k, the radius ϵ is chosen to be the median of the knn distances.
Table 4: Performance of lPCA on benchmark datasets, each consisting of 5000 points. We report the
mean and standard deviation of the dimension estimate over 20 samples. The knn neighbourhood is fixed
to be k = 80, and the local dimension estimates are aggregated using the mean. Individual hyperparameters
of the thresholding methods are the scikit-dimension defaults, as stated in Table D1.

The default values of α for the alphaFO and alphaRatio thresholds in scikit-dimension are both 0.05, producing a huge discrepancy between the two
methods. However, both methods are capable of producing the correct answer for the perfectly tuned choice of hyperparameter. It is
crucial that practitioners understand the potential sensitivity in the operation of estimators before applying
them.
Figure 9: We consider 5000 points on M6 Nonlinear and fix the neighbourhood size at 50. Searching a
range of values for the hyperparameter α in both alphaFO and alphaRatio, we observe that the methods
mirror each other. Both can produce the correct dimension, 6, but only for a narrow bandwidth of the
hyperparameters. Outside of these bandwidths the estimates can vary significantly. The comparison shows
that, although α = 0.05 is the "out of the box" default for both methods in scikit-dimension, the
results are very different. Hence, a thorough understanding of what the hyperparameters represent, and of
their potential importance, is essential.
The hyperparameters for the noise experiments were fixed based on values previously optimized for the S 10 dataset under noise-free conditions. The final results, presented
in the tables, represent medians calculated from 20 experimental repetitions.
The complete results are presented in Tables E1–E4. Some of the most notable findings are illustrated
in Figure 10.
The estimators’ behavior in the presence of noise varies significantly, depending on both the type and
magnitude of the applied noise. Furthermore, a comparison of the lPCA and MiND ML estimators reveals
that an estimator’s robustness cannot be judged solely by its performance on a single dataset. While these
estimators perform exceptionally well for a 10-dimensional hypersphere (both with and without noise), a
small amount of ambient Gaussian noise causes a significant overestimation for a 6-dimensional hypersphere.
All tested estimators exhibit susceptibility to ambient Gaussian noise. This is particularly evident when
there is a substantial difference between the intrinsic dimension and the embedding dimension. In such
instances, even a very small noise level (with a standard deviation of 0.01) significantly alters the results for
some estimators (e.g., lPCA and MiND ML in Table E1).
It is crucial to note that the parameters used for these experiments were optimized for the uncorrupted
data. As some methods, such as CorrInt and GRIDE, have parameters which are intended to prevent selecting
neighbours on the noise scale, adjusting these parameters could potentially improve results for noisy data. Of
the tested estimators, CorrInt demonstrates the highest resistance to ambient Gaussian noise. Interestingly,
estimates from FisherS decrease as the standard deviation of the disturbances increases, even though the
disturbances are of a higher dimension than the initial dataset.
Most estimators demonstrated robustness to outliers, with the exception of FisherS, for which the addition
of outliers led to a significant reduction in dimension estimates.
Algorithm 1 Generate data set with outliers
procedure AddOutliers(D, nout)
    Input: Clean dataset D with n observations of dimension d; number of outliers nout
    Output: Dataset with outliers
    out indices ← random choice([1, ..., n], nout)    ▷ Sample indices for outliers
    for each index in out indices do
        point ← D[index]
        for i ← 1 to d do
            m ← random(3, 6)    ▷ Random multiplier between 3 and 6
            point[i] ← point[i] · m
        end for
        D[index] ← point
    end for
    return D
end procedure
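For convenience, a minimal Python (NumPy) sketch of Algorithm 1 is given below; the function name, the use of continuous uniform multipliers, and the example embedding of S 6 into R11 by zero-padding are our own choices and may differ in detail from the implementation used for the experiments.

```python
import numpy as np

def add_outliers(data: np.ndarray, n_out: int, rng=None) -> np.ndarray:
    """Turn n_out randomly chosen observations into outliers by multiplying
    each coordinate by an independent random factor in [3, 6], following
    Algorithm 1."""
    rng = np.random.default_rng() if rng is None else rng
    corrupted = data.copy()
    out_indices = rng.choice(len(corrupted), size=n_out, replace=False)
    multipliers = rng.uniform(3.0, 6.0, size=(n_out, corrupted.shape[1]))
    corrupted[out_indices] *= multipliers
    return corrupted

# Example: corrupt 125 points of a 2500-point sample on S^6, zero-padded
# into R^11 (an illustrative embedding).
rng = np.random.default_rng(0)
sphere = rng.normal(size=(2500, 7))
sphere /= np.linalg.norm(sphere, axis=1, keepdims=True)   # uniform on S^6
points = np.hstack([sphere, np.zeros((2500, 4))])
noisy = add_outliers(points, n_out=125, rng=rng)
```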
Dataset d n
S6 6 11
S 10 10 11
SO(4) 6 16
Table 5: Datasets used for noise experiments. Intrinsic dimension of dataset and embedding dimensions are
marked respectively by d and n.
Figure 10: Left: We compare the average performance over 20 runs for the best performing hyperparameters
against the “out of the box” hyperparameters for each estimator on the M6 Nonlinear data set with 5000
points. Right: We compare the average performance over 10 runs across two types of corruption, Gaussian
noise with a standard deviation of 0.1 and 125 outliers as described in Algorithm 1, against a baseline of no noise.
A sample of 2500 points on S 6 ⊂ R11 was used. The hyperparameters used were the best hyperparameters
for S 10 ⊂ R11 . Most estimators can handle outliers well but struggle with Gaussian noise. For FisherS this is reversed.
Figure 11: Demonstration of how curvature affects PCA when points are drawn from a non-linear curve in
R2 . By comparison, using lPCA mitigates the effect of curvature. If the neighbourhoods are chosen too large
lPCA will also begin to struggle.
Consider, for example, the performance of PCA and of lPCA on points drawn from a non-linear curve in
R2, as shown in Figure 11.
We investigate the effect of curvature on two pointwise dimension estimators, lPCA and MLE, using a
selection of paraboloids as well as the standard embedding of a torus in R3 as illustrative examples. These
datasets are shown in Figures 12 to 14.
We consider a family of elliptic and hyperbolic paraboloids given by the equations $2x^2 \pm \frac{y^2}{b^2} = z$. One
principal curvature is fixed, while the second varies. We estimate the pointwise dimension at the point
(0, 0, 0). This is repeated 1500 times for each surface (sampled uniformly with the same density, so that
when b = 1 we sample 10000 points). The dimension is estimated using lPCA with a range of nearest
neighbour values (20 to 165 in steps of 5, giving 30 values of k). We count how many values of k give
an estimate of 2 and also the largest value of k that gives a dimension 2 estimate.
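A sketch of this pointwise experiment is given below. It uses a hand-rolled local PCA with a Fukunaga–Olsen style threshold (eigenvalues larger than α times the largest, with α = 0.05) rather than the scikit-dimension implementation, and a simple uniform sampling of the parameter domain rather than the constant-surface-density sampling described above, so it is illustrative only.

```python
import numpy as np

def sample_paraboloid(n, b, rng, half_width=1.0):
    """Sample (x, y) uniformly on a square and lift to z = 2x^2 + y^2/b^2
    (illustrative sampling only; the experiments above sample the surfaces
    with a fixed density)."""
    xy = rng.uniform(-half_width, half_width, size=(n, 2))
    z = 2 * xy[:, 0] ** 2 + xy[:, 1] ** 2 / b ** 2
    return np.column_stack([xy, z])

def lpca_fo_at_point(data, point, k, alpha=0.05):
    """Local PCA dimension at `point`: the number of covariance eigenvalues
    exceeding alpha times the largest eigenvalue."""
    dists = np.linalg.norm(data - point, axis=1)
    neighbours = data[np.argsort(dists)[:k]]
    eigvals = np.linalg.eigvalsh(np.cov(neighbours, rowvar=False))[::-1]
    return int(np.sum(eigvals > alpha * eigvals[0]))

rng = np.random.default_rng(1)
cloud = sample_paraboloid(10_000, b=1.0, rng=rng)
ks = list(range(20, 166, 5))                     # 30 values of k, as above
estimates = [lpca_fo_at_point(cloud, np.zeros(3), k) for k in ks]
print(sum(e == 2 for e in estimates), "of", len(ks), "values of k estimate 2")
```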
In Figure 15 we observe a clear improvement (more k values giving dimension 2, and larger values of k giving
dimension 2) as the Gauss curvature increases from negative to zero. This improvement continues
until the second principal curvature is κ2 ≃ 0.6. This suggests that for lPCA using “FO” and α = 0.05,
negative curvature creates an upward bias. If κ1 is non-zero, then performance is best when κ2 ≃ κ1 /8, rather
than κ2 = 0. It would be informative to see whether this relationship holds for different values of κ1 . We
note that the standard deviations are large, but the trend is clear and visible in both statistics.
On the other hand, for MLE the results are quite different. The largest value of k considered always gives
a dimension estimate of 2, and the number of values of k yielding dimension 2 estimates is on average between
27.5 and 29 out of 30. We therefore study the effect of curvature here by instead averaging the pointwise dimension
estimate at (0, 0, 0) over 150 runs for each k. This is shown in Figure 16.
In Figures 17 and 18 we plot the average pointwise dimension estimate for a fixed k against κ2 . We find
that for MLE, for all k, there is a trend to slightly underestimate the dimension as curvature increases. As k
increases, the estimate also decreases, as does the standard deviation, which does not depend on curvature.
For lPCA, however, the estimates get worse as k increases, and for a fixed k we see
the same trend as in Figure 15, namely that the best estimates are made at a positive curvature. For lPCA
the standard deviations are also much larger than for MLE.
To study the impact of curvature on the torus, we sampled 10000 points uniformly from the torus in
R3 . The degree of overestimation of each estimator is measured by counting the number of points with
pointwise estimated dimension 3 (rounded to the nearest integer). To examine how the estimate varies with
curvature, we plot the cumulative distribution of overestimated points against |ϕ − π|, where ϕ is the angle
of the inner circle from 0 to 2π. In these co-ordinates, |ϕ − π| = 0 corresponds to the inside of the torus
(negative curvature), π/2 to the top and bottom of the torus (zero curvature), and π to the outside of
the torus (positive curvature). In particular, we observe that for lPCA, as k increases and the neighbourhood
captures more curvature, pointwise dimension estimates of 3 appear on the inside of the torus (points
of negative curvature) and decrease in frequency towards the outside of the torus. This is captured in Figure 19.
Figure 12: Points drawn from an elliptic paraboloid. The curvature at the origin is positive.
Figure 13: Points drawn from a hyperbolic paraboloid. The curvature at the origin is negative.
Figure 14: Sample from a torus. Red points have negative Gauss curvature, blue points have positive
curvature.
(a) Number of choices of k giving an estimate of 2 (b) Largest choice of k giving an estimate of 2
Figure 15: Both figures show the same trend. As curvature increases from negative to positive, the number
of choices of k and the largest choice of k giving an estimate of 2 both increase. This trend continues past
Gauss curvature 0. At κ2 ≃ 0.6 this trend reverses. In the case of the torus below, κ2 ≤ 1, making this
reversal hard to detect.
Figure 16: Average pointwise dimension estimate at (0, 0, 0) for MLE, for data from the positively curved
$2x^2 + \frac{y^2}{b^2} = z$ on the left and from the negatively curved $2x^2 - \frac{y^2}{b^2} = z$ on the right. The same trend appears
in both: as k is increased, the pointwise estimates decrease.
Figure 17: For MLE there is a trend to underestimate as either κ2 or the neighbourhood size k is increased.
The standard deviation stays reasonably constant for changes of κ2 , while it decreases with k.
Figure 18: For lPCA we recover the same trend in the estimate with changes in κ2 which we saw in Figure 15
for k ≥ 40. We also see the high standard deviations in the estimate, which occur because we are only considering
one point for each dataset.
(a) lPCA (b) MLE
Figure 19: Using a torus embedded in R3 as an example, we investigate the tendency to overestimate
dimension as the local geometry is varied. lPCA and MLE exhibit divergent responses. The x-axis represents
the position of points on the torus, with 0 representing the innermost latitude (negative curvature), π/2 the
top and bottom of the torus (flat), and π the outermost latitude (positive curvature). Larger neighbourhood
size n results in overestimation by lPCA, but better accuracy for MLE. For lPCA this phenomenon is first
apparent in negative curvature and, as n grows, it spreads to the positively curved region. In the worst
case MLE performs significantly better than lPCA. The figure also emphasises the great importance of
hyperparameter choice.
The variety of approaches reviewed – based on tangential structure, parametric models, and topological
and metric invariants – demonstrates that there are a variety of perspectives from which dimension estimation
can be approached. We find that the best estimators tend to lie in the parametric family. Persistent homology
provides a reasonably successful estimator from a topological perspective.
A tendency towards underestimation for datasets of higher dimensions is confirmed. However, we caution
against generalisation. Experiments on data drawn from SO(n) often demonstrate overestimation. We
believe the cause of underestimations comes principally from concentration of measure near the boundary
and from finite size effects. Empirical corrections attempt to reverse this underestimation tendency but, if
the dataset under study is not similar to the ones used to calculate the correction factor, this can lead to
large errors.
The range of datasets used allowed us to assess the estimators over a wide variety of desirable criteria.
We find that no estimators provide a satisfactory level of performance on non-linear datasets with dimension
above six. However, ESS performs strongly on all other criteria. We strongly recommend that future
researchers include ESS as a comparison method to benchmark performance.
Our additions to the methods of scikit-dimension (GeoMLE, GRIDE, CDim, WODCap, Camastra and
Vinciarelli’s extension of the Grassberger–Procaccia CorrInt algorithm, the packing-number based estimator,
and the magnitude and PH0 estimators) expand the range of methods practitioners can draw on in an easily
accessible place. We have also added new functionality for lPCA and MLE, so that users can choose ϵ-
neighbourhoods in addition to knn neighbourhoods, giving greater freedom to practitioners. Finally, we have
added a probabilistic thresholding method for PCA [78].
Given the increasing use of PH0 and magnitude dimension estimators in the applied topology community,
especially on machine learning problems [5, 69], we give a more detailed investigation and benchmarking of
the performance of these estimators. While PH0 performs comparably to the other estimators investigated here, the
estimation of magnitude dimension suffers from finite size effects, as detailed in Appendix B.
We demonstrate that the choice of hyperparameters is crucial and that it is essential for practitioners to
understand the role they play for each estimator. Our recommendation is that a range of hyperparameters
be used to build confidence in the result. We also recommend that developers of dimension estimators consider
the theoretical limits that hyperparameter choices place on the range of dimension estimates an estimator can
return, as we have identified a throttling phenomenon that results from poor choices of hyperparameters across
several estimators.
Gaussian noise presents an issue for most estimators. However, local estimators can overcome certain
types of outliers through aggregation.
We provide evidence that curvature plays an important role in estimation. We confirm
the known negative effects of curvature on lPCA, but find that, for at least one aggregation method, slightly
positively curved surfaces can be easier to estimate than those with zero Gauss curvature. The effects of
curvature on MLE are much smaller than on lPCA, which may generalise to a statement that parametric
estimators are more robust to curvature than tangential estimators.
Areas that require major progress within this field are estimator performance on non-linear manifolds and
high dimensional manifolds, as well as the development of practical ways to guide hyperparameter choice.
Alternatively, it would be worth investigating adaptive methods, automating the choice of hyperparameters
using features of the input data. The limited and varied results shown here on the role of curvature clearly
justify a systematic approach to determining to what extent curvature has an impact on dimension estimators.
Acknowledgements
JB: This work was supported by the Additional Funding Programme for Mathematical Sciences, delivered
by EPSRC (EP/V521917/1) and the Heilbronn Institute for Mathematical Research. PD and JM: This work
was supported by Dioscuri program initiated by the Max Planck Society, jointly managed with the National
Science Centre (Poland), and mutually funded by the Polish Ministry of Science and Higher Education and
the German Federal Ministry of Education and Research. JH and KMY: This work was supported by a
UKRI Future Leaders Fellowship [grant number MR/W01176X/1; PI J Harvey]. This material is based in
part upon work supported by the National Science Foundation under Grant No. DMS-1928930, while JH
was in residence at the Simons Laufer Mathematical Sciences Institute in Berkeley, California, during Fall
2024.
References
[1] Henry Adams, Elin Farnell, Manuchehr Aminian, Michael Kirby, Joshua Mirth, Rachel Neville, Chris
Peterson, and Clayton Shonkwiler. A Fractal Dimension for Measures via Persistent Homology. In
Nils A Baas, Gunnar E Carlsson, Gereon Quick, Markus Szymik, and Marius Thaule, editors, Topo-
logical Data Analysis, pages 1–31, Cham, 2020. Springer International Publishing.
[2] Luca Albergante, Jonathan Bac, and Andrei Zinovyev. Estimating the effective dimension of large
biological datasets using Fisher separability analysis. In 2019 International Joint Conference on Neural
Networks (IJCNN), pages 1–8, Budapest, 2019. IEEE.
[3] Laurent Amsaleg, Oussama Chelly, Teddy Furon, Stéphane Girard, Michael E. Houle, Ken-ichi
Kawarabayashi, and Michael Nett. Extreme-value-theoretic estimation of local intrinsic dimension-
ality. Data Mining and Knowledge Discovery, 32(6):1768–1805, 11 2018.
[4] Laurent Amsaleg, Oussama Chelly, Michael E Houle, Miloš Radovanović, and Weeris Treeratanajaru.
Intrinsic Dimensionality Estimation within Tight Localities. In Proceedings of the 2019 SIAM Inter-
national Conference on Data Mining (SDM), pages 181–189, Calgary, 2019. SIAM.
[5] Rayna Andreeva, Katharina Limbeck, Bastian Rieck, and Rik Sarkar. Metric Space Magnitude and
Generalisation in Neural Networks. In Timothy Doster, Tegan Emerson, Henry Kvinge, Nina Miolane,
Mathilde Papillon, Bastian Rieck, and Sophia Sanborn, editors, Proceedings of 2nd Annual Workshop
on Topology, Algebra, and Geometry in Machine Learning (TAG-ML), volume 221 of Proceedings of
Machine Learning Research, pages 242–253. PMLR, 8 2023.
[6] Jonathan Bac, Evgeny M. Mirkes, Alexander N. Gorban, Ivan Tyukin, and Andrei Zinovyev. Scikit-
Dimension: A Python Package for Intrinsic Dimension Estimation. Entropy, 23(10):1368, 10 2021.
[7] Mukund Balasubramanian and Eric L. Schwartz. The isomap algorithm and topological stability.
Science, 295(5552), 2002.
[8] Jillian Beardwood, J. H. Halton, and J. M. Hammersley. The shortest path through many points.
Mathematical Proceedings of the Cambridge Philosophical Society, 55(4):299–327, 10 1959.
[9] Zsigmond Benkő, Marceli Stippinger, Roberta Rehus, Attila Bencze, Dániel Fabó, Boglárka Hajnal,
Loránd G. Eröss, András Telcs, and Zoltán Somogyvári. Manifold-adaptive dimension estimation
revisited. PeerJ Computer Science, 8, 2022.
[10] Tolga Birdal, Aaron Lou, Leonidas J Guibas, and Umut Simsekli. Intrinsic Dimension, Persistent Ho-
mology and Generalization in Neural Networks. In M Ranzato, A Beygelzimer, Y Dauphin, P S Liang,
and J Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34,
pages 6776–6789. Curran Associates, Inc., 2021.
[11] Christopher Bishop. Bayesian PCA. Advances in Neural Information Processing Systems, 11, 1998.
[12] Christopher J. Bishop and Yuval Peres. Fractals in Probability and Analysis. Cambridge University
Press, 2016.
[13] Adam Block, Zeyu Jia, Yury Polyanskiy, and Alexander Rakhlin. Intrinsic Dimension Estimation
Using Wasserstein Distance. Journal of Machine Learning Research, 23:1–37, 2022.
[14] Andreas Buja and Nermin Eyuboglu. Remarks on Parallel Analysis. Multivariate Behavioral Research,
27(4):509–540, 10 1992.
[15] Francesco Camastra. Data dimensionality estimation methods: a survey. Pattern Recognition,
36(12):2945–2954, 12 2003.
[16] Francesco Camastra and Antonino Staiano. Intrinsic dimension estimation: Advances and open prob-
lems. Information Sciences, 328:26–41, 1 2016.
[17] Francesco Camastra and Alessandro Vinciarelli. Intrinsic Dimension Estimation of Data: An Approach
Based on Grassberger–Procaccia’s Algorithm. Neural Processing Letters, 14(1):27–34, 8 2001.
[18] P. Campadelli, E. Casiraghi, C. Ceruti, and A. Rozza. Intrinsic Dimension Estimation: Relevant
Techniques and a Benchmark Framework. Mathematical Problems in Engineering, 2015:1–21, 2015.
[19] Luca Candelori, Alexander G. Abanov, Jeffrey Berger, Cameron J. Hogan, Vahagn Kirakosyan, Kharen
Musaelian, Ryan Samson, James E.T. Smith, Dario Villani, Martin T. Wells, and Mengjia Xu. Robust
estimation of the intrinsic dimension of data sets with quantum cognition machine learning. Scientific
Reports, 15(1), 12 2025.
[20] Claudio Ceruti, Simone Bassis, Alessandro Rozza, Gabriele Lombardi, Elena Casiraghi, and Paola
Campadelli. DANCo: An intrinsic dimensionality estimator exploiting angle and norm concentration.
Pattern Recognition, 47(8):2569–2581, 2014.
[21] Siu-Wing Cheng and Man-Kwun Chiu. Dimension Detection via Slivers. In Proceedings of the Twen-
tieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1001–1010. Society for Industrial
and Applied Mathematics, 2009.
[22] Jose A. Costa and Alfred O. Hero. Entropic graphs for manifold learning. In Conference Record of the
Asilomar Conference on Signals, Systems and Computers, volume 1, 2003.
[23] Jose A. Costa and Alfred O. Hero. Geodesic entropic graphs for dimension and entropy estimation in
manifold learning. IEEE Transactions on Signal Processing, 52(8):2210–2221, 8 2004.
[24] Jose A. Costa and Alfred O. Hero. Learning intrinsic dimension and intrinsic entropy of high-
dimensional datasets. In European Signal Processing Conference, volume 06-10-September-2004, 2015.
[25] J. M. Craddock and C. R. Flood. Eigenvectors for representing the 500 mb geopotential surface over
the Northern Hemisphere. Quarterly Journal of the Royal Meteorological Society, 95(405):576–593, 7
1969.
[26] Francesco Denti, Diego Doimo, Alessandro Laio, and Antonietta Mira. The generalized ratios intrinsic
dimension estimator. Scientific Reports, 12(1):20005, 11 2022.
[27] Antonio Di Noia, Iuri Macocco, Aldo Glielmo, Alessandro Laio, and Antonietta Mira. Beyond the
noise: intrinsic dimension estimation with optimal neighbourhood identification. arXiv:2405.15132v2,
5 2024.
[28] Kevin Dunne. Metric Space Spread, Intrinsic Dimension and the Manifold Hypothesis.
arXiv:2308.01382, 8 2023.
[29] Benjamin Dupuis, George Deligiannidis, and Umut Simsekli. Generalization bounds using data-
dependent fractal dimensions. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engel-
hardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference
on Machine Learning, pages 8922–8968, Honolulu, 2023. PMLR.
[30] J.-P. Eckmann and D. Ruelle. Fundamental limitations for estimating dimensions and Lyapunov
exponents in dynamical systems. Physica D: Nonlinear Phenomena, 56(2-3):185–187, 5 1992.
[31] Vittorio Erba, Marco Gherardi, and Pietro Rotondo. Intrinsic dimension estimation for locally under-
sampled data. Scientific Reports, 9(1):17133, 11 2019.
[32] Elena Facco, Maria d’Errico, Alex Rodriguez, and Alessandro Laio. Estimating the intrinsic dimension
of datasets by a minimal neighborhood information. Scientific Reports, 7(1):12140, 9 2017.
[33] Kenneth Falconer. Alternative Definitions of Dimension. In Fractal Geometry, chapter 3, pages 39–58.
John Wiley & Sons, Ltd, 2003.
[34] Mingyu Fan, Nannan Gu, Hong Qiao, and Bo Zhang. Intrinsic dimension estimation of data by
principal component analysis. arXiv, 2 2010.
[35] Amir massoud Farahmand, Csaba Szepesvári, and Jean-Yves Audibert. Manifold-adaptive dimension
estimation. In Proceedings of the 24th international conference on Machine learning, pages 265–272,
New York, NY, USA, 6 2007. ACM.
[36] Serge Frontier. Étude de la décroissance des valeurs propres dans une analyse en composantes prin-
cipales: Comparaison avec le modèle du bâton brisé. Journal of Experimental Marine Biology and
Ecology, 25(1):67–75, 11 1976.
[37] K. Fukunaga and D.R. Olsen. An Algorithm for Finding Intrinsic Dimensionality of Data. IEEE
Transactions on Computers, C-20(2):176–183, 2 1971.
[38] Xin Geng, De Chuan Zhan, and Zhi Hua Zhou. Supervised nonlinear dimensionality reduction for
visualization and classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cy-
bernetics, 35(6):1098–1107, 12 2005.
[39] Benyamin Ghojogh, Mark Crowley, Fakhri Karray, and Ali Ghodsi. Elements of Dimensionality Re-
duction and Manifold Learning. Springer International Publishing, Cham, 2023.
[40] Dejan Govc and Richard Hepworth. Persistent magnitude. Journal of Pure and Applied Algebra,
225(3):106517, 3 2021.
[41] Daniele Granata and Vincenzo Carnevale. Accurate Estimation of the Intrinsic Dimension Using Graph
Distances: Unraveling the Geometric Complexity of Datasets. Scientific Reports, 6, 8 2016.
[42] Peter Grassberger and Itamar Procaccia. Measuring the strangeness of strange attractors. Physica D:
Nonlinear Phenomena, 9(1):189–208, 1983.
[43] Alfred Gray. Comparison theorems for the volumes of tubes as generalizations of the Weyl tube formula.
Topology, 21(2):201–228, 1982.
[44] Matthias Hein and Jean-Yves Audibert. Intrinsic Dimensionality Estimation of Submanifolds in R^d.
In Proceedings of the 22nd International Conference on Machine Learning, pages 289–296, Bonn, 2005.
Association for Computing Machinery.
[45] Christian Horvat and Jean-Pascal Pfister. Intrinsic dimensionality estimation using Normalizing Flows.
In S Koyejo, S Mohamed, A Agarwal, D Belgrave, K Cho, and A Oh, editors, Advances in Neural
Information Processing Systems, volume 35, pages 12225–12236. Curran Associates, Inc., 2022.
[46] Alexander Ivanov, Gleb Nosovskiy, Alexey Chekunov, Denis Fedoseev, Vladislav Kibkalo, Mikhail
Nikulin, Fedor Popelenskiy, Stepan Komkov, Ivan Mazurenko, and Aleksandr Petiushko. Manifold
Hypothesis in Data Analysis: Double Geometrically-Probabilistic Approach to Manifold Dimension
Estimation. arXiv:2107.03903, 7 2021.
[47] William B. Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space.
In Richard Beals, Anatole Beck, Alexandra Bellow, and Arshag Hajian, editors, Conference on Modern
Analysis and Probability, pages 189–206. American Mathematical Society, Providence, RI, 1984.
[48] Kerstin Johnsson, Charlotte Soneson, and Magnus Fontes. Low bias local intrinsic dimension estimation
from expected simplex skewness. IEEE Transactions on Pattern Analysis and Machine Intelligence,
37(1):196–202, 1 2015.
[49] I. T. Jolliffe. Principal Component Analysis. Springer-Verlag, New York, 2002.
[50] Iolo Jones. Manifold Diffusion Geometry: Curvature, Tangent Spaces, and Dimension.
arXiv:2411.04100, 11 2024.
[51] Zaher Joukhadar, Hanxun Huang, and Sarah Monazam Erfani. Bayesian Estimation Approaches
for Local Intrinsic Dimensionality. In Edgar Chávez, Benjamin Kimia, Jakub Lokoč, Marco Patella,
and Jan Sedmidubsky, editors, Similarity Search and Applications, volume 15268 of Lecture Notes in
Computer Science, pages 111–125, Cham, 2025. Springer Nature Switzerland.
[52] Henry F Kaiser. The Application of Electronic Computers to Factor Analysis. Educational and
Psychological Measurement, 20(1):141–151, 1960.
[53] Hamidreza Kamkari, Brendan Leigh Ross, Rasa Hosseinzadeh, Jesse C Cresswell, and Gabriel Loaiza-
Ganem. A Geometric View of Data Complexity: Efficient Local Intrinsic Dimension Estimation with
Diffusion Models. Technical report, 2024.
[54] Rasa Karbauskaite and Gintautas Dzemyda. Fractal-Based Methods as a Technique for Estimating
the Intrinsic Dimensionality of High-Dimensional Data: A Survey. Informatica, 27(2):257–281, 2016.
[55] Tommi Kärkkäinen and Jan Hänninen. Additive autoencoder for dimension estimation. Neurocomput-
ing, 551, 9 2023.
[56] Hirokazu Katsumasa, Emily Roff, and Masahiko Yoshinaga. Is magnitude ’generically continuous’ for
finite metric spaces? arXiv:2501.08745, 1 2025.
[57] Balázs Kégl. Intrinsic Dimension Estimation Using Packing Numbers. In S Becker, S Thrun, and
K Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 697–704. MIT
Press, 2002.
[58] Harry Kesten and Sungchul Lee. The central limit theorem for weighted minimal spanning trees on
random points. The Annals of Applied Probability, 6(2):495–527, 1996.
[59] Jisu Kim, Alessandro Rinaldo, and Larry Wasserman. Minimax rates for estimating the dimension of
a manifold. Journal of Computational Geometry, 10(1):42–95, 2019.
[60] Matthäus Kleindessner and Ulrike Luxburg. Dimensionality estimation without distances. In Guy
Lebanon and S. V. N. Vishwanathan, editors, Proceedings of the Eighteenth International Conference
on Artificial Intelligence and Statistics, pages 471–479, 2015.
[61] Gady Kozma, Zvi Lotker, and Gideon Stupp. The minimal spanning tree and the upper box dimension.
Proceedings of the American Mathematical Society, 134(4):1183–1187, 9 2005.
[62] Gady Kozma, Zvi Lotker, and Gideon Stupp. On the connectivity threshold for general uniform metric
spaces. Information Processing Letters, 110(10):356–359, 4 2010.
[63] Anna Krakovská and Martina Chvosteková. Simple correlation dimension estimator and its use to
detect causality. Chaos, Solitons and Fractals, 175, 10 2023.
[66] Tom Leinster and Simon Willerton. On the asymptotic magnitude of subsets of Euclidean space.
Geometriae Dedicata, 164(1):287–310, 6 2013.
[67] Elizaveta Levina and Peter J Bickel. Maximum Likelihood Estimation of Intrinsic Dimension. In
Advances in Neural Information Processing Systems 17, 2004.
[68] Uzu Lim, Harald Oberhauser, and Vidit Nanda. Tangent Space and Dimension Estimation with the
Wasserstein Distance. SIAM Journal on Applied Algebra and Geometry, 8:650–685, 10 2024.
[69] Katharina Limbeck, Rayna Andreeva, Rik Sarkar, and Bastian Rieck. Metric space magnitude for
evaluating the diversity of latent representations. In NIPS ’24: Proceedings of the 38th International
Conference on Neural Information Processing Systems, pages 123911–123953, Vancouver, BC, Canada,
2025. Curran Associates Inc.
[70] Tong Lin and Hongbin Zha. Riemannian manifold learning. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 30(5):796–809, 5 2008.
[71] Gabriele Lombardi, Alessandro Rozza, Claudio Ceruti, Elena Casiraghi, and Paola Campadelli. Mini-
mum Neighbor Distance Estimators of Intrinsic Dimension. In D Gunopulos, T Hofmann, D Malerba,
and M Vazirgiannis, editors, Machine Learning and Knowledge Discovery in Databases. ECML PKDD
2011., volume 6912, pages 374–389. Springer, 2011.
[72] David J.C. MacKay and Zoubin Ghahramani. Comments on ’Maximum Likelihood Estimation of
Intrinsic Dimension’ by E. Levina and P. Bickel (2004), 2005.
[73] Iuri Macocco, Aldo Glielmo, Jacopo Grilli, and Alessandro Laio. Intrinsic Dimension Estimation for
Discrete Metrics. Physical Review Letters, 130(6), 2 2023.
[74] Pertti Mattila. Geometry of sets and measures in Euclidean spaces. Cambridge University Press,
Cambridge, 1995.
[75] Leland McInnes, John Healy, Nathaniel Saul, and Lukas Großberger. UMAP: Uniform Manifold
Approximation and Projection. Journal of Open Source Software, 3(29), 2018.
[76] Mark W. Meckes. Positive definite metric spaces. Positivity, 17(3):733–757, 9 2013.
[77] Mark W. Meckes. Magnitude, Diversity, Capacities, and Dimensions of Metric Spaces. Potential
Analysis, 42(2):549–572, 2 2015.
[78] Thomas P. Minka. Automatic choice of dimensionality for PCA. In Advances in Neural Information
Processing Systems, 2001.
[79] James Raymond Munkres. Topology (2nd Edition). Prentice Hall, Inc, 2000.
[80] Artem R. Oganov and Mario Valle. How to quantify energy landscapes of solids. Journal of Chemical
Physics, 130(10), 2009.
[81] Miguel O’Malley, Sara Kalisnik, and Nina Otter. Alpha magnitude. Journal of Pure and Applied
Algebra, 227(11):107396, 11 2023.
[82] Kadir Özçoban, Murat Manguoğlu, and Emrullah Fatih Yetkin. A Novel Approach for Intrinsic Di-
mension Estimation. arXiv:2503.09485v1 [cs.LG] 12 Mar 2025, 3 2025.
[83] Panagiotis G. Papaioannou, Ronen Talmon, Ioannis G. Kevrekidis, and Constantinos Siettos. Time-
series forecasting using manifold learning, radial basis function interpolation, and geometric harmonics.
Chaos, 32(8), 8 2022.
[84] Karl W. Pettis, Thomas A. Bailey, Anil K. Jain, and Richard C. Dubes. An Intrinsic Dimensionality
Estimator from Near-Neighbor Information. IEEE Transactions on Pattern Analysis and Machine
Intelligence, PAMI-1(1):25–37, 1979.
[85] Haiquan Qiu, Youlong Yang, and Benchong Li. Intrinsic dimension estimation based on local adjacency
information. Information Sciences, 558:21–33, 5 2021.
[86] Haiquan Qiu, Youlong Yang, and Hua Pan. Underestimation modification for intrinsic dimension
estimation. Pattern Recognition, 140, 8 2023.
[87] Haiquan Qiu, Youlong Yang, and Saeid Rezakhah. Intrinsic dimension estimation method based on
correlation dimension and kNN method. Knowledge-Based Systems, 235, 1 2022.
[88] Davide Risso, Fanny Perraudeau, Svetlana Gribkova, Sandrine Dudoit, and Jean Philippe Vert. A gen-
eral and flexible method for signal extraction from single-cell RNA-seq data. Nature Communications,
9(1), 12 2018.
[89] A. Rozza, G. Lombardi, C. Ceruti, E. Casiraghi, and P. Campadelli. Novel high intrinsic dimensionality
estimators. Machine Learning, 89(1-2):37–65, 10 2012.
[90] Alessandro Rozza, Gabriele Lombardi, Marco Rosa, Elena Casiraghi, and Paola Campadelli. IDEA:
Intrinsic Dimension Estimation Algorithm. In Image Analysis and Processing – ICIAP 2011. Springer,
Berlin, Heidelberg., 2011.
[91] John W Sammon. A Nonlinear Mapping for Data Structure Analysis. IEEE Transactions on Com-
puters, C-18(5):401–409, 1969.
[92] Aaron Schumacher. Estimate manifold dimensionality with LID, 12 2020.
[93] Benjamin Schweinhart. Fractal dimension and the persistent homology of random geometric complexes.
Advances in Mathematics, 372:107291, 10 2020.
[94] Benjamin Schweinhart. Persistent Homology and the Upper Box Dimension. Discrete & Computational
Geometry, 65(2):331–364, 2021.
[95] Paulo Serra and Michel Mandjes. Dimension Estimation Using Random Connection Models. Journal
of Machine Learning Research, 18(138):1–35, 2017.
[96] Umut Simsekli, Ozan Sener, George Deligiannidis, and Murat A Erdogdu. Hausdorff dimension, heavy
tails, and generalization in neural networks. Advances in Neural Information Processing Systems,
33:5138–5151, 2020.
[97] Primoz Skraba and Katharine Turner. Wasserstein Stability for Persistence Diagrams.
arXiv:2006.16824, 7 2025.
[98] Jan Pawel Stanczuk, Georgios Batzolis, Teo Deveney, and Carola-Bibiane Schönlieb. Diffusion Models
Encode the Intrinsic Dimension of Data Manifolds. In Ruslan Salakhutdinov, Zico Kolter, Katherine
Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Proceedings of
the 41st International Conference on Machine Learning, pages 46412–46440. PMLR, 2024.
[99] J. Michael Steele. Growth Rates of Euclidean Minimal Spanning Trees with Power Weighted Edges.
The Annals of Probability, 16(4), 10 1988.
[100] Xing Sun and Andrew B Nobel. On the Size and Recovery of Submatrices of Ones in a Random Binary
Matrix. Journal of Machine Learning Research, 9(80):2431–2453, 2008.
[101] Oliver J Sutton, Qinghua Zhou, Alexander N Gorban, and Ivan Y Tyukin. Relative Intrinsic Di-
mensionality Is Intrinsic to Learning. In Lazaros Iliadis, Antonios Papaleonidas, Plamen Angelov,
and Chrisina Jayne, editors, Artificial Neural Networks and Machine Learning – ICANN 2023, volume
14254 of Lecture Notes in Computer Science, pages 516–529, Cham, 2023. Springer Nature Switzerland.
[102] Piotr Tempczyk, Rafal Michaluk, Lukasz Garncarek, Przemyslaw Spurek, Jacek Tabor, and Adam
Golinski. LIDL: Local Intrinsic Dimension Estimation Using Approximate Likelihood. In Kamalika
Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceed-
ings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine
Learning Research, pages 21205–21231. PMLR, 8 2022.
[103] Joshua B Tenenbaum, Vin de Silva, and John C Langford. A Global Geometric Framework for Non-
linear Dimensionality Reduction. Science, 290:2319–2323, 2000.
[104] G. V. Trunk. Statistical estimation of the intrinsic dimensionality of data collections. Information and
Control, 12(5):508–525, 5 1968.
[105] Eduard Tulchinskii, Kristian Kuznetsov, Laida Kushnareva, Daniil Cherniavskii, Sergey Nikolenko,
Evgeny Burnaev, Serguei Barannikov, and Irina Piontkovskaya. Intrinsic dimension estimation for
robust detection of ai-generated texts. Advances in Neural Information Processing Systems, 36:39257–
39276, 2023.
[106] Andrey Tychonoff. Ein Fixpunktsatz. Mathematische Annalen, 111:767–776, 1935.
[107] Laurens Van Der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine
Learning Research, 9, 2008.
[108] Jan van Mill. Infinite-Dimensional Topology. North-Holland, 1st edition, 1988.
[109] Peter J. Verveer and Robert P.W. Duin. An Evaluation of Intrinsic Dimensionality Estimators. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 17(1):81–86, 1995.
[110] Gregory P. Way, Michael Zietz, Vincent Rubinetti, Daniel S. Himmelstein, and Casey S. Greene.
Compressing gene expression data using multiple latent space dimensionalities learns complementary
biological representations. Genome Biology, 21(1), 5 2020.
[111] Simon Willerton. Heuristic and computer calculations for the magnitude of metric spaces.
arXiv:0910.5500, 10 2009.
[112] Xin Yang, Sebastien Michea, and Hongyuan Zha. Conical dimension as an intrisic dimension estimator
and its applications. In Chid Apte, David Skillicorn, Bing Liu, and Srinivasan Parthasarathy, editors,
Proceedings of the 2007 SIAM International Conference on Data Mining, pages 169–179, Minneapolis,
2007. SIAM.
[113] Eric Yeats, Cameron Darwin, Frank Liu, and Hai Li. Adversarial Estimation of Topological Dimension
with Harmonic Score Maps. arXiv:2312.06869, 12 2023.
[114] Joseph E. Yukich. Probability Theory of Classical Euclidean Optimization Problems. Springer Berlin,
Heidelberg, 1998.
[115] Wenlan Zang. Abstract Latent-Space Construction for Analyzing Large Genomic Data Sets. PhD
thesis, Yale University, 2021.
A Comparison of PH and KNN
Since PH0 and KNN are derived from the common theory of Euclidean functionals, and are similar in
construction, we highlight a comparison in their performance on the benchmark set of datasets. We first
discuss some theoretical advantages of PH0 . The key difference between the two estimators is that PH0 further
processes the distance information by using the minimum spanning tree (or zeroth dimensional persistent
homology) of the point set. The minimum spanning tree takes the global connectivity into account; in
comparison, KNN considers edge lengths along the knn graph, which is a much coarser organisation of the
connectivity of the point cloud compared to PH0 . One key benefit of PH0 is the stability, with respect to
perturbations of the points, conferred on the α-weight of the minimum spanning tree [97]. In addition, the
minimum spanning tree used by PH0 does not rely on assumptions about a suitable local neighbourhood size,
which is required as a hyperparameter in KNN in the construction of the knn graph.
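As an illustration of the construction, the sketch below computes the α-weight of the minimum spanning tree with SciPy and turns it into a dimension estimate by fitting the subsample scaling E_α(n) ≈ C n^((d−α)/d), in the spirit of [1, 93]; the exact fitting procedure used in our implementation may differ.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def mst_alpha_weight(points: np.ndarray, alpha: float) -> float:
    """Sum of the alpha-th powers of the Euclidean minimum spanning tree
    edge lengths (equivalently, of the PH0 death times)."""
    mst = minimum_spanning_tree(squareform(pdist(points)))
    return float(np.sum(mst.data ** alpha))

def ph0_dimension(points: np.ndarray, alpha: float = 0.5,
                  n_sizes: int = 8, rng=None) -> float:
    """Estimate d by fitting the scaling E_alpha(n) ~ n^((d - alpha)/d)
    over random subsamples of increasing size."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(points)
    sizes = np.unique(np.geomspace(n // 8, n, n_sizes).astype(int))
    weights = [mst_alpha_weight(points[rng.choice(n, s, replace=False)], alpha)
               for s in sizes]
    slope = np.polyfit(np.log(sizes), np.log(weights), 1)[0]
    return alpha / (1.0 - slope)

# 1000 points on S^2 should give an estimate near 2.
rng = np.random.default_rng(0)
pts = rng.normal(size=(1000, 3))
pts /= np.linalg.norm(pts, axis=1, keepdims=True)
print(ph0_dimension(pts, alpha=0.5, rng=rng))
```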
In our empirical observations on the benchmark dataset, while the mean estimates of both estimators
are comparable, we see that KNN is more susceptible to the randomness of the point sample, and can have
much larger variance, especially for higher dimensional datasets.
To investigate the effects of hyperparameter choice, in Table A1, we show how estimates of spheres for
different dimensions vary for different choices of α. In the main text, Figure 6 shows the performance of
KNN as k is varied on the benchmark dataset.
Table A1: Dimension estimates from PH0 , given samples of 1000 points on spheres of different dimensions,
for different choices of the power parameter α.
Figure A1: Performance of PH0 and KNN estimators on benchmark manifolds with 2500 samples. The
hyperparameters chosen here are fixed across all datasets, with α = 0.5 and k = 1 for PH0 and KNN
respectively. Both choices are the ones that minimise the median (taken across the benchmark manifolds)
absolute and relative error in the mean dimension estimate. The error bars indicate the interquartile range
and median dimension estimate. The red cross indicates the mean dimension estimate, and the dashed
line indicates the true dimension. We note that the KNN estimator consistently has greater variance when
compared to PH0 , and can output extreme outliers in high dimensional cases such as M10d Cubic.
Figure A2: Performance of PH0 and KNN estimators on benchmark manifolds with 2500 samples. The
hyperparameters chosen here differ across the datasets, and are chosen to be the ones that minimise the
difference between the mean dimension estimate and the ground truth. The error bars indicate the interquartile
range and median dimension estimate. The red cross indicates the mean dimension estimate, and the dashed
line indicates the true dimension. We note that the KNN estimator consistently has greater variance when
compared to PH0 , though it is often more accurate given the right hyperparameter choice.
B Finite size issues of magnitude dimension
Focussing on the practicalities, finite size issues can affect the inference of magnitude dimension from finite
samples, since for a finite sample X we have dimMag (X) = 0. Because |tX| → |X| as t → ∞, the slope of the line approaches zero. In
practice, while the range of t over which we fit the curve must be large enough to approximate the limit, it
cannot be so large that the finite size effect occurs. If there are too few points sampled from X, then the
finite size effect takes over before t can be large enough for the asymptotic behaviour to emerge. This means
the number of points may need to be quite large for the dimension to be read off the empirical curve log |tX|.
We demonstrate this in an experiment with uniform random samples from S d ⊂ Rd+1 . For example, for
d = 2, Figure B1 displays log t vs log |tX| for |X| = 625, 1250, 2500, 5000. For small t, the curves are identical
for different numbers of samples, yet as t increases and |tX| grows, the finite size effect takes hold and the
curves plateau at |tX| → |X|. In Table B1, we show the magnitude dimension estimates for d = 2, 4, 8, 16
and varying numbers of samples. Even for modest dimensions and a high number of samples, the dimension
estimates are far below the actual dimension, as the finite size effect prevents the emergence of the asymptotic
growth of the magnitude curve. The magnitude curves for higher dimensions are illustrated in Figures B2 and B3.
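The computation behind these curves is straightforward: the magnitude of the finite metric space tX is the sum of the entries of the inverse of the similarity matrix with entries exp(−t d(xi , xj )). The following Python sketch (our own minimal illustration) computes the empirical magnitude function for a sample on S 2 and its local log–log slopes, from which the plateau below the true dimension can be seen.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def magnitude(points: np.ndarray, t: float) -> float:
    """Magnitude |tX| of the scaled finite metric space tX: the sum of the
    entries of the inverse of the matrix Z_ij = exp(-t * d(x_i, x_j))."""
    Z = np.exp(-t * squareform(pdist(points)))
    return float(np.linalg.solve(Z, np.ones(len(points))).sum())

# Empirical magnitude function for a sample on S^2 and its local log-log
# slopes; the dimension is read off from the (approximately) linear region.
rng = np.random.default_rng(0)
sphere = rng.normal(size=(625, 3))
sphere /= np.linalg.norm(sphere, axis=1, keepdims=True)
ts = np.geomspace(0.1, 100, 40)
mags = np.array([magnitude(sphere, t) for t in ts])
slopes = np.gradient(np.log(mags), np.log(ts))
print(slopes.max())   # stays below 2 at this sample size (cf. Figure B1)
```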
Table B1: Magnitude dimension estimates of the dimension of spheres S m where m = 2, 4, 8, 16, given
N uniform i.i.d. samples. We note that, even for moderately high dimensions, the magnitude dimension
estimator can fail to recover the intrinsic dimension of the unit sphere for N = 5000, where other
estimators succeed.
[Figure B1 plot: left panel, |tX| against t on logarithmic axes with fitted slopes d = 1.91 (N = 625), 1.94 (1250), 1.96 (2500), 1.97 (5000); right panel, second derivative of the curve.]
Figure B1: Magnitude functions of random samples of N = 625, 1250, 2500, 5000 points on S2 ⊂ R3 . We
observe that, as N increases, the finite cap on |tX| arrives at a larger value of t, and a larger part of the
linear region of the curve is preserved. We use the slope of the linear part of the curve as the magnitude
dimension estimate. On the right we plot the magnitude of the second derivative of the curve, approximated
by finite difference. The linear portion of the curve is selected to be the part of the curve whose second
derivative lies below the threshold value indicated in the right hand panel.
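The selection of the linear region described in the caption can be sketched as follows (a literal reading of the rule above rather than the exact implementation used for the figures; the threshold, and the optional cap `t_max` excluding the finite-size plateau, are free parameters of ours):

```python
import numpy as np

def magnitude_dimension_estimate(ts, mags, threshold=0.5, t_max=np.inf):
    """Slope of log|tX| against log t on the low-curvature part of the curve."""
    d2 = np.gradient(np.gradient(mags, ts), ts)        # finite-difference d^2|tX|/dt^2
    mask = (np.abs(d2) < threshold) & (ts <= t_max)    # 'linear' part of the magnitude curve
    slope, _ = np.polyfit(np.log(ts[mask]), np.log(mags[mask]), 1)
    return slope

# e.g. magnitude_dimension_estimate(ts, mags) for the S^2 sample in the previous sketch
```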
[Figure B2: left panel, log–log plot of |tX| against t for N = 625, 1250, 2500, 5000 with fitted slopes d = 2.92, 3.15, 3.33, 3.47 respectively; right panel, the second derivative d^2|tX|/dt^2 against t with the selection threshold.]
Figure B2: Magnitude functions of random samples of N = 625, 1250, 2500, 5000 points on $S^4 \subset \mathbb{R}^5$. We
observe that, as N increases, the finite cap on |tX| arrives at a larger value of t, and a larger part of the
linear region of the curve is preserved. We use the slope of the linear part of the curve as the magnitude
dimension estimate. On the right we plot the magnitude of the second derivative of the curve, approximated
by finite difference. The linear portion of the curve is selected to be the part of the curve whose second
derivative lies below the threshold value indicated in the right hand panel.
[Figure B3: left panel, log–log plot of |tX| against t for N = 625, 1250, 2500, 5000 with fitted slopes d = 3.79, 4.33, 4.82, 5.34 respectively; right panel, the second derivative d^2|tX|/dt^2 against t with the selection threshold.]
Figure B3: Magnitude functions of random samples of N = 625, 1250, 2500, 5000 points on $S^{16} \subset \mathbb{R}^{17}$. We
observe that, as N increases, the finite cap on |tX| arrives at a larger value of t, and a larger part of the
linear region of the curve is preserved. We use the slope of the linear part of the curve as the magnitude
dimension estimate. On the right we plot the magnitude of the second derivative of the curve, approximated
by finite difference. The linear portion of the curve is selected to be the part of the curve whose second
derivative lies below the threshold value indicated in the right hand panel.
C Performance of Estimators on Benchmark Datasets
We assess the estimators with the following experimental procedure. Let $\mathcal{M}$ be the set of benchmark manifolds, $\mathcal{E}$ the set of estimators, and $\mathcal{H}_E$ the set of hyperparameters for some estimator $E \in \mathcal{E}$. For each triplet $(M, E, H)$ representing a manifold, an estimator and a hyperparameter choice, we evaluated the performance of the estimator over 20 randomly sampled point sets from $M$. We record the empirical mean, which we denote by $\hat{d}(E, M, H)$, and the standard deviation of the dimension estimates. For each dataset
type in the list of benchmark datasets, we varied the number of samples from 625, 1250, 2500, and 5000,
examining the performance as the number of points is successively doubled.
We varied the choice of hyperparameters over a range specified in Appendix D. Across the local estimators
we varied the number of nearest neighbours. We aggregate the performance across hyperparameters in the
tables in this appendix, where we show dimension estimates for three different types of hyperparameter choice:
1. The hyperparameter that minimises the difference between the estimated and the intrinsic dimension of that particular dataset, giving $\hat{d}_{\mathrm{best}}$.
2. The performance of the estimator with a fixed choice of hyperparameter; this is either
(a) the hyperparameter that minimises the median absolute error across the benchmark manifolds, $H_{\mathrm{abs}} = \operatorname*{arg\,min}_{H \in \mathcal{H}_E} \operatorname{median}_{M \in \mathcal{M}} \big|\hat{d}(E, M, H) - \dim M\big|$, giving $\hat{d}_{\mathrm{abs}}$; or
(b) the one that minimises the median relative error across the benchmark manifolds, $H_{\mathrm{rel}} = \operatorname*{arg\,min}_{H \in \mathcal{H}_E} \operatorname{median}_{M \in \mathcal{M}} \big|\hat{d}(E, M, H) - \dim M\big| / \dim M$, giving $\hat{d}_{\mathrm{rel}}$.
The hyperparameter choices $H_{\mathrm{abs}}$ and $H_{\mathrm{rel}}$ guarantee reasonable performance of the estimator on most of the datasets in the benchmark. While a choice of hyperparameter might be optimal for a certain dataset, the same hyperparameter may lead to poor performance on another. The difference between $\hat{d}_{\mathrm{best}}$ and $\hat{d}_{\mathrm{abs}}$ or $\hat{d}_{\mathrm{rel}}$ gives an indication of how dependent the estimator is on the hyperparameter choice in order to perform well on a particular dataset.
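The choice of $H_{\mathrm{best}}$, $H_{\mathrm{abs}}$ and $H_{\mathrm{rel}}$ can be summarised in the following short sketch (illustrative only; the dictionary layout and the function name are ours, not the evaluation code behind the tables):

```python
import numpy as np

def select_hyperparameters(d_hat, true_dim):
    """d_hat[M][H]: mean estimate for manifold M under hyperparameter H; true_dim[M]: intrinsic dimension."""
    manifolds = list(d_hat)
    hypers = list(next(iter(d_hat.values())))   # assumes a common hyperparameter grid across datasets
    # Per-dataset best hyperparameter (the 'Best' columns).
    H_best = {M: min(d_hat[M], key=lambda H: abs(d_hat[M][H] - true_dim[M])) for M in manifolds}
    # Single hyperparameters minimising the median absolute / relative error ('med abs' / 'med rel' columns).
    med_abs = {H: np.median([abs(d_hat[M][H] - true_dim[M]) for M in manifolds]) for H in hypers}
    med_rel = {H: np.median([abs(d_hat[M][H] - true_dim[M]) / true_dim[M] for M in manifolds]) for H in hypers}
    return H_best, min(med_abs, key=med_abs.get), min(med_rel, key=med_rel.get)
```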
lPCA
Samples 625 1250 2500 5000
Dataset d n Best med abs med rel Best med abs med rel Best med abs med rel Best med abs med rel
M1 Sphere 10 11 10.0 (0.0) 9.8 (0.1) 11.0 (0.0) 10.0 (0.0) 10.8 (0.0) 11.0 (0.0) 10.0 (0.0) 11.0 (0.0) 11.0 (0.0) 10.0 (0.0) 10.0 (0.0) 10.0 (0.0)
M2 Affine 3 5 3.0 (0.0) 3.0 (0.0) 3.0 (0.0) 3.0 (0.0) 3.0 (0.0) 3.0 (0.0) 3.0 (0.0) 3.0 (0.0) 3.0 (0.0) 3.0 (0.0) 3.0 (0.0) 3.0 (0.0)
M3 Nonlinear 4 6 4.0 (0.0) 4.0 (0.1) 5.0 (0.0) 4.0 (0.0) 5.0 (0.0) 5.0 (0.0) 4.0 (0.0) 5.0 (0.0) 5.0 (0.0) 4.0 (0.0) 5.0 (0.0) 5.0 (0.0)
M4 Nonlinear 4 8 4.0 (0.0) 4.6 (0.2) 7.0 (0.0) 4.0 (0.0) 6.6 (0.0) 7.0 (0.0) 4.0 (0.0) 7.0 (0.0) 7.0 (0.0) 4.0 (0.0) 6.0 (0.0) 6.0 (0.0)
M5a Helix1d 1 3 1.0 (0.0) 1.1 (0.0) 3.0 (0.0) 1.0 (0.0) 1.4 (0.0) 1.0 (0.0) 1.0 (0.0) 1.0 (0.0) 1.0 (0.0) 1.0 (0.0) 1.0 (0.0) 1.0 (0.0)
M5b Helix2d 2 3 2.0 (0.0) 2.9 (0.0) 3.0 (0.0) 2.0 (0.0) 3.0 (0.0) 3.0 (0.0) 2.0 (0.0) 3.0 (0.0) 3.0 (0.0) 2.0 (0.0) 3.0 (0.0) 3.0 (0.0)
M6 Nonlinear 6 36 6.0 (0.0) 6.0 (0.3) 11.0 (0.0) 6.0 (0.0) 11.0 (0.1) 11.0 (0.0) 6.0 (0.0) 12.0 (0.2) 12.0 (0.2) 6.0 (0.0) 11.0 (0.0) 11.0 (0.0)
M7 Roll 2 3 2.0 (0.0) 2.0 (0.0) 2.4 (0.5) 2.0 (0.0) 2.1 (0.0) 2.0 (0.0) 2.0 (0.0) 2.0 (0.0) 2.0 (0.0) 2.0 (0.0) 2.0 (0.0) 2.0 (0.0)
M8 Nonlinear 12 72 12.0 (0.0) 7.5 (0.4) 20.4 (0.5) 12.0 (0.0) 20.1 (0.1) 20.0 (0.0) 12.0 (0.0) 24.0 (0.2) 24.0 (0.2) 12.0 (0.0) 23.0 (0.0) 23.0 (0.0)
M9 Affine 20 20 20.0 (0.0) 10.4 (0.4) 19.0 (0.0) 20.0 (0.0) 19.3 (0.0) 19.0 (0.0) 20.0 (0.0) 20.0 (0.0) 20.0 (0.0) 20.0 (0.0) 20.0 (0.0) 20.0 (0.0)
M10a Cubic 10 11 10.0 (0.0) 9.0 (0.1) 11.0 (0.0) 10.0 (0.0) 11.0 (0.0) 11.0 (0.0) 10.0 (0.0) 11.0 (0.0) 11.0 (0.0) 10.0 (0.0) 11.0 (0.0) 11.0 (0.0)
M10b Cubic 17 18 16.5 (0.1) 11.0 (0.3) 18.0 (0.0) 16.4 (0.1) 17.8 (0.0) 18.0 (0.0) 17.6 (0.0) 18.0 (0.0) 18.0 (0.0) 17.6 (0.0) 18.0 (0.0) 18.0 (0.0)
M10c Cubic 24 25 24.0 (0.1) 11.6 (0.4) 23.0 (0.0) 24.0 (0.1) 22.6 (0.0) 23.0 (0.0) 23.9 (0.0) 25.0 (0.0) 25.0 (0.0) 23.7 (0.0) 25.0 (0.0) 25.0 (0.0)
M10d Cubic 70 72 55.2 (0.1) 11.6 (0.6) 37.0 (0.0) 55.2 (0.0) 37.4 (0.0) 37.0 (0.0) 55.2 (0.0) 54.2 (0.4) 54.2 (0.4) 55.2 (0.0) 55.0 (0.0) 55.0 (0.0)
M11 Moebius 2 3 2.0 (0.0) 2.4 (0.0) 3.0 (0.0) 2.0 (0.0) 2.9 (0.0) 3.0 (0.0) 2.0 (0.0) 3.0 (0.0) 3.0 (0.0) 2.0 (0.0) 2.0 (0.0) 2.0 (0.0)
M12 Norm 20 20 20.0 (0.0) 6.9 (0.3) 19.0 (0.0) 20.0 (0.0) 19.3 (0.0) 19.0 (0.0) 20.0 (0.0) 20.0 (0.0) 20.0 (0.0) 20.0 (0.0) 20.0 (0.0) 20.0 (0.0)
M13a Scurve 2 3 2.0 (0.0) 2.0 (0.0) 2.0 (0.0) 2.0 (0.0) 2.0 (0.0) 2.0 (0.0) 2.0 (0.0) 2.0 (0.0) 2.0 (0.0) 2.0 (0.0) 2.0 (0.0) 2.0 (0.0)
M13b Spiral 1 13 1.0 (0.0) 2.0 (0.0) 2.0 (0.0) 1.0 (0.0) 2.0 (0.0) 2.0 (0.0) 1.0 (0.0) 2.0 (0.0) 2.0 (0.0) 1.0 (0.0) 2.0 (0.0) 2.0 (0.0)
Mbeta 10 40 10.0 (0.0) 4.5 (0.2) 9.0 (0.0) 10.0 (0.0) 9.0 (0.1) 9.0 (0.0) 10.0 (0.0) 10.0 (0.0) 10.0 (0.0) 10.0 (0.0) 10.0 (0.0) 10.0 (0.0)
Mn1 Nonlinear 18 72 18.0 (0.0) 8.6 (0.5) 18.0 (0.0) 18.0 (0.0) 17.9 (0.0) 18.0 (0.0) 18.0 (0.0) 18.2 (0.4) 18.2 (0.4) 18.0 (0.0) 18.0 (0.0) 18.0 (0.0)
Mn2 Nonlinear 24 96 24.0 (0.0) 8.9 (0.5) 22.9 (0.3) 24.0 (0.0) 22.5 (0.0) 22.5 (0.5) 24.0 (0.0) 24.0 (0.0) 24.0 (0.0) 24.0 (0.0) 24.0 (0.0) 24.0 (0.0)
Mp1 Paraboloid 3 12 3.0 (0.0) 2.9 (0.0) 3.0 (0.0) 3.0 (0.0) 3.0 (0.0) 3.0 (0.0) 3.0 (0.0) 3.0 (0.0) 3.0 (0.0) 3.0 (0.0) 3.0 (0.0) 3.0 (0.0)
Mp2 Paraboloid 6 21 6.0 (0.0) 5.1 (0.1) 6.0 (0.0) 6.0 (0.0) 5.9 (0.0) 6.0 (0.0) 6.0 (0.0) 6.0 (0.0) 6.0 (0.0) 6.0 (0.0) 6.0 (0.0) 6.0 (0.0)
Mp3 Paraboloid 9 30 9.0 (0.0) 6.8 (0.2) 9.0 (0.0) 9.0 (0.0) 8.4 (0.0) 9.0 (0.0) 9.0 (0.0) 9.0 (0.0) 9.0 (0.0) 9.0 (0.0) 9.0 (0.0) 9.0 (0.0)
Table C1: Number of samples. With few samples, estimates can be sensitive to hyperparameter choices, as in the case of M10b,c,d Cubic and M12 Norm. However, the estimates become more accurate on more datasets as the number of samples increases. High dimensional datasets. With many samples, the only high dimensional dataset that troubles lPCA is M10d Cubic, which it significantly underestimates. Variance. lPCA has the lowest variance among all estimators. Note that the aggregation method over local dimension estimates is often the median in this table (see Table D2), which tends to produce an integer value; this effectively ‘rounded’ aggregation also reduces the variance. Hyperparameter dependency. While this is an issue for some datasets when there are few samples, with many samples the Best, med abs, and med rel estimates are close. Notable exceptions include some non-linear datasets, such as M3, M4, M6, and M8 Nonlinear, and M5b Helix2d. Nonlinear datasets. With many point samples, lPCA performs well on Mn1,2 Nonlinear and the Mp1,2,3 Paraboloids, which many other estimators struggle with. However, some estimators perform much more consistently with regard to hyperparameter sensitivity on the other nonlinear datasets M3, M4, M6, M8 Nonlinear, and M13b Spiral.
PH
Samples 625 1250 2500 5000
Dataset d n Best med abs med rel Best med abs med rel Best med abs med rel Best med abs med rel
M1Sphere 10 11 9.2 (0.47) 9.2 (0.47) 9.2 (0.47) 9.2 (0.32) 9.2 (0.32) 9.2 (0.32) 9.4 (0.16) 9.4 (0.16) 9.4 (0.16) 9.5 (0.12) 9.5 (0.12) 9.5 (0.12)
M2Affine3to5 3 5 2.9 (0.1) 2.9 (0.1) 2.9 (0.1) 2.9 (0.07) 2.9 (0.07) 2.9 (0.07) 2.9 (0.05) 2.9 (0.05) 2.9 (0.05) 2.9 (0.03) 2.9 (0.03) 2.9 (0.03)
M3Nonlinear4to6 4 6 3.8 (0.18) 3.8 (0.18) 3.8 (0.18) 3.9 (0.12) 3.9 (0.12) 3.9 (0.12) 3.9 (0.07) 3.9 (0.07) 3.9 (0.07) 3.9 (0.08) 3.9 (0.08) 3.9 (0.08)
M4Nonlinear 4 8 4.0 (0.13) 4.0 (0.13) 4.0 (0.13) 4.0 (0.1) 4.0 (0.1) 4.0 (0.1) 3.9 (0.09) 3.9 (0.09) 3.9 (0.09) 3.9 (0.07) 3.9 (0.07) 3.9 (0.07)
M5aHelix1d 1 3 1.0 (0.02) 1.0 (0.02) 1.0 (0.02) 1.0 (0.01) 1.0 (0.01) 1.0 (0.01) 1.0 (0.01) 1.0 (0.01) 1.0 (0.01) 1.0 (0.0) 1.0 (0.0) 1.0 (0.0)
M5bHelix2d 2 3 2.8 (0.12) 2.8 (0.12) 2.8 (0.12) 2.6 (0.15) 2.6 (0.15) 2.6 (0.15) 2.3 (0.09) 2.3 (0.09) 2.3 (0.09) 2.0 (0.03) 2.0 (0.03) 2.0 (0.03)
M6Nonlinear 6 36 6.6 (0.32) 6.6 (0.32) 6.6 (0.32) 6.3 (0.3) 6.3 (0.3) 6.3 (0.3) 6.1 (0.15) 6.1 (0.15) 6.1 (0.15) 6.0 (0.11) 6.0 (0.11) 6.0 (0.11)
M7Roll 2 3 2.0 (0.06) 2.0 (0.06) 2.0 (0.06) 2.0 (0.05) 2.0 (0.05) 2.0 (0.05) 2.0 (0.03) 2.0 (0.03) 2.0 (0.03) 2.0 (0.01) 2.0 (0.01) 2.0 (0.01)
M8Nonlinear 12 72 14.1 (0.82) 14.1 (0.82) 14.1 (0.82) 14.1 (0.57) 14.1 (0.57) 14.1 (0.57) 13.8 (0.41) 13.8 (0.41) 13.8 (0.41) 13.6 (0.31) 13.6 (0.31) 13.6 (0.31)
M9Affine 20 20 15.6 (0.79) 15.6 (0.79) 15.6 (0.79) 15.3 (0.44) 15.3 (0.44) 15.3 (0.44) 15.3 (0.39) 15.3 (0.39) 15.3 (0.39) 15.7 (0.2) 15.7 (0.2) 15.7 (0.2)
M10aCubic 10 11 9.0 (0.4) 9.0 (0.4) 9.0 (0.4) 9.0 (0.27) 9.0 (0.27) 9.0 (0.27) 9.2 (0.18) 9.2 (0.18) 9.2 (0.18) 9.2 (0.12) 9.2 (0.12) 9.2 (0.12)
M10bCubic 17 18 13.7 (0.66) 13.7 (0.66) 13.7 (0.66) 13.7 (0.47) 13.7 (0.47) 13.7 (0.47) 14.1 (0.45) 14.1 (0.45) 14.1 (0.45) 14.2 (0.28) 14.2 (0.28) 14.2 (0.28)
M10cCubic 24 25 18.0 (0.82) 18.0 (0.82) 18.0 (0.82) 18.2 (0.58) 18.2 (0.58) 18.2 (0.58) 18.5 (0.5) 18.5 (0.5) 18.5 (0.5) 18.7 (0.34) 18.7 (0.34) 18.7 (0.34)
M10dCubic 70 72 40.3 (3.26) 40.3 (3.26) 40.3 (3.26) 39.6 (2.05) 39.6 (2.05) 39.6 (2.05) 40.3 (1.4) 40.3 (1.4) 40.3 (1.4) 41.6 (1.02) 41.6 (1.02) 41.6 (1.02)
M11Moebius 2 3 2.0 (0.06) 2.0 (0.06) 2.0 (0.06) 2.0 (0.05) 2.0 (0.05) 2.0 (0.05) 2.0 (0.03) 2.0 (0.03) 2.0 (0.03) 2.0 (0.02) 2.0 (0.02) 2.0 (0.02)
M12Norm 20 20 16.5 (0.87) 16.5 (0.87) 16.5 (0.87) 16.9 (0.78) 16.9 (0.78) 16.9 (0.78) 17.1 (0.53) 17.1 (0.53) 17.1 (0.53) 17.4 (0.36) 17.4 (0.36) 17.4 (0.36)
M13aScurve 2 3 2.0 (0.06) 2.0 (0.06) 2.0 (0.06) 2.0 (0.04) 2.0 (0.04) 2.0 (0.04) 2.0 (0.03) 2.0 (0.03) 2.0 (0.03) 2.0 (0.02) 2.0 (0.02) 2.0 (0.02)
M13bSpiral 1 13 1.8 (0.19) 1.8 (0.19) 1.8 (0.19) 1.5 (0.17) 1.5 (0.17) 1.5 (0.17) 1.1 (0.05) 1.1 (0.05) 1.1 (0.05) 1.0 (0.01) 1.0 (0.01) 1.0 (0.01)
Mbeta 10 40 6.3 (0.52) 6.3 (0.52) 6.3 (0.52) 6.4 (0.41) 6.4 (0.41) 6.4 (0.41) 6.5 (0.32) 6.5 (0.32) 6.5 (0.32) 6.7 (0.25) 6.7 (0.25) 6.7 (0.25)
Mn1Nonlinear 18 72 14.1 (0.75) 14.1 (0.75) 14.1 (0.75) 14.2 (0.48) 14.2 (0.48) 14.2 (0.48) 14.5 (0.45) 14.5 (0.45) 14.5 (0.45) 14.5 (0.3) 14.5 (0.3) 14.5 (0.3)
Mn2Nonlinear 24 96 18.0 (1.15) 18.0 (1.15) 18.0 (1.15) 18.3 (0.74) 18.3 (0.74) 18.3 (0.74) 18.4 (0.52) 18.4 (0.52) 18.4 (0.52) 18.6 (0.43) 18.6 (0.43) 18.6 (0.43)
Mp1Paraboloid 3 12 2.8 (0.11) 2.8 (0.11) 2.8 (0.11) 2.9 (0.09) 2.9 (0.09) 2.9 (0.09) 2.9 (0.07) 2.9 (0.07) 2.9 (0.07) 2.9 (0.05) 2.9 (0.05) 2.9 (0.05)
Mp2Paraboloid 6 21 4.7 (0.31) 4.7 (0.31) 4.7 (0.31) 5.0 (0.2) 5.0 (0.2) 5.0 (0.2) 5.3 (0.17) 5.3 (0.17) 5.3 (0.17) 5.4 (0.11) 5.4 (0.11) 5.4 (0.11)
Mp3Paraboloid 9 30 5.3 (0.68) 5.3 (0.68) 5.3 (0.68) 6.3 (0.52) 6.3 (0.52) 6.3 (0.52) 6.9 (0.35) 6.9 (0.35) 6.9 (0.35) 7.2 (0.16) 7.2 (0.16) 7.2 (0.16)
Table C2: Number of samples. PH has good performance for low dimensional manifolds even with few points. High dimensional datasets. PH is prone to underestimate for high dimensional datasets, such as M9 Affine and M10b,c,d Cubic, even with many samples in the benchmark. Variance. The variance between different point samples is relatively high, especially for high dimensional datasets with few samples, such as M10d Cubic. See Theorem 2.6 for a discussion of this. Hyperparameter dependency. Across the different point sampling regimes, and all three different ways of reporting dimension estimates, the choice of α = 0.5 out of α ∈ {0.5, 1.0, 1.5, 2.0} consistently produces the best results. Nonlinear datasets. On Mbeta and Mn1,2 Nonlinear, the propensity to underestimate remains even after increasing the number of samples.
KNN
Samples 625 1250 2500 5000
Dataset d n Best med abs med rel Best med abs med rel Best med abs med rel Best med abs med rel
M1 Sphere 10 11 9.6 (1.4) 13.2 (13.7) 9.5 (2.0) 9.9 (1.9) 9.9 (1.9) 9.9 (1.9) 9.8 (1.4) 9.8 (1.4) 9.8 (1.4) 9.7 (0.8) 9.7 (0.8) 9.5 (0.7)
M2 Affine 3to5 3 5 2.9 (0.3) 2.9 (0.3) 2.8 (0.2) 2.9 (0.2) 2.9 (0.2) 2.9 (0.2) 2.9 (0.1) 2.9 (0.1) 2.9 (0.1) 2.9 (0.1) 2.9 (0.1) 2.9 (0.1)
M3 Nonlinear 4to6 4 6 3.8 (0.4) 3.8 (0.4) 3.5 (0.2) 3.8 (0.4) 3.8 (0.4) 3.8 (0.4) 3.9 (0.2) 3.9 (0.2) 3.9 (0.2) 3.8 (0.2) 3.7 (0.1) 3.8 (0.2)
M4 Nonlinear 4 8 4.0 (0.2) 3.9 (0.6) 3.7 (0.3) 4.0 (0.1) 3.9 (0.3) 3.9 (0.3) 3.9 (0.3) 3.9 (0.2) 3.9 (0.2) 3.8 (0.2) 3.7 (0.1) 3.8 (0.1)
M5a Helix1d 1 3 1.0 (0.0) 1.0 (0.1) 1.0 (0.1) 1.0 (0.0) 1.0 (0.1) 1.0 (0.1) 1.0 (0.0) 1.0 (0.0) 1.0 (0.0) 1.0 (0.0) 1.0 (0.0) 1.0 (0.0)
M5b Helix2d 2 3 2.2 (0.1) 2.6 (0.2) 2.6 (0.2) 2.3 (0.1) 2.3 (0.2) 2.3 (0.2) 2.1 (0.1) 2.1 (0.1) 2.1 (0.1) 2.0 (0.1) 2.1 (0.0) 2.0 (0.1)
M6 Nonlinear 6 36 6.0 (0.5) 6.2 (1.1) 5.9 (0.7) 6.0 (0.2) 5.6 (0.5) 5.6 (0.5) 6.0 (0.2) 5.8 (0.4) 5.8 (0.4) 5.9 (0.1) 5.3 (0.2) 5.5 (0.2)
M7 Roll 2 3 2.0 (0.1) 2.0 (0.2) 1.9 (0.1) 2.0 (0.1) 2.0 (0.1) 2.0 (0.1) 2.0 (0.1) 2.0 (0.1) 2.0 (0.1) 2.0 (0.0) 2.0 (0.0) 2.0 (0.1)
M8 Nonlinear 12 72 13.7 (2.9) 14.6 (7.1) 14.4 (3.1) 13.7 (1.4) 13.7 (2.7) 13.7 (2.7) 12.3 (1.3) 12.3 (1.3) 12.3 (1.3) 12.3 (0.6) 12.3 (0.6) 12.7 (0.8)
M9 Affine 20 20 20.0 (2.5) 23.8 (21.1) 18.2 (4.8) 20.1 (1.5) 16.3 (4.6) 16.3 (4.6) 18.1 (0.8) 15.5 (2.7) 15.5 (2.7) 16.9 (0.5) 15.2 (1.1) 15.5 (1.6)
M10a Cubic 10 11 9.7 (2.8) 9.7 (2.8) 8.8 (1.5) 9.4 (1.4) 9.4 (1.4) 9.4 (1.4) 9.0 (1.0) 9.0 (1.0) 9.0 (1.0) 8.9 (0.7) 8.5 (0.4) 8.7 (0.6)
M10b Cubic 17 18 17.1 (2.5) 18.2 (9.4) 14.3 (3.0) 16.0 (0.9) 12.8 (1.9) 12.8 (1.9) 15.1 (0.5) 14.4 (2.8) 14.4 (2.8) 14.3 (0.4) 13.3 (0.9) 13.3 (1.2)
M10c Cubic 24 25 24.0 (5.6) 30.7 (29.0) 22.0 (7.6) 24.0 (4.2) 19.3 (6.0) 19.3 (6.0) 23.5 (1.6) 19.8 (6.8) 19.8 (6.8) 21.9 (1.2) 18.7 (1.5) 19.3 (1.5)
M10d Cubic 70 72 96.8 (323.5) 166.3 (667.8) 96.8 (323.5) 65.7 (1209.4) -313.5 (1667.0) -313.5 (1667.0) 68.5 (768.6) -31.8 (471.5) -31.8 (471.5) 68.4 (120.4) 92.0 (36.6) 67.8 (28.7)
M11 Moebius 2 3 2.0 (0.1) 1.9 (0.1) 1.9 (0.1) 2.0 (0.1) 2.0 (0.2) 2.0 (0.2) 2.0 (0.0) 2.0 (0.1) 2.0 (0.1) 2.0 (0.1) 1.9 (0.0) 2.0 (0.1)
M12 Norm 20 20 19.8 (4.3) 17.3 (4.2) 19.8 (4.3) 20.1 (2.5) 19.0 (5.1) 19.0 (5.1) 20.0 (1.1) 19.1 (3.1) 19.1 (3.1) 20.0 (1.0) 18.8 (1.7) 17.8 (1.7)
M13a Scurve 2 3 2.0 (0.2) 2.0 (0.2) 2.0 (0.1) 2.0 (0.1) 1.9 (0.1) 1.9 (0.1) 2.0 (0.1) 2.0 (0.1) 2.0 (0.1) 2.0 (0.1) 2.0 (0.0) 2.0 (0.1)
M13b Spiral 1 13 1.3 (0.1) 1.3 (0.1) 1.9 (0.1) 1.0 (0.1) 1.0 (0.1) 1.0 (0.1) 1.0 (0.0) 1.0 (0.0) 1.0 (0.0) 1.0 (0.0) 1.0 (0.0) 1.0 (0.0)
Mbeta 10 40 6.4 (0.8) 6.4 (1.2) 5.8 (0.7) 6.2 (0.9) 6.2 (0.9) 6.2 (0.9) 6.6 (0.7) 6.6 (0.7) 6.6 (0.7) 6.6 (0.4) 6.1 (0.2) 6.3 (0.4)
Mn1 Nonlinear 18 72 18.0 (3.0) 18.2 (8.2) 13.7 (2.7) 17.0 (1.7) 13.7 (2.0) 13.7 (2.0) 15.6 (0.7) 15.5 (3.6) 15.5 (3.6) 15.0 (0.5) 13.9 (1.3) 14.0 (1.3)
Mn2 Nonlinear 24 96 23.8 (8.0) 20.5 (8.4) 25.3 (11.8) 24.0 (3.1) 19.8 (5.0) 19.8 (5.0) 23.3 (1.3) 18.3 (2.7) 18.3 (2.7) 21.0 (0.7) 18.6 (1.5) 19.5 (2.5)
Mp1 Paraboloid 3 12 2.9 (0.3) 2.9 (0.3) 2.8 (0.2) 2.9 (0.2) 2.9 (0.2) 2.9 (0.2) 2.9 (0.2) 2.9 (0.2) 2.9 (0.2) 3.0 (0.1) 2.9 (0.1) 2.9 (0.1)
Mp2 Paraboloid 6 21 5.2 (0.9) 5.2 (0.9) 4.9 (0.6) 5.1 (0.7) 5.1 (0.7) 5.1 (0.7) 5.3 (0.4) 5.3 (0.4) 5.3 (0.4) 5.4 (0.3) 5.1 (0.2) 5.2 (0.2)
Mp3 Paraboloid 9 30 6.9 (1.4) 6.9 (1.4) 5.6 (0.7) 7.0 (0.7) 7.0 (0.7) 7.0 (0.7) 7.3 (0.7) 7.3 (0.7) 7.3 (0.7) 7.5 (0.6) 7.1 (0.4) 7.2 (0.4)
Table C3: Number of samples. KNN has good performance for low dimensional manifolds even with few points. High dimensional datasets. For high dimensional datasets such as the M10 Cubics, the error in the estimated dimension can be very high, and the estimated dimension can even be negative (see the comment in Theorem 2.6). Variance. For the same reasons outlined in Theorem 2.6, the variance can be high for high dimensional datasets, such as the M10 Cubics. Hyperparameter dependency. The differences between different hyperparameter choices are small for most datasets, apart from high dimensional ones. Nonlinear datasets. KNN tends to underestimate on the nonlinear datasets, such as Mp2,3 Paraboloids and Mn1,2 Nonlinear.
DanCo
Samples 625 1250 2500 5000
Dataset d n Best med abs med rel Best med abs med rel Best med abs med rel Best med abs med rel
M1Sphere 10 11 10.7 (0.43) 10.7 (0.43) 10.8 (0.35) 10.7 (0.41) 11.0 (0.21) 10.9 (0.23) 10.9 (0.34) 11.0 (0.0) 11.0 (0.0) 10.5 (1.96) 11.0 (0.0) 11.0 (0.0)
M2Affine3to5 3 5 3.0 (0.1) 2.9 (0.08) 2.8 (0.08) 3.0 (0.07) 2.9 (0.06) 2.9 (0.08) 3.0 (0.04) 2.9 (0.05) 2.9 (0.05) 3.0 (0.04) 2.9 (0.04) 2.9 (0.02)
M3Nonlinear4to6 4 6 4.3 (0.07) 4.8 (0.31) 4.3 (0.07) 4.6 (0.18) 5.0 (0.39) 4.7 (0.26) 3.5 (1.65) 5.0 (0.2) 5.0 (0.2) 4.8 (0.19) 5.2 (0.33) 4.8 (0.19)
M4Nonlinear 4 8 4.5 (2.54) 5.1 (0.25) 4.8 (0.2) 3.6 (2.55) 5.1 (0.27) 4.9 (0.16) 4.0 (2.47) 5.5 (0.29) 5.5 (0.29) 3.8 (1.92) 5.4 (0.23) 5.1 (0.13)
M5aHelix1d 1 3 1.0 (0.0) 1.0 (0.0) 1.0 (0.0) 1.0 (0.0) 1.0 (0.0) 1.0 (0.0) 1.0 (0.0) 1.0 (0.0) 1.0 (0.0) 1.0 (0.0) 1.0 (0.0) 1.0 (0.0)
M5bHelix2d 2 3 1.9 (0.91) 3.0 (0.0) 3.0 (0.0) 2.1 (0.92) 3.0 (0.0) 3.0 (0.0) 1.9 (0.84) 3.0 (0.0) 3.0 (0.0) 2.1 (1.01) 3.4 (0.04) 3.4 (0.46)
M6Nonlinear 6 36 7.7 (0.33) 7.7 (0.33) 8.1 (0.22) 7.3 (0.3) 7.4 (0.34) 7.4 (0.3) 7.1 (0.05) 7.4 (0.22) 7.4 (0.22) 7.0 (0.02) 7.7 (0.22) 7.1 (0.11)
M7Roll 2 3 2.1 (0.78) 2.3 (0.02) 3.5 (0.14) 2.2 (0.02) 2.3 (0.02) 3.1 (0.03) 2.2 (0.01) 2.3 (0.01) 2.3 (0.01) 2.2 (0.01) 2.2 (0.01) 3.0 (0.24)
M8Nonlinear 12 72 18.0 (0.73) 18.0 (0.73) 18.3 (0.83) 17.4 (4.0) 17.8 (0.56) 17.9 (0.3) 12.4 (0.56) 17.3 (0.53) 17.3 (0.53) 12.3 (0.45) 16.9 (0.21) 17.0 (0.21)
M9Affine 20 20 19.7 (0.46) 19.6 (0.58) 19.3 (0.45) 20.0 (0.22) 19.5 (0.49) 18.5 (3.81) 19.9 (0.3) 19.3 (0.46) 19.3 (0.46) 19.9 (0.3) 18.2 (3.72) 19.1 (0.22)
M10aCubic 10 11 10.1 (0.3) 10.4 (0.48) 10.3 (0.45) 10.1 (0.29) 10.1 (0.29) 10.2 (0.35) 10.0 (0.02) 10.1 (0.29) 10.1 (0.29) 10.0 (0.01) 10.1 (0.29) 10.0 (0.01)
M10bCubic 17 18 17.2 (0.59) 17.4 (0.49) 17.2 (0.59) 17.0 (0.21) 17.2 (0.47) 17.0 (0.21) 17.0 (0.01) 17.2 (0.36) 17.2 (0.36) 17.0 (0.01) 17.0 (0.01) 17.0 (0.01)
M10cCubic 24 25 23.9 (0.62) 24.6 (0.5) 23.9 (0.62) 23.9 (0.5) 24.2 (0.57) 24.1 (0.22) 24.0 (0.0) 24.0 (0.44) 24.0 (0.44) 24.0 (0.0) 24.0 (0.0) 24.0 (0.0)
M10dCubic 70 72 70.6 (0.8) 70.9 (0.36) 68.2 (2.01) 70.7 (0.48) 70.8 (0.6) 70.9 (0.3) 70.8 (0.6) 70.8 (0.6) 70.8 (0.6) 69.6 (1.12) 71.0 (0.0) 71.0 (0.0)
M11Moebius 2 3 2.3 (0.04) 2.7 (0.32) 3.6 (0.43) 1.7 (0.95) 2.3 (0.02) 3.2 (0.04) 2.2 (0.01) 2.3 (0.01) 2.3 (0.01) 2.2 (0.01) 2.3 (0.01) 3.1 (0.01)
M12Norm 20 20 19.9 (0.23) 19.9 (0.35) 19.9 (0.35) 20.0 (0.0) 20.0 (0.0) 20.0 (0.0) 20.0 (0.0) 20.0 (0.0) 20.0 (0.0) 20.0 (0.0) 20.0 (0.0) 20.0 (0.0)
M13aScurve 2 3 2.2 (0.03) 2.3 (0.02) 3.5 (0.04) 2.2 (0.02) 2.3 (0.02) 3.0 (0.24) 2.2 (0.01) 2.3 (0.01) 2.3 (0.01) 2.2 (0.01) 2.2 (0.01) 3.0 (0.17)
M13bSpiral 1 13 1.2 (1.05) 2.2 (0.04) 2.3 (0.04) 1.2 (0.09) 1.4 (0.17) 2.1 (0.16) 1.0 (0.0) 1.0 (0.04) 1.0 (0.04) 1.0 (0.0) 1.0 (0.0) 1.0 (0.0)
Mbeta 10 40 7.1 (0.17) 6.3 (0.16) 6.3 (0.21) 7.6 (0.28) 6.7 (0.16) 6.4 (0.12) 8.1 (0.1) 7.0 (0.03) 7.0 (0.03) 8.8 (0.28) 7.3 (0.1) 7.0 (0.02)
Mn1Nonlinear 18 72 17.9 (0.66) 17.9 (0.66) 17.7 (0.53) 17.9 (0.38) 17.7 (0.53) 17.7 (0.43) 17.9 (0.38) 17.9 (0.38) 17.9 (0.38) 17.8 (0.35) 17.7 (0.43) 17.8 (0.35)
Mn2Nonlinear 24 96 24.0 (0.73) 24.0 (0.73) 23.7 (0.72) 24.1 (0.53) 23.9 (0.76) 24.1 (0.53) 23.8 (0.47) 23.8 (0.62) 23.8 (0.62) 24.0 (0.44) 24.0 (0.44) 23.8 (0.35)
Mp1Paraboloid 3 12 3.1 (0.21) 3.3 (0.2) 3.1 (0.21) 3.2 (0.12) 3.3 (0.12) 3.2 (0.14) 3.1 (0.08) 3.2 (0.09) 3.2 (0.09) 3.1 (0.04) 3.2 (0.07) 3.2 (0.07)
Mp2Paraboloid 6 21 6.3 (0.4) 5.7 (0.2) 4.8 (0.16) 5.9 (0.13) 5.9 (0.13) 5.6 (0.12) 5.9 (0.12) 6.3 (0.17) 6.3 (0.17) 6.0 (0.03) 6.6 (0.17) 6.2 (0.11)
Mp3Paraboloid 9 30 8.4 (9.19) 6.2 (0.17) 5.6 (0.19) 8.0 (0.11) 7.0 (0.11) 6.6 (0.19) 8.9 (0.21) 7.7 (0.21) 7.7 (0.21) 9.1 (0.16) 8.0 (0.01) 7.7 (0.16)
Table C4: Number of samples. Fairly accurate even with few points, except on some datasets with non-linearities, such as M3 Nonlinear 4to6, M4 Nonlinear, and M13b Spiral. High dimensional datasets. Performs very well on the Cubic datasets, even with few samples, and also on M12 Norm. Variance. Moderate variance, even with many point samples, on some datasets such as M3 Nonlinear 4to6 and M4 Nonlinear. Note that the variance can also depend on the choice of hyperparameters; see, for 5000 samples, M1 Sphere, M6 Nonlinear, and the Mp2,3 Paraboloids. Hyperparameter dependency. The estimate can vary depending on the hyperparameters on some datasets with nonlinearities and curvature, such as Mbeta, M5b Helix2d, M11 Moebius, M13a,b, Mp3 Paraboloid, and M4 and M8 Nonlinear. Interestingly, the hyperparameter dependency can be worse with more samples, for example on M8 Nonlinear. Nonlinear datasets. The performance can depend on the choice of hyperparameters on the nonlinear datasets. The performance is good on some datasets with non-uniform sampling densities such as M12 Norm and Mn1,2 Nonlinear, but can be challenged by datasets with curvature.
MLE
Samples 625 1250 2500 5000
Dataset d n Best med abs med rel Best med abs med rel Best med abs med rel Best med abs med rel
M1Sphere 10 11 9.9 (0.2) 9.2 (0.21) 9.9 (0.2) 10.1 (0.11) 9.3 (0.1) 10.1 (0.11) 10.3 (0.07) 9.5 (0.08) 10.3 (0.07) 9.7 (0.03) 9.7 (0.04) 9.7 (0.04)
M2Affine3to5 3 5 2.9 (0.02) 2.9 (0.06) 3.2 (0.04) 3.0 (0.02) 3.0 (0.04) 3.2 (0.03) 3.0 (0.02) 3.0 (0.02) 3.3 (0.02) 3.0 (0.02) 3.0 (0.02) 3.0 (0.02)
M3Nonlinear4to6 4 6 3.9 (0.05) 3.9 (0.08) 4.3 (0.09) 4.0 (0.05) 3.9 (0.05) 4.3 (0.06) 4.0 (0.04) 4.0 (0.04) 4.4 (0.04) 4.0 (0.02) 4.0 (0.02) 4.0 (0.02)
M4Nonlinear 4 8 4.1 (0.07) 4.3 (0.08) 4.7 (0.1) 4.0 (0.04) 4.1 (0.06) 4.5 (0.05) 4.0 (0.03) 4.1 (0.03) 4.4 (0.03) 4.0 (0.02) 4.1 (0.02) 4.1 (0.02)
M5aHelix1d 1 3 1.0 (0.02) 1.0 (0.02) 1.1 (0.01) 1.0 (0.01) 1.0 (0.01) 1.1 (0.01) 1.0 (0.0) 1.0 (0.01) 1.1 (0.01) 1.0 (0.0) 1.0 (0.01) 1.0 (0.01)
M5bHelix2d 2 3 2.6 (0.04) 2.8 (0.06) 3.1 (0.05) 2.6 (0.04) 2.7 (0.04) 3.1 (0.04) 2.4 (0.02) 2.5 (0.03) 2.9 (0.03) 2.2 (0.01) 2.3 (0.02) 2.3 (0.02)
M6Nonlinear 6 36 6.6 (0.15) 7.0 (0.17) 7.7 (0.16) 6.3 (0.1) 6.7 (0.11) 7.3 (0.11) 6.2 (0.05) 6.5 (0.06) 7.1 (0.05) 6.0 (0.04) 6.3 (0.06) 6.3 (0.06)
M7Roll 2 3 2.0 (0.04) 2.0 (0.04) 2.2 (0.03) 2.0 (0.01) 2.0 (0.02) 2.2 (0.02) 2.0 (0.01) 2.0 (0.02) 2.2 (0.02) 2.0 (0.01) 2.0 (0.01) 2.0 (0.01)
M8Nonlinear 12 72 12.0 (0.08) 13.9 (0.24) 15.1 (0.22) 12.5 (0.05) 14.1 (0.19) 15.4 (0.21) 13.0 (0.04) 14.1 (0.13) 15.4 (0.15) 13.3 (0.03) 14.2 (0.07) 14.2 (0.07)
M9Affine 20 20 15.8 (0.24) 14.6 (0.24) 15.8 (0.24) 16.2 (0.19) 15.0 (0.23) 16.2 (0.19) 16.6 (0.15) 15.4 (0.17) 16.6 (0.15) 17.0 (0.12) 15.7 (0.11) 15.7 (0.11)
M10aCubic 10 11 9.5 (0.13) 8.8 (0.14) 9.5 (0.13) 9.7 (0.14) 9.0 (0.15) 9.7 (0.14) 10.0 (0.06) 9.2 (0.07) 10.0 (0.06) 10.1 (0.05) 9.3 (0.06) 9.3 (0.06)
M10bCubic 17 18 14.4 (0.23) 13.4 (0.22) 14.4 (0.23) 14.8 (0.15) 13.6 (0.17) 14.8 (0.15) 15.2 (0.15) 14.0 (0.16) 15.2 (0.15) 15.4 (0.1) 14.3 (0.1) 14.3 (0.1)
M10cCubic 24 25 18.4 (0.28) 17.1 (0.25) 18.4 (0.28) 19.3 (0.24) 17.8 (0.21) 19.3 (0.24) 19.8 (0.17) 18.3 (0.17) 19.8 (0.17) 20.2 (0.1) 18.7 (0.14) 18.7 (0.14)
M10dCubic 70 72 37.8 (0.74) 35.2 (0.56) 37.8 (0.74) 39.8 (0.51) 36.9 (0.47) 39.8 (0.51) 41.6 (0.34) 38.5 (0.4) 41.6 (0.34) 43.2 (0.32) 39.9 (0.24) 39.9 (0.24)
M11Moebius 2 3 2.0 (0.03) 2.1 (0.04) 2.2 (0.03) 2.0 (0.02) 2.0 (0.03) 2.2 (0.02) 2.0 (0.01) 2.1 (0.02) 2.2 (0.02) 2.0 (0.01) 2.1 (0.01) 2.1 (0.01)
M12Norm 20 20 16.5 (0.3) 15.3 (0.32) 16.5 (0.3) 17.4 (0.15) 16.0 (0.19) 17.4 (0.15) 18.0 (0.16) 16.6 (0.13) 18.0 (0.16) 18.6 (0.1) 17.2 (0.13) 17.2 (0.13)
M13aScurve 2 3 2.0 (0.04) 2.0 (0.04) 2.2 (0.03) 2.0 (0.03) 2.0 (0.03) 2.2 (0.02) 2.0 (0.01) 2.0 (0.02) 2.2 (0.01) 2.0 (0.01) 2.1 (0.01) 2.1 (0.01)
M13bSpiral 1 13 1.4 (0.01) 1.8 (0.05) 2.0 (0.04) 1.5 (0.03) 1.6 (0.03) 2.0 (0.02) 1.1 (0.01) 1.1 (0.01) 1.3 (0.01) 1.0 (0.0) 1.0 (0.01) 1.0 (0.01)
Mbeta 10 40 6.7 (0.11) 6.0 (0.13) 6.7 (0.11) 7.0 (0.08) 6.3 (0.1) 7.0 (0.08) 7.1 (0.04) 6.4 (0.05) 7.1 (0.04) 7.3 (0.06) 6.6 (0.05) 6.6 (0.05)
Mn1Nonlinear 18 72 14.7 (0.23) 13.6 (0.26) 14.7 (0.23) 15.1 (0.14) 14.0 (0.17) 15.1 (0.14) 15.6 (0.13) 14.4 (0.13) 15.6 (0.13) 15.8 (0.11) 14.6 (0.1) 14.6 (0.1)
Mn2Nonlinear 24 96 18.2 (0.22) 16.9 (0.17) 18.2 (0.22) 18.9 (0.21) 17.5 (0.16) 18.9 (0.21) 19.5 (0.18) 18.1 (0.19) 19.5 (0.18) 20.0 (0.15) 18.5 (0.15) 18.5 (0.15)
Mp1Paraboloid 3 12 2.9 (0.05) 2.9 (0.05) 3.2 (0.04) 3.0 (0.06) 3.0 (0.06) 3.2 (0.04) 3.0 (0.03) 3.0 (0.03) 3.3 (0.02) 3.0 (0.01) 3.0 (0.02) 3.0 (0.02)
Mp2Paraboloid 6 21 5.2 (0.09) 4.9 (0.08) 5.2 (0.09) 5.5 (0.06) 5.1 (0.06) 5.5 (0.06) 5.7 (0.04) 5.3 (0.04) 5.7 (0.04) 5.9 (0.03) 5.5 (0.03) 5.5 (0.03)
Mp3Paraboloid 9 30 6.4 (0.13) 6.0 (0.14) 6.4 (0.13) 7.0 (0.09) 6.5 (0.09) 7.0 (0.09) 7.5 (0.08) 6.9 (0.06) 7.5 (0.08) 7.8 (0.06) 7.2 (0.06) 7.2 (0.06)
Table C5: Number of samples. Accurate on some low dimensional datasets with few samples; needs more samples on those with challenging geometries such as M5b Helix2d, M6 Nonlinear, and M13b Spiral. High dimensional datasets. Performs poorly, and tends to underestimate on the M9 Affine and M10c,d Cubic datasets. Variance. Generally low variance and good consistency between samples. Hyperparameter dependency. The estimate is mostly insensitive to the choice of neighbourhood size, apart from some nonlinear datasets such as M8 Nonlinear and Mn1,2 Nonlinear, and moderately high dimensional ones such as M9 Affine, M12 Norm, and the M10 Cubic datasets. Nonlinear datasets. There is a tendency to underestimate on nonlinear datasets such as Mn1,2 Nonlinear and Mp3 Paraboloid.
MiND ML
Samples 625 1250 2500 5000
Dataset d n Best med abs med rel Best med abs med rel Best med abs med rel Best med abs med rel
M1Sphere 10 11 9.3 (0.52) 9.3 (0.52) 9.2 (0.62) 9.4 (0.5) 9.4 (0.24) 9.4 (0.24) 9.4 (0.5) 9.4 (0.21) 9.4 (0.21) 9.6 (0.48) 9.6 (0.48) 9.5 (0.11)
M2 Affine 3to5 3 5 3.0 (0.0) 3.0 (0.11) 3.0 (0.0) 3.0 (0.0) 3.0 (0.12) 3.0 (0.12) 3.0 (0.0) 3.0 (0.07) 3.0 (0.07) 3.0 (0.0) 3.0 (0.0) 2.9 (0.05)
M3 Nonlinear 4to6 4 6 4.0 (0.0) 3.8 (0.18) 4.0 (0.22) 4.0 (0.0) 3.8 (0.14) 3.8 (0.14) 4.0 (0.0) 3.9 (0.1) 3.9 (0.1) 4.0 (0.0) 4.0 (0.0) 3.9 (0.08)
M4 Nonlinear 4 8 4.0 (0.0) 3.9 (0.22) 4.0 (0.0) 4.0 (0.0) 3.9 (0.13) 3.9 (0.13) 4.0 (0.0) 3.9 (0.1) 3.9 (0.1) 4.0 (0.0) 4.0 (0.0) 3.9 (0.08)
M5a Helix1d 1 3 1.0 (0.0) 1.0 (0.03) 1.0 (0.0) 1.0 (0.0) 1.0 (0.01) 1.0 (0.01) 1.0 (0.0) 1.0 (0.01) 1.0 (0.01) 1.0 (0.0) 1.0 (0.0) 1.0 (0.01)
M5b Helix2d 2 3 2.0 (0.22) 2.3 (0.12) 2.0 (0.22) 2.0 (0.0) 2.1 (0.07) 2.1 (0.07) 2.0 (0.0) 2.0 (0.05) 2.0 (0.05) 2.0 (0.0) 2.0 (0.0) 2.0 (0.02)
M6 Nonlinear 6 36 6.1 (0.44) 6.2 (0.32) 6.1 (0.44) 6.0 (0.0) 6.0 (0.22) 6.0 (0.22) 6.0 (0.0) 5.9 (0.15) 5.9 (0.15) 6.0 (0.0) 6.0 (0.0) 5.9 (0.1)
M7 Roll 2 3 2.0 (0.0) 2.0 (0.05) 2.0 (0.0) 2.0 (0.0) 2.0 (0.06) 2.0 (0.06) 2.0 (0.0) 2.0 (0.03) 2.0 (0.03) 2.0 (0.0) 2.0 (0.0) 2.0 (0.02)
M8 Nonlinear 12 72 13.2 (0.43) 13.4 (0.74) 13.4 (0.8) 13.5 (0.67) 10.0 (0.0) 13.5 (0.59) 13.2 (0.4) 13.3 (0.35) 10.0 (0.0) 13.0 (0.22) 10.0 (0.0) 10.0 (0.0)
M9 Affine 20 20 15.2 (0.69) 15.2 (0.69) 15.2 (0.75) 15.4 (0.51) 10.0 (0.0) 15.4 (0.51) 15.6 (0.37) 15.6 (0.37) 10.0 (0.0) 16.0 (0.22) 10.0 (0.0) 10.0 (0.0)
M10a Cubic 10 11 9.0 (0.35) 9.0 (0.35) 9.0 (0.38) 9.1 (0.34) 9.1 (0.31) 9.1 (0.34) 9.2 (0.14) 9.2 (0.14) 9.2 (0.14) 9.3 (0.12) 9.0 (0.22) 9.3 (0.12)
M10b Cubic 17 18 13.8 (0.75) 13.8 (0.62) 13.8 (0.75) 13.7 (0.56) 10.0 (0.0) 13.7 (0.41) 14.2 (0.35) 14.2 (0.35) 10.0 (0.0) 14.3 (0.29) 10.0 (0.0) 10.0 (0.0)
M10c Cubic 24 25 18.0 (0.77) 18.0 (0.66) 18.0 (0.77) 18.2 (0.58) 10.0 (0.0) 18.2 (0.58) 18.6 (0.49) 18.6 (0.49) 10.0 (0.0) 19.0 (0.22) 10.0 (0.0) 10.0 (0.0)
M10d Cubic 70 72 38.5 (1.9) 20.0 (0.0) 20.0 (0.0) 39.4 (1.07) 10.0 (0.0) 20.0 (0.0) 40.2 (0.84) 30.0 (0.0) 10.0 (0.0) 41.9 (0.7) 10.0 (0.0) 10.0 (0.0)
M11 Moebius 2 3 2.0 (0.0) 2.0 (0.05) 2.0 (0.0) 2.0 (0.0) 2.0 (0.04) 2.0 (0.04) 2.0 (0.0) 2.0 (0.02) 2.0 (0.02) 2.0 (0.0) 2.0 (0.0) 2.0 (0.02)
M12 Norm 20 20 16.2 (0.61) 16.2 (0.61) 16.2 (0.75) 16.8 (0.52) 10.0 (0.0) 16.8 (0.52) 17.2 (0.43) 17.2 (0.31) 10.0 (0.0) 17.5 (0.27) 10.0 (0.0) 10.0 (0.0)
M13a Scurve 2 3 2.0 (0.0) 2.0 (0.05) 2.0 (0.0) 2.0 (0.0) 2.0 (0.05) 2.0 (0.05) 2.0 (0.0) 2.0 (0.03) 2.0 (0.03) 2.0 (0.0) 2.0 (0.0) 2.0 (0.02)
M13b Spiral 1 13 1.0 (0.0) 1.1 (0.05) 1.0 (0.0) 1.0 (0.0) 1.0 (0.02) 1.0 (0.02) 1.0 (0.0) 1.0 (0.01) 1.0 (0.01) 1.0 (0.0) 1.0 (0.0) 1.0 (0.01)
Mbeta 10 40 6.2 (0.27) 6.2 (0.27) 6.2 (0.36) 6.3 (0.21) 6.3 (0.21) 6.3 (0.21) 6.6 (0.5) 6.5 (0.17) 6.5 (0.17) 6.8 (0.36) 6.8 (0.36) 6.6 (0.11)
Mn1 Nonlinear 18 72 14.0 (0.61) 14.0 (0.61) 13.9 (0.7) 14.4 (0.48) 10.0 (0.0) 14.3 (0.42) 14.6 (0.35) 14.6 (0.35) 10.0 (0.0) 14.8 (0.36) 10.0 (0.0) 10.0 (0.0)
Mn2 Nonlinear 24 96 17.8 (0.83) 17.7 (0.83) 17.8 (0.83) 18.3 (0.71) 10.0 (0.0) 18.2 (0.69) 18.3 (0.54) 18.3 (0.54) 10.0 (0.0) 18.8 (0.51) 10.0 (0.0) 10.0 (0.0)
Mp1 Paraboloid 3 12 3.0 (0.0) 2.9 (0.13) 3.0 (0.0) 3.0 (0.0) 3.0 (0.07) 3.0 (0.07) 3.0 (0.0) 3.0 (0.09) 3.0 (0.09) 3.0 (0.0) 3.0 (0.0) 3.0 (0.04)
Mp2 Paraboloid 6 21 5.1 (0.18) 5.1 (0.18) 5.0 (0.22) 5.3 (0.19) 5.3 (0.19) 5.3 (0.19) 5.4 (0.17) 5.4 (0.17) 5.4 (0.17) 5.5 (0.08) 5.4 (0.5) 5.5 (0.08)
Mp3 Paraboloid 9 30 6.6 (0.33) 6.6 (0.33) 6.5 (0.5) 7.0 (0.0) 7.0 (0.21) 7.0 (0.21) 7.2 (0.17) 7.2 (0.17) 7.2 (0.17) 7.4 (0.13) 7.4 (0.48) 7.4 (0.13)
Table C6: Number of samples. On most datasets, especially those of small dimension (< 7), MiND ML is quite accurate even with few samples; for higher dimensional datasets where the estimate is inaccurate, adding more points does not appear to improve the estimation significantly. High dimensional datasets. MiND ML severely underestimates high dimensional datasets, such as M9 Affine and the M10a,b,c,d Cubics. Variance. The variance is low in general, though higher dimensional datasets such as the M10b,c,d Cubics, and non-linear datasets such as Mn1,2 Nonlinear, can induce higher variances. Hyperparameter dependency. On low dimensional datasets, MiND ML has good agreement between the Best, med abs, and med rel estimates, yet on higher dimensional datasets such as the M10b,c,d Cubics, and nonlinear datasets such as M8 Nonlinear, a poor choice of the number of neighbours parameter can lead to a vast under-estimation of the dimension. Nonlinear datasets. MiND ML tends to underestimate on the Mp2,3 Paraboloids, as well as Mn1,2 Nonlinear.
GRIDE
Samples 625 1250 2500 5000
Dataset d n Best med abs med rel Best med abs med rel Best med abs med rel Best med abs med rel
M1 Sphere 10 11 9.2 (0.52) 9.2 (0.52) 9.2 (0.52) 9.4 (0.24) 9.4 (0.24) 9.4 (0.24) 9.4 (0.2) 9.4 (0.2) 9.4 (0.2) 9.5 (0.11) 9.5 (0.11) 9.5 (0.11)
M2 Affine 3to5 3 5 2.9 (0.12) 2.9 (0.12) 2.9 (0.12) 2.9 (0.12) 2.9 (0.12) 2.9 (0.12) 2.9 (0.07) 2.9 (0.07) 2.9 (0.07) 2.9 (0.05) 2.9 (0.05) 2.9 (0.05)
M3 Nonlinear 4to6 4 6 3.8 (0.18) 3.8 (0.18) 3.8 (0.18) 3.8 (0.07) 3.8 (0.14) 3.8 (0.14) 3.9 (0.1) 3.9 (0.1) 3.9 (0.1) 3.9 (0.07) 3.9 (0.08) 3.9 (0.08)
M4 Nonlinear 4 8 4.0 (0.11) 3.9 (0.22) 3.9 (0.22) 4.0 (0.05) 3.9 (0.13) 3.9 (0.13) 4.0 (0.04) 3.9 (0.1) 3.9 (0.1) 4.0 (0.01) 3.9 (0.08) 3.9 (0.08)
M5a Helix1d 1 3 1.0 (0.01) 1.0 (0.03) 1.0 (0.03) 1.0 (0.01) 1.0 (0.02) 1.0 (0.02) 1.0 (0.0) 1.0 (0.02) 1.0 (0.02) 1.0 (0.0) 1.0 (0.02) 1.0 (0.02)
M5b Helix2d 2 3 2.3 (0.11) 2.3 (0.11) 2.3 (0.11) 2.1 (0.09) 2.1 (0.09) 2.1 (0.09) 2.0 (0.06) 2.0 (0.06) 2.0 (0.06) 2.0 (0.02) 2.0 (0.04) 2.0 (0.04)
M6 Nonlinear 6 36 6.2 (0.32) 6.2 (0.32) 6.2 (0.32) 6.0 (0.22) 6.0 (0.22) 6.0 (0.22) 6.0 (0.09) 5.9 (0.15) 5.9 (0.15) 6.0 (0.06) 5.9 (0.1) 5.9 (0.1)
M7 Roll 2 3 2.0 (0.02) 2.0 (0.08) 2.0 (0.08) 2.0 (0.09) 2.0 (0.09) 2.0 (0.09) 2.0 (0.05) 2.0 (0.05) 2.0 (0.05) 2.0 (0.04) 2.0 (0.04) 2.0 (0.04)
M8 Nonlinear 12 72 12.1 (0.09) 13.4 (0.74) 13.4 (0.74) 12.1 (0.05) 13.5 (0.59) 13.5 (0.59) 12.5 (0.04) 13.3 (0.35) 13.3 (0.35) 13.0 (0.02) 13.2 (0.22) 13.2 (0.22)
M9 Affine 20 20 15.2 (0.69) 15.2 (0.69) 15.2 (0.69) 15.4 (0.51) 15.4 (0.51) 15.4 (0.51) 15.6 (0.37) 15.6 (0.37) 15.6 (0.37) 15.9 (0.25) 15.9 (0.25) 15.9 (0.25)
M10a Cubic 10 11 9.0 (0.35) 9.0 (0.35) 9.0 (0.35) 9.1 (0.34) 9.1 (0.34) 9.1 (0.34) 9.2 (0.14) 9.2 (0.14) 9.2 (0.14) 9.3 (0.12) 9.3 (0.12) 9.3 (0.12)
M10b Cubic 17 18 13.8 (0.62) 13.8 (0.62) 13.8 (0.62) 13.7 (0.41) 13.7 (0.41) 13.7 (0.41) 14.2 (0.35) 14.2 (0.35) 14.2 (0.35) 14.3 (0.29) 14.3 (0.29) 14.3 (0.29)
M10c Cubic 24 25 17.9 (0.65) 17.9 (0.65) 17.9 (0.65) 18.2 (0.58) 18.2 (0.58) 18.2 (0.58) 18.6 (0.49) 18.6 (0.49) 18.6 (0.49) 18.9 (0.2) 18.9 (0.2) 18.9 (0.2)
M10d Cubic 70 72 38.4 (1.9) 38.4 (1.9) 38.4 (1.9) 39.3 (0.96) 39.3 (0.96) 39.3 (0.96) 40.1 (0.84) 40.1 (0.84) 40.1 (0.84) 41.8 (0.61) 41.8 (0.61) 41.8 (0.61)
M11 Moebius 2 3 2.0 (0.07) 2.0 (0.07) 2.0 (0.07) 2.0 (0.02) 2.0 (0.08) 2.0 (0.08) 2.0 (0.01) 2.0 (0.04) 2.0 (0.04) 2.0 (0.01) 2.0 (0.03) 2.0 (0.03)
M12 Norm 20 20 16.2 (0.6) 16.2 (0.6) 16.2 (0.6) 16.8 (0.52) 16.8 (0.52) 16.8 (0.52) 17.2 (0.31) 17.2 (0.31) 17.2 (0.31) 17.5 (0.27) 17.5 (0.27) 17.5 (0.27)
M13a Scurve 2 3 2.0 (0.06) 2.0 (0.07) 2.0 (0.07) 2.0 (0.08) 2.0 (0.08) 2.0 (0.08) 2.0 (0.05) 2.0 (0.05) 2.0 (0.05) 2.0 (0.04) 2.0 (0.04) 2.0 (0.04)
M13b Spiral 1 13 1.1 (0.05) 1.1 (0.05) 1.1 (0.05) 1.0 (0.02) 1.0 (0.02) 1.0 (0.02) 1.0 (0.01) 1.0 (0.02) 1.0 (0.02) 1.0 (0.0) 1.0 (0.02) 1.0 (0.02)
Mbeta 10 40 6.2 (0.27) 6.2 (0.27) 6.2 (0.27) 6.3 (0.21) 6.3 (0.21) 6.3 (0.21) 6.5 (0.17) 6.5 (0.17) 6.5 (0.17) 6.6 (0.11) 6.6 (0.11) 6.6 (0.11)
Mn1 Nonlinear 18 72 14.0 (0.61) 14.0 (0.61) 14.0 (0.61) 14.3 (0.42) 14.3 (0.42) 14.3 (0.42) 14.6 (0.35) 14.6 (0.35) 14.6 (0.35) 14.7 (0.2) 14.7 (0.2) 14.7 (0.2)
Mn2 Nonlinear 24 96 17.7 (0.82) 17.7 (0.82) 17.7 (0.82) 18.2 (0.69) 18.2 (0.69) 18.2 (0.69) 18.3 (0.54) 18.3 (0.54) 18.3 (0.54) 18.7 (0.35) 18.7 (0.35) 18.7 (0.35)
Mp1 Paraboloid 3 12 2.9 (0.12) 2.9 (0.12) 2.9 (0.12) 3.0 (0.07) 3.0 (0.07) 3.0 (0.07) 2.9 (0.08) 2.9 (0.08) 2.9 (0.08) 3.0 (0.04) 3.0 (0.04) 3.0 (0.04)
Mp2 Paraboloid 6 21 5.1 (0.18) 5.1 (0.18) 5.1 (0.18) 5.3 (0.19) 5.3 (0.19) 5.3 (0.19) 5.4 (0.17) 5.4 (0.17) 5.4 (0.17) 5.5 (0.08) 5.5 (0.08) 5.5 (0.08)
Mp3 Paraboloid 9 30 6.6 (0.33) 6.6 (0.33) 6.6 (0.33) 7.0 (0.21) 7.0 (0.21) 7.0 (0.21) 7.2 (0.17) 7.2 (0.17) 7.2 (0.17) 7.4 (0.13) 7.4 (0.13) 7.4 (0.13)
Table C7: Number of samples. The estimator is accurate on low dimensional datasets; adding more samples does not dramatically improve performance. High dimensional datasets. Struggles with high dimensional datasets, such as M9 Affine and the M10b,c,d Cubics. Variance. The variance is at a moderately high level compared to other estimators. Hyperparameter dependency. In most cases (72/96), the simplest set of arguments (n1 = 1, n2 = 2) resulted in the best estimation. When (n1 = 1, n2 = 2), the method is equivalent to MLE with input from two nearest neighbour distances (recalled below). Non-linear datasets. This method tends to underestimate on Mbeta, Mn1,2 Nonlinear, and the Mp2,3 Paraboloids.
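For reference, the equivalence just mentioned can be recalled as follows (a standard computation in our own notation, not a quotation of the GRIDE formulation): writing $\mu_i = r_{i,2}/r_{i,1}$ for the ratio of the second to the first nearest-neighbour distance of the $i$-th point and modelling the $\mu_i$ as i.i.d. with density $f(\mu) = d\,\mu^{-(d+1)}$ on $[1,\infty)$, the maximum likelihood estimate of $d$ is obtained from

```latex
\ell(d) = \sum_{i=1}^{N} \log\!\bigl(d\,\mu_i^{-(d+1)}\bigr)
        = N\log d - (d+1)\sum_{i=1}^{N}\log\mu_i,
\qquad
\frac{\partial \ell}{\partial d} = \frac{N}{d} - \sum_{i=1}^{N}\log\mu_i = 0
\;\Longrightarrow\;
\hat d = \frac{N}{\sum_{i=1}^{N}\log\mu_i},
```

so that, with only the first two neighbours as input, the estimate depends on the data only through the distance ratios $\mu_i$.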
CorrInt
Samples 625 1250 2500 5000
Dataset d n Best med abs med rel Best med abs med rel Best med abs med rel Best med abs med rel
M1Sphere 10 11 8.8 (0.17) 8.8 (0.17) 8.4 (0.38) 9.0 (0.12) 9.0 (0.12) 8.6 (0.16) 9.1 (0.06) 9.1 (0.06) 9.1 (0.06) 9.3 (0.06) 9.3 (0.06) 9.3 (0.06)
M2 Affine 3to5 3 5 2.8 (0.04) 2.8 (0.04) 2.8 (0.12) 2.9 (0.04) 2.9 (0.04) 2.8 (0.04) 2.9 (0.02) 2.9 (0.02) 2.9 (0.02) 2.9 (0.03) 2.9 (0.03) 2.9 (0.03)
M3 Nonlinear 4to6 4 6 3.5 (0.07) 3.5 (0.07) 3.4 (0.13) 3.6 (0.08) 3.6 (0.08) 3.5 (0.08) 3.6 (0.03) 3.6 (0.03) 3.6 (0.03) 3.7 (0.03) 3.7 (0.03) 3.7 (0.03)
M4 Nonlinear 4 8 3.9 (0.11) 3.8 (0.1) 3.9 (0.13) 3.8 (0.06) 3.8 (0.04) 3.8 (0.06) 3.8 (0.04) 3.8 (0.04) 3.8 (0.04) 3.8 (0.02) 3.8 (0.02) 3.8 (0.02)
M5a Helix1d 1 3 1.0 (0.02) 1.0 (0.02) 1.0 (0.04) 1.0 (0.01) 1.0 (0.01) 1.0 (0.02) 1.0 (0.01) 1.0 (0.01) 1.0 (0.01) 1.0 (0.01) 1.0 (0.01) 1.0 (0.01)
M5b Helix2d 2 3 2.3 (0.1) 2.7 (0.09) 2.4 (0.15) 2.3 (0.04) 2.7 (0.06) 2.3 (0.05) 2.4 (0.02) 2.4 (0.02) 2.4 (0.02) 2.1 (0.02) 2.1 (0.02) 2.1 (0.02)
M6 Nonlinear 6 36 6.0 (0.16) 6.0 (0.16) 6.1 (0.26) 6.0 (0.14) 5.8 (0.11) 6.0 (0.14) 6.0 (0.07) 5.8 (0.06) 5.8 (0.06) 5.8 (0.04) 5.7 (0.03) 5.7 (0.03)
M7 Roll 2 3 2.0 (0.04) 2.0 (0.04) 1.9 (0.09) 1.9 (0.03) 1.9 (0.03) 1.9 (0.03) 2.0 (0.05) 2.0 (0.02) 2.0 (0.02) 2.0 (0.03) 2.0 (0.02) 2.0 (0.02)
M8 Nonlinear 12 72 11.4 (0.28) 11.4 (0.28) 11.0 (0.49) 11.7 (0.2) 11.7 (0.2) 11.2 (0.2) 11.8 (0.14) 11.8 (0.14) 11.8 (0.14) 12.0 (0.06) 12.0 (0.06) 12.0 (0.06)
M9 Affine 20 20 13.5 (0.2) 13.5 (0.2) 12.7 (0.57) 13.9 (0.19) 13.9 (0.19) 13.1 (0.22) 14.4 (0.15) 14.4 (0.15) 14.4 (0.15) 14.8 (0.1) 14.8 (0.1) 14.8 (0.1)
M10a Cubic 10 11 8.5 (0.15) 8.5 (0.15) 8.1 (0.35) 8.7 (0.15) 8.7 (0.15) 8.3 (0.13) 8.9 (0.06) 8.9 (0.06) 8.9 (0.06) 9.1 (0.06) 9.1 (0.06) 9.1 (0.06)
M10b Cubic 17 18 12.6 (0.18) 12.6 (0.18) 11.9 (0.52) 12.9 (0.15) 12.9 (0.15) 12.2 (0.18) 13.4 (0.14) 13.4 (0.14) 13.4 (0.14) 13.7 (0.09) 13.7 (0.09) 13.7 (0.09)
M10c Cubic 24 25 15.9 (0.28) 15.9 (0.28) 15.0 (0.72) 16.6 (0.24) 16.6 (0.24) 15.4 (0.25) 17.1 (0.18) 17.1 (0.18) 17.1 (0.18) 17.7 (0.1) 17.7 (0.1) 17.7 (0.1)
M10d Cubic 70 72 31.2 (0.66) 31.2 (0.66) 29.0 (0.94) 33.0 (0.42) 33.0 (0.42) 30.0 (0.42) 34.6 (0.32) 34.6 (0.32) 34.6 (0.32) 36.0 (0.24) 36.0 (0.24) 36.0 (0.24)
M11 Moebius 2 3 2.0 (0.02) 2.0 (0.04) 2.1 (0.07) 2.0 (0.03) 2.0 (0.03) 2.0 (0.03) 2.0 (0.04) 2.0 (0.02) 2.0 (0.02) 2.0 (0.02) 2.0 (0.02) 2.0 (0.02)
M12 Norm 20 20 12.4 (0.25) 12.4 (0.25) 11.7 (0.35) 13.1 (0.15) 13.1 (0.15) 12.0 (0.2) 13.6 (0.17) 13.6 (0.17) 13.6 (0.17) 14.2 (0.11) 14.2 (0.11) 14.2 (0.11)
M13a Scurve 2 3 2.0 (0.04) 2.0 (0.04) 1.9 (0.09) 2.0 (0.03) 2.0 (0.03) 1.9 (0.03) 2.0 (0.02) 2.0 (0.02) 2.0 (0.02) 2.0 (0.01) 2.0 (0.01) 2.0 (0.01)
M13b Spiral 1 13 1.3 (0.04) 1.8 (0.07) 1.3 (0.04) 1.5 (0.03) 1.5 (0.03) 1.6 (0.03) 1.1 (0.01) 1.1 (0.01) 1.1 (0.01) 1.0 (0.01) 1.0 (0.01) 1.0 (0.01)
Mbeta 10 40 3.4 (0.18) 3.4 (0.18) 3.1 (0.19) 3.5 (0.12) 3.5 (0.12) 3.2 (0.13) 3.7 (0.11) 3.7 (0.11) 3.7 (0.11) 3.9 (0.08) 3.9 (0.08) 3.9 (0.08)
Mn1 Nonlinear 18 72 12.3 (0.28) 12.3 (0.28) 11.6 (0.5) 12.8 (0.15) 12.8 (0.15) 11.9 (0.19) 13.2 (0.09) 13.2 (0.09) 13.2 (0.09) 13.6 (0.09) 13.6 (0.09) 13.6 (0.09)
Mn2 Nonlinear 24 96 15.0 (0.23) 15.0 (0.23) 14.4 (0.45) 15.7 (0.19) 15.7 (0.19) 14.6 (0.19) 16.3 (0.14) 16.3 (0.14) 16.3 (0.14) 16.9 (0.14) 16.9 (0.14) 16.9 (0.14)
Mp1 Paraboloid 3 12 2.1 (0.11) 2.1 (0.07) 2.1 (0.11) 2.2 (0.1) 2.2 (0.08) 2.1 (0.06) 2.2 (0.06) 2.2 (0.06) 2.2 (0.06) 2.2 (0.05) 2.2 (0.07) 2.2 (0.07)
Mp2 Paraboloid 6 21 2.8 (0.22) 2.7 (0.21) 2.8 (0.19) 2.7 (0.16) 2.7 (0.16) 2.7 (0.16) 2.8 (0.12) 2.7 (0.09) 2.7 (0.09) 2.7 (0.08) 2.6 (0.08) 2.6 (0.08)
Mp3 Paraboloid 9 30 3.1 (0.3) 3.1 (0.29) 3.1 (0.3) 3.1 (0.19) 3.0 (0.22) 3.0 (0.17) 3.1 (0.14) 3.0 (0.13) 3.0 (0.13) 3.0 (0.12) 2.9 (0.13) 2.9 (0.13)
Table C8: Number of samples. In general, estimates do not improve greatly as the sample size is increased. There are some inaccuracies on low dimensional datasets even with high sample densities, such as Mp1 Paraboloid. High dimensional datasets. CorrInt struggles with the high dimensional M9 Affine and M10b,c,d Cubic datasets, with a strong bias to underestimate. Variance. The variance is generally low and decreases as the sample size increases across all datasets. Hyperparameter dependency. In general, CorrInt is insensitive to the choice of parameters. Nonlinear datasets. Can be challenged by datasets with non-uniform sampling density, such as Mn1,2 Nonlinear and Mbeta, as well as those with some curvature, such as the Mp1,2,3 Paraboloids.
WODCap
Samples 625 1250 2500 5000
Dataset d n Best med abs med rel Best med abs med rel Best med abs med rel Best med abs med rel
M1 Sphere 10 11 7.74 (0.08) 7.74 (0.08) 6.93 (0.0) 8.36 (0.07) 6.93 (0.0) 7.63 (0.04) 9.15 (0.0) 9.15 (0.0) 8.06 (0.03) 9.46 (0.03) 9.15 (0.0) 8.4 (0.02)
M2 Affine 3to5 3 5 3.04 (0.06) 3.72 (0.08) 3.42 (0.0) 3.15 (0.04) 3.42 (0.0) 3.45 (0.05) 3.19 (0.04) 3.42 (0.0) 3.55 (0.03) 3.24 (0.02) 3.42 (0.0) 3.61 (0.02)
M3 Nonlinear 4to6 4 6 4.0 (0.23) 4.84 (0.12) 4.3 (0.0) 4.04 (0.07) 4.3 (0.0) 4.31 (0.06) 3.95 (0.04) 4.3 (0.0) 4.48 (0.04) 4.03 (0.04) 4.3 (0.0) 4.61 (0.02)
M4 Nonlinear 4 8 4.3 (0.11) 6.61 (0.19) 5.43 (0.0) 4.22 (0.07) 5.43 (0.0) 5.01 (0.07) 4.17 (0.06) 5.43 (0.0) 4.93 (0.05) 4.19 (0.03) 5.43 (0.0) 4.89 (0.02)
M5a Helix1d 1 3 1.22 (0.07) 2.86 (0.34) 1.3 (0.18) 1.18 (0.05) 1.19 (0.0) 1.25 (0.04) 1.18 (0.03) 1.19 (0.0) 1.25 (0.02) 1.16 (0.04) 1.19 (0.0) 1.26 (0.02)
M5b Helix2d 2 3 3.22 (0.08) 4.16 (0.1) 3.46 (0.19) 3.15 (0.07) 3.55 (0.31) 3.7 (0.06) 3.14 (0.04) 3.42 (0.0) 3.63 (0.04) 2.95 (0.03) 3.46 (0.19) 3.7 (0.04)
M6 Nonlinear 6 36 5.89 (0.15) 8.36 (0.09) 6.93 (0.0) 5.88 (0.08) 7.05 (0.48) 7.23 (0.1) 5.82 (0.05) 6.93 (0.0) 7.31 (0.06) 5.77 (0.04) 6.93 (0.0) 7.22 (0.04)
M7 Roll 2 3 2.21 (0.04) 3.21 (0.17) 2.56 (0.26) 2.28 (0.06) 2.6 (0.22) 2.46 (0.05) 2.31 (0.04) 2.7 (0.0) 2.5 (0.03) 2.32 (0.0) 2.7 (0.0) 2.54 (0.02)
M8 Nonlinear 12 72 9.48 (0.14) 9.48 (0.14) 9.15 (0.0) 10.59 (0.09) 9.15 (0.0) 9.42 (0.11) 11.53 (0.07) 12.72 (1.19) 10.27 (0.06) 11.75 (0.02) 13.12 (0.0) 10.84 (0.03)
M9 Affine 20 20 10.09 (0.13) 10.09 (0.13) 9.15 (0.0) 11.02 (0.07) 9.15 (0.0) 10.12 (0.07) 13.12 (0.0) 13.12 (0.0) 10.84 (0.05) 13.31 (0.03) 13.12 (0.0) 11.37 (0.03)
M10a Cubic 10 11 8.38 (0.12) 8.38 (0.12) 6.93 (0.0) 9.15 (0.0) 9.15 (0.0) 8.07 (0.05) 9.41 (0.05) 9.15 (0.0) 8.47 (0.04) 10.0 (0.03) 9.15 (0.0) 8.71 (0.03)
M10b Cubic 17 18 9.81 (0.09) 9.81 (0.09) 9.15 (0.0) 10.73 (0.07) 9.15 (0.0) 9.81 (0.07) 11.72 (0.05) 10.14 (1.72) 10.45 (0.05) 13.12 (0.0) 13.12 (0.0) 10.89 (0.05)
M10c Cubic 24 25 10.54 (0.07) 10.54 (0.07) 9.15 (0.0) 12.92 (0.86) 12.92 (0.86) 10.65 (0.07) 13.12 (0.0) 13.12 (0.0) 11.34 (0.04) 14.3 (0.03) 13.12 (0.0) 11.8 (0.03)
M10d Cubic 70 72 13.12 (0.0) 11.76 (0.08) 13.12 (0.0) 13.42 (0.13) 13.12 (0.0) 12.03 (0.05) 16.68 (1.24) 13.12 (0.0) 12.52 (0.02) 17.09 (0.0) 13.12 (0.0) 12.75 (0.02)
M11 Moebius 2 3 2.07 (0.07) 3.45 (0.12) 2.7 (0.0) 2.32 (0.0) 2.7 (0.0) 2.64 (0.06) 2.32 (0.0) 2.7 (0.0) 2.62 (0.05) 2.32 (0.0) 2.7 (0.0) 2.6 (0.03)
M12 Norm 20 20 9.53 (0.15) 9.53 (0.15) 9.15 (0.0) 10.79 (0.07) 9.15 (0.0) 9.82 (0.09) 13.12 (0.0) 13.12 (0.0) 10.82 (0.07) 13.12 (0.0) 13.12 (0.0) 11.56 (0.03)
M13a Scurve 2 3 2.09 (0.08) 2.63 (0.06) 2.29 (0.27) 2.24 (0.06) 2.65 (0.18) 2.46 (0.04) 2.31 (0.03) 2.7 (0.0) 2.5 (0.02) 2.32 (0.0) 2.7 (0.0) 2.53 (0.02)
M13b Spiral 1 13 1.3 (0.06) 2.35 (0.05) 2.11 (0.0) 1.8 (0.04) 2.11 (0.0) 2.27 (0.04) 1.47 (0.28) 2.7 (0.0) 2.82 (0.02) 1.19 (0.03) 1.61 (0.0) 2.42 (0.05)
Mbeta 10 40 7.63 (0.19) 7.42 (0.21) 6.63 (0.6) 8.15 (0.12) 6.93 (0.0) 5.79 (0.12) 8.47 (0.09) 6.93 (0.0) 6.08 (0.07) 8.74 (0.06) 6.93 (0.0) 6.45 (0.05)
Mn1 Nonlinear 18 72 9.63 (0.15) 9.63 (0.15) 9.15 (0.0) 10.68 (0.05) 9.15 (0.0) 9.71 (0.05) 11.81 (1.81) 11.81 (1.81) 10.49 (0.05) 13.12 (0.0) 13.12 (0.0) 11.07 (0.03)
Mn2 Nonlinear 24 96 10.15 (0.14) 10.15 (0.14) 9.15 (0.0) 11.26 (0.05) 9.75 (1.42) 10.38 (0.07) 13.12 (0.0) 13.12 (0.0) 11.24 (0.04) 13.76 (0.05) 13.12 (0.0) 11.82 (0.02)
Mp1 Paraboloid 3 12 3.0 (0.08) 3.62 (0.08) 3.42 (0.0) 2.93 (0.05) 3.42 (0.0) 3.39 (0.04) 3.21 (0.04) 3.42 (0.0) 3.54 (0.03) 3.29 (0.02) 3.42 (0.0) 3.63 (0.02)
Mp2 Paraboloid 6 21 4.75 (0.09) 4.28 (0.13) 3.77 (0.43) 5.27 (0.07) 4.3 (0.0) 4.23 (0.05) 5.72 (0.05) 5.43 (0.0) 4.87 (0.05) 5.87 (0.04) 5.43 (0.0) 5.35 (0.03)
Mp3 Paraboloid 9 30 4.7 (0.1) 3.92 (0.09) 3.42 (0.0) 5.47 (0.07) 4.3 (0.0) 4.02 (0.07) 6.06 (0.07) 5.43 (0.0) 4.95 (0.06) 6.92 (0.04) 6.86 (0.33) 5.8 (0.02)
Table C9: Number of samples. Vulnerable even on M1 Sphere to insufficiently many samples, though the performance improves as the sample size increases. Biases can persist even as the number of samples increases. High dimensional datasets. Significant bias to underestimate datasets with dimension beyond 10, even with many samples. Variance. Generally very consistent with low variance, even with few samples. Hyperparameter dependency. Can be quite sensitive to the neighbourhood size parameter, even with many points and on relatively simple, low dimensional datasets such as M7 Roll and M13b Spiral. Nonlinear datasets. On some datasets there is a bias to underestimate, as demonstrated on Mn1,2 Nonlinear and the Mp2,3 Paraboloids.
TwoNN
Samples 625 1250 2500 5000
Dataset d n Best med abs med rel Best med abs med rel Best med abs med rel Best med abs med rel
M1Sphere 10 11 9.2 (0.54) 9.2 (0.54) 9.2 (0.54) 9.4 (0.26) 9.4 (0.26) 9.4 (0.26) 9.4 (0.21) 9.4 (0.21) 9.4 (0.21) 9.5 (0.12) 9.5 (0.12) 9.5 (0.12)
M2 Affine 3to5 3 5 2.9 (0.13) 2.9 (0.13) 2.9 (0.13) 2.9 (0.12) 2.9 (0.12) 2.9 (0.12) 2.9 (0.09) 2.9 (0.09) 2.9 (0.09) 2.9 (0.05) 2.9 (0.05) 2.9 (0.05)
M3 Nonlinear4to6 4 6 3.8 (0.28) 3.8 (0.18) 3.8 (0.18) 3.8 (0.2) 3.8 (0.16) 3.8 (0.16) 3.9 (0.13) 3.9 (0.11) 3.9 (0.11) 3.9 (0.1) 3.9 (0.08) 3.9 (0.08)
M4 Nonlinear 4 8 4.0 (0.22) 3.9 (0.24) 3.9 (0.24) 4.0 (0.15) 3.9 (0.14) 3.9 (0.14) 3.9 (0.13) 3.9 (0.1) 3.9 (0.1) 3.9 (0.08) 3.9 (0.08) 3.9 (0.08)
M5a Helix1d 1 3 1.0 (0.06) 1.0 (0.05) 1.0 (0.05) 1.0 (0.03) 1.0 (0.04) 1.0 (0.04) 1.0 (0.03) 1.0 (0.03) 1.0 (0.03) 1.0 (0.02) 1.0 (0.02) 1.0 (0.02)
M5b Helix2d 2 3 2.2 (0.14) 2.2 (0.14) 2.2 (0.14) 2.1 (0.08) 2.1 (0.08) 2.1 (0.08) 2.0 (0.06) 2.0 (0.07) 2.0 (0.07) 2.0 (0.05) 2.0 (0.04) 2.0 (0.04)
M6 Nonlinear 6 36 6.2 (0.31) 6.2 (0.31) 6.2 (0.31) 6.0 (0.24) 6.0 (0.25) 6.0 (0.25) 6.0 (0.16) 5.9 (0.15) 5.9 (0.15) 5.9 (0.15) 5.9 (0.1) 5.9 (0.1)
M7 Roll 2 3 2.0 (0.1) 2.0 (0.1) 2.0 (0.1) 2.0 (0.1) 2.0 (0.1) 2.0 (0.1) 2.0 (0.06) 2.0 (0.06) 2.0 (0.06) 2.0 (0.03) 2.0 (0.04) 2.0 (0.04)
M8 Nonlinear 12 72 13.4 (0.83) 13.4 (0.83) 13.4 (0.83) 13.5 (0.59) 13.5 (0.59) 13.5 (0.59) 13.2 (0.35) 13.2 (0.35) 13.2 (0.35) 13.2 (0.23) 13.2 (0.23) 13.2 (0.23)
M9 Affine 20 20 15.2 (1.03) 15.1 (0.7) 15.1 (0.7) 15.3 (0.54) 15.3 (0.54) 15.3 (0.54) 15.6 (0.41) 15.6 (0.41) 15.6 (0.41) 15.9 (0.27) 15.9 (0.27) 15.9 (0.27)
M10a Cubic 10 11 9.0 (0.42) 9.0 (0.38) 9.0 (0.38) 9.1 (0.35) 9.1 (0.34) 9.1 (0.34) 9.2 (0.16) 9.2 (0.15) 9.2 (0.15) 9.2 (0.14) 9.2 (0.14) 9.2 (0.14)
M10b Cubic 17 18 13.6 (0.6) 13.6 (0.6) 13.6 (0.6) 13.6 (0.44) 13.6 (0.44) 13.6 (0.44) 14.2 (0.36) 14.2 (0.37) 14.2 (0.37) 14.3 (0.29) 14.3 (0.29) 14.3 (0.29)
M10c Cubic 24 25 17.8 (0.72) 17.8 (0.72) 17.8 (0.72) 18.1 (0.62) 18.1 (0.62) 18.1 (0.62) 18.5 (0.5) 18.5 (0.5) 18.5 (0.5) 18.9 (0.21) 18.9 (0.21) 18.9 (0.21)
M10d Cubic 70 72 38.1 (1.88) 38.1 (1.88) 38.1 (1.88) 39.2 (1.07) 39.2 (1.07) 39.2 (1.07) 39.9 (0.81) 39.9 (0.82) 39.9 (0.82) 41.7 (0.6) 41.7 (0.6) 41.7 (0.6)
M11 Moebius 2 3 2.0 (0.08) 2.0 (0.08) 2.0 (0.08) 2.0 (0.08) 2.0 (0.09) 2.0 (0.09) 2.0 (0.05) 2.0 (0.05) 2.0 (0.05) 2.0 (0.04) 2.0 (0.04) 2.0 (0.04)
M12 Norm 20 20 16.1 (0.65) 16.1 (0.65) 16.1 (0.65) 16.7 (0.56) 16.7 (0.56) 16.7 (0.56) 17.2 (0.29) 17.2 (0.29) 17.2 (0.29) 17.4 (0.29) 17.4 (0.29) 17.4 (0.29)
M13a Scurve 2 3 2.0 (0.09) 2.0 (0.09) 2.0 (0.09) 2.0 (0.09) 2.0 (0.1) 2.0 (0.1) 2.0 (0.06) 2.0 (0.06) 2.0 (0.06) 2.0 (0.03) 2.0 (0.04) 2.0 (0.04)
M13b Spiral 1 13 1.0 (0.06) 1.0 (0.06) 1.0 (0.06) 1.0 (0.03) 1.0 (0.04) 1.0 (0.04) 1.0 (0.03) 1.0 (0.03) 1.0 (0.03) 1.0 (0.02) 1.0 (0.02) 1.0 (0.02)
Mbeta 10 40 6.3 (0.44) 6.2 (0.24) 6.2 (0.24) 6.4 (0.27) 6.2 (0.23) 6.2 (0.23) 6.7 (0.24) 6.5 (0.18) 6.5 (0.18) 6.7 (0.16) 6.6 (0.12) 6.6 (0.12)
Mn1 Nonlinear 18 72 13.9 (0.67) 13.9 (0.67) 13.9 (0.67) 14.2 (0.44) 14.2 (0.44) 14.2 (0.44) 14.5 (0.34) 14.5 (0.34) 14.5 (0.34) 14.6 (0.22) 14.6 (0.22) 14.6 (0.22)
Mn2 Nonlinear 24 96 17.6 (0.79) 17.6 (0.79) 17.6 (0.79) 18.2 (0.68) 18.2 (0.68) 18.2 (0.68) 18.3 (0.54) 18.3 (0.54) 18.3 (0.54) 18.7 (0.39) 18.7 (0.39) 18.7 (0.39)
Mp1 Paraboloid 3 12 2.9 (0.21) 2.8 (0.13) 2.8 (0.13) 3.0 (0.09) 3.0 (0.09) 3.0 (0.09) 3.0 (0.11) 2.9 (0.09) 2.9 (0.09) 3.0 (0.08) 3.0 (0.05) 3.0 (0.05)
Mp2 Paraboloid 6 21 5.1 (0.4) 5.1 (0.17) 5.1 (0.17) 5.3 (0.22) 5.3 (0.22) 5.3 (0.22) 5.4 (0.18) 5.4 (0.18) 5.4 (0.18) 5.5 (0.08) 5.5 (0.08) 5.5 (0.08)
Mp3 Paraboloid 9 30 6.6 (0.36) 6.6 (0.36) 6.6 (0.36) 7.0 (0.21) 7.0 (0.21) 7.0 (0.21) 7.2 (0.17) 7.2 (0.17) 7.2 (0.17) 7.4 (0.13) 7.4 (0.13) 7.4 (0.13)
Table C10: Number of samples. Does not need many samples on low dimensional datasets for good accuracy. High dimensional datasets. Has a bias to underestimate on moderately high dimensional datasets, such as M9 Affine; the underestimation bias worsens with higher dimensions, such as on the M10b,c,d Cubic datasets. Variance. Moderate levels of variance that decrease as the number of samples increases. Hyperparameter dependency. There is only a single hyperparameter, the discard fraction, which is varied between 0.1, 0.25, 0.5, and 0.75; a single choice can achieve consistent performance across all datasets that is close to the best possible performance. The discard fraction for both the med abs and med rel columns is 0.1 for all n except 2500, for which it is 0.75. Nonlinear datasets. Struggles with non-uniform sampling, such as on the Mbeta and M12 Norm datasets, and also Mn1,2 Nonlinear. Challenged also by the higher dimensional Paraboloids Mp2,3.
TLE
Samples 625 1250 2500 5000
Dataset d n Best med abs med rel Best med abs med rel Best med abs med rel Best med abs med rel
M1Sphere 10 11 9.9 (0.14) 9.2 (0.11) 10.9 (0.2) 10.1 (0.09) 10.1 (0.09) 11.2 (0.12) 10.1 (0.05) 10.3 (0.06) 9.8 (0.05) 9.9 (0.04) 10.5 (0.05) 10.5 (0.05)
M2 Affine 3to5 3 5 3.0 (0.03) 3.0 (0.02) 3.4 (0.04) 3.0 (0.02) 3.2 (0.04) 3.4 (0.03) 3.0 (0.01) 3.2 (0.02) 3.1 (0.01) 3.0 (0.01) 3.3 (0.02) 3.3 (0.02)
M3 Nonlinear 4to6 4 6 4.1 (0.05) 3.9 (0.04) 4.5 (0.09) 4.0 (0.05) 4.2 (0.06) 4.6 (0.06) 4.0 (0.02) 4.3 (0.04) 4.1 (0.03) 4.0 (0.01) 4.3 (0.03) 4.3 (0.03)
M4 Nonlinear 4 8 4.5 (0.07) 4.5 (0.07) 5.1 (0.09) 4.4 (0.03) 4.5 (0.05) 4.9 (0.05) 4.3 (0.02) 4.4 (0.03) 4.3 (0.03) 4.2 (0.01) 4.4 (0.02) 4.4 (0.02)
M5a Helix1d 1 3 1.2 (0.01) 1.2 (0.01) 1.2 (0.01) 1.1 (0.0) 1.2 (0.01) 1.2 (0.01) 1.1 (0.0) 1.1 (0.01) 1.1 (0.0) 1.1 (0.0) 1.1 (0.0) 1.1 (0.0)
M5b Helix2d 2 3 2.7 (0.03) 2.9 (0.03) 3.2 (0.05) 2.8 (0.01) 2.9 (0.05) 3.2 (0.04) 2.7 (0.02) 2.7 (0.02) 2.9 (0.02) 2.5 (0.01) 2.5 (0.01) 2.5 (0.01)
M6 Nonlinear 6 36 6.5 (0.08) 7.0 (0.11) 8.1 (0.16) 6.7 (0.05) 7.1 (0.1) 7.8 (0.11) 6.7 (0.04) 6.9 (0.05) 6.8 (0.06) 6.5 (0.02) 6.8 (0.04) 6.8 (0.04)
M7Roll 2 3 2.1 (0.02) 2.1 (0.02) 2.3 (0.02) 2.1 (0.01) 2.2 (0.02) 2.3 (0.02) 2.0 (0.01) 2.2 (0.02) 2.1 (0.01) 2.0 (0.0) 2.2 (0.01) 2.2 (0.01)
M8 Nonlinear 12 72 12.0 (0.12) 12.6 (0.15) 15.3 (0.19) 12.3 (0.08) 14.2 (0.16) 15.7 (0.2) 12.1 (0.05) 14.3 (0.13) 13.6 (0.1) 12.6 (0.03) 14.4 (0.06) 14.4 (0.06)
M9 Affine 20 20 16.6 (0.25) 13.6 (0.09) 16.6 (0.25) 17.1 (0.19) 15.5 (0.18) 17.1 (0.19) 17.6 (0.15) 16.0 (0.14) 14.9 (0.09) 18.1 (0.12) 16.4 (0.09) 16.4 (0.09)
M10a Cubic 10 11 9.7 (0.12) 8.7 (0.07) 10.4 (0.13) 9.9 (0.16) 9.7 (0.13) 10.6 (0.15) 9.9 (0.06) 9.9 (0.06) 9.3 (0.07) 10.0 (0.05) 10.0 (0.05) 10.0 (0.05)
M10b Cubic 17 18 15.4 (0.23) 12.7 (0.11) 15.4 (0.23) 15.8 (0.15) 14.3 (0.14) 15.8 (0.15) 16.4 (0.14) 14.8 (0.12) 13.8 (0.1) 16.7 (0.11) 15.1 (0.1) 15.1 (0.1)
M10c Cubic 24 25 19.2 (0.28) 15.8 (0.15) 19.2 (0.28) 20.2 (0.24) 18.4 (0.18) 20.2 (0.24) 20.9 (0.17) 18.9 (0.14) 17.5 (0.09) 21.5 (0.1) 19.4 (0.09) 19.4 (0.09)
M10d Cubic 70 72 37.1 (0.74) 29.6 (0.36) 37.1 (0.74) 39.4 (0.5) 35.6 (0.4) 39.4 (0.5) 41.3 (0.33) 37.4 (0.29) 33.9 (0.26) 43.2 (0.33) 38.9 (0.26) 38.9 (0.26)
M11 Moebius 2 3 2.2 (0.02) 2.2 (0.02) 2.4 (0.02) 2.2 (0.01) 2.2 (0.02) 2.3 (0.01) 2.1 (0.01) 2.2 (0.02) 2.1 (0.01) 2.1 (0.01) 2.2 (0.01) 2.2 (0.01)
M12 Norm 20 20 16.4 (0.26) 13.3 (0.1) 16.4 (0.26) 17.4 (0.15) 15.8 (0.14) 17.4 (0.15) 18.1 (0.16) 16.5 (0.09) 15.1 (0.12) 18.9 (0.09) 17.1 (0.08) 17.1 (0.08)
M13a Scurve 2 3 2.0 (0.02) 2.1 (0.02) 2.3 (0.02) 2.0 (0.01) 2.2 (0.02) 2.3 (0.02) 2.0 (0.0) 2.2 (0.01) 2.1 (0.01) 2.0 (0.0) 2.2 (0.01) 2.2 (0.01)
M13b Spiral 1 13 1.6 (0.01) 1.9 (0.02) 2.0 (0.03) 1.8 (0.03) 1.8 (0.03) 2.0 (0.02) 1.2 (0.01) 1.3 (0.01) 1.9 (0.02) 1.1 (0.0) 1.1 (0.0) 1.1 (0.0)
Mbeta 10 40 6.5 (0.09) 5.1 (0.08) 6.5 (0.09) 6.8 (0.07) 6.0 (0.07) 6.8 (0.07) 7.1 (0.05) 6.3 (0.05) 5.9 (0.04) 7.4 (0.05) 6.6 (0.05) 6.6 (0.05)
Mn1 Nonlinear 18 72 15.3 (0.21) 12.5 (0.13) 15.3 (0.21) 15.8 (0.12) 14.4 (0.11) 15.8 (0.12) 16.4 (0.13) 14.9 (0.11) 13.8 (0.05) 16.8 (0.11) 15.2 (0.09) 15.2 (0.09)
Mn2 Nonlinear 24 96 18.6 (0.21) 15.2 (0.15) 18.6 (0.21) 19.5 (0.21) 17.7 (0.17) 19.5 (0.21) 20.2 (0.17) 18.3 (0.14) 16.9 (0.09) 20.9 (0.14) 18.9 (0.12) 18.9 (0.12)
Mp1 Paraboloid 3 12 3.0 (0.02) 2.9 (0.03) 3.3 (0.04) 3.0 (0.02) 3.2 (0.04) 3.4 (0.04) 3.0 (0.02) 3.2 (0.02) 3.1 (0.01) 3.0 (0.01) 3.3 (0.01) 3.3 (0.01)
Mp2 Paraboloid 6 21 5.5 (0.09) 4.2 (0.08) 5.5 (0.09) 5.9 (0.05) 5.3 (0.06) 5.9 (0.05) 6.1 (0.03) 5.5 (0.04) 5.2 (0.03) 6.0 (0.03) 5.7 (0.03) 5.7 (0.03)
Mp3 Paraboloid 9 30 6.7 (0.14) 4.5 (0.13) 6.7 (0.14) 7.4 (0.09) 6.4 (0.08) 7.4 (0.09) 7.9 (0.07) 7.0 (0.08) 6.6 (0.06) 8.3 (0.06) 7.4 (0.05) 7.4 (0.05)
Table C11: Number of samples. On some tricky low dimensional datasets, such as M11 Moebius and M13b Spiral, more points are needed to converge to the correct dimension, but on most low dimensional datasets not many points are required to give an accurate estimate. High dimensional datasets. Like many other estimators, TLE underestimates on high dimensional datasets, such as M10b,c,d Cubic and M9 Affine. Variance. The variance is generally low compared to other estimators. Hyperparameter dependency. For most datasets there is a mild sensitivity to the choice of hyperparameter (neighbourhood size), but the dependency is more acute on nonlinear datasets such as M8 and Mn1,2 Nonlinear, M12 Norm, and Mbeta, and on high dimensional datasets, such as M9 Affine and M10a,b,c,d Cubic. Nonlinear datasets. Generally inaccurate, and underestimates on the non-linear datasets referred to above.
ESS
Samples 625 1250 2500 5000
Dataset d n Best med abs med rel Best med abs med rel Best med abs med rel Best med abs med rel
M1Sphere 10 11 10.0 (0.08) 10.2 (0.06) 10.2 (0.06) 10.0 (0.04) 10.1 (0.03) 10.1 (0.03) 10.0 (0.03) 10.1 (0.02) 10.2 (0.01) 10.0 (0.02) 10.1 (0.01) 10.1 (0.01)
M2 Affine 3to5 3 5 3.0 (0.02) 2.8 (0.02) 2.9 (0.01) 3.0 (0.01) 2.9 (0.01) 2.9 (0.01) 3.0 (0.01) 2.9 (0.01) 2.9 (0.01) 3.0 (0.01) 3.0 (0.01) 2.9 (0.0)
M3 Nonlinear4to6 4 6 4.0 (0.05) 3.8 (0.05) 4.0 (0.05) 4.0 (0.04) 3.9 (0.04) 3.9 (0.03) 4.0 (0.02) 3.9 (0.02) 3.9 (0.02) 4.0 (0.02) 4.1 (0.01) 3.9 (0.01)
M4 Nonlinear 4 8 4.3 (0.06) 5.2 (0.08) 5.3 (0.08) 4.1 (0.04) 4.8 (0.03) 4.8 (0.03) 4.0 (0.03) 4.6 (0.02) 4.9 (0.02) 4.0 (0.02) 4.5 (0.01) 4.6 (0.01)
M5a Helix1d 1 3 1.1 (0.0) 1.9 (0.04) 2.0 (0.03) 1.0 (0.0) 1.3 (0.01) 1.2 (0.0) 1.0 (0.0) 1.1 (0.0) 1.2 (0.0) 1.0 (0.0) 1.1 (0.0) 1.1 (0.0)
M5b Helix2d 2 3 2.8 (0.02) 2.8 (0.02) 2.9 (0.01) 2.7 (0.03) 2.9 (0.01) 2.9 (0.01) 2.6 (0.02) 2.9 (0.01) 2.9 (0.01) 2.3 (0.01) 2.8 (0.01) 2.9 (0.01)
M6 Nonlinear 6 36 7.2 (0.16) 8.8 (0.13) 8.8 (0.11) 6.8 (0.08) 8.3 (0.08) 8.4 (0.08) 6.6 (0.05) 7.9 (0.05) 8.4 (0.05) 6.3 (0.03) 7.6 (0.02) 8.0 (0.02)
M7 Roll 2 3 2.0 (0.01) 2.3 (0.02) 2.3 (0.02) 2.0 (0.01) 2.0 (0.01) 2.0 (0.01) 2.0 (0.0) 2.0 (0.0) 2.0 (0.0) 2.0 (0.0) 2.0 (0.0) 2.0 (0.0)
M8 Nonlinear 12 72 14.9 (0.23) 19.9 (0.19) 20.0 (0.21) 14.6 (0.16) 19.2 (0.12) 19.3 (0.12) 14.3 (0.1) 18.6 (0.09) 19.5 (0.08) 14.0 (0.05) 18.2 (0.06) 18.9 (0.04)
M9 Affine 20 20 19.6 (0.06) 19.3 (0.06) 19.3 (0.08) 19.5 (0.05) 19.2 (0.06) 19.2 (0.06) 19.4 (0.04) 19.1 (0.06) 19.4 (0.04) 19.3 (0.02) 19.0 (0.03) 19.3 (0.02)
M10a Cubic 10 11 10.0 (0.06) 10.2 (0.05) 10.3 (0.04) 10.0 (0.07) 10.1 (0.06) 10.2 (0.06) 10.0 (0.04) 10.0 (0.03) 10.2 (0.02) 10.0 (0.02) 10.1 (0.02) 10.1 (0.02)
M10b Cubic 17 18 17.2 (0.09) 17.2 (0.09) 17.3 (0.08) 17.1 (0.04) 17.1 (0.04) 17.1 (0.05) 17.0 (0.04) 17.0 (0.04) 17.3 (0.04) 17.0 (0.03) 17.0 (0.03) 17.2 (0.02)
M10c Cubic 24 25 24.1 (0.13) 24.1 (0.13) 24.1 (0.13) 24.0 (0.08) 24.0 (0.08) 24.0 (0.09) 24.0 (0.06) 23.9 (0.06) 24.3 (0.04) 23.9 (0.03) 23.9 (0.03) 24.2 (0.02)
M10d Cubic 70 72 69.8 (0.37) 67.4 (0.45) 67.5 (0.41) 69.9 (0.18) 67.5 (0.21) 67.5 (0.21) 69.8 (0.1) 67.3 (0.1) 69.8 (0.1) 69.7 (0.06) 67.6 (0.08) 69.7 (0.06)
M11 Moebius 2 3 2.1 (0.02) 2.6 (0.03) 2.7 (0.02) 2.1 (0.01) 2.3 (0.01) 2.3 (0.01) 2.0 (0.01) 2.1 (0.01) 2.3 (0.01) 2.0 (0.0) 2.1 (0.0) 2.1 (0.0)
M12 Norm 20 20 19.8 (0.09) 19.6 (0.1) 19.6 (0.11) 19.8 (0.07) 19.5 (0.08) 19.6 (0.08) 19.8 (0.03) 19.6 (0.04) 19.8 (0.03) 19.8 (0.02) 19.6 (0.03) 19.8 (0.02)
M13a Scurve 2 3 2.0 (0.01) 2.0 (0.01) 2.1 (0.01) 2.0 (0.01) 2.0 (0.01) 2.0 (0.01) 2.0 (0.0) 2.0 (0.0) 2.0 (0.0) 2.0 (0.0) 2.0 (0.0) 2.0 (0.0)
M13b Spiral 1 13 1.6 (0.01) 1.9 (0.01) 1.9 (0.01) 1.9 (0.01) 1.9 (0.01) 2.0 (0.01) 1.0 (0.0) 1.9 (0.0) 2.0 (0.0) 1.0 (0.0) 1.9 (0.0) 2.0 (0.0)
Mbeta 10 40 6.1 (0.11) 5.6 (0.11) 5.9 (0.1) 6.1 (0.06) 5.6 (0.07) 5.8 (0.07) 6.2 (0.04) 5.7 (0.06) 5.8 (0.06) 6.3 (0.03) 6.2 (0.02) 5.9 (0.02)
Mn1 Nonlinear 18 72 18.2 (0.09) 18.2 (0.09) 18.4 (0.09) 18.1 (0.07) 18.1 (0.07) 18.1 (0.07) 18.0 (0.05) 17.9 (0.05) 18.3 (0.04) 18.1 (0.03) 17.9 (0.03) 18.1 (0.03)
Mn2 Nonlinear 24 96 23.6 (0.2) 24.5 (0.18) 24.6 (0.19) 24.4 (0.09) 24.4 (0.09) 24.4 (0.1) 24.2 (0.08) 24.2 (0.08) 24.7 (0.08) 24.0 (0.04) 24.2 (0.04) 24.5 (0.03)
Mp1 Paraboloid 3 12 3.0 (0.03) 2.7 (0.04) 2.9 (0.02) 3.0 (0.02) 2.8 (0.02) 2.9 (0.01) 3.0 (0.01) 2.9 (0.01) 2.9 (0.01) 3.0 (0.01) 2.9 (0.0) 2.9 (0.0)
Mp2 Paraboloid 6 21 5.2 (0.06) 3.9 (0.12) 4.8 (0.06) 5.4 (0.04) 4.5 (0.06) 5.1 (0.04) 5.5 (0.03) 4.9 (0.04) 5.1 (0.04) 5.6 (0.02) 5.4 (0.01) 5.3 (0.01)
Mp3 Paraboloid 9 30 7.0 (0.12) 3.9 (0.13) 5.8 (0.15) 7.4 (0.08) 5.2 (0.1) 6.8 (0.09) 7.7 (0.05) 6.3 (0.07) 6.8 (0.05) 7.9 (0.03) 7.5 (0.02) 7.4 (0.03)
Table C12: Number of samples. For most low dimensional datasets, ESS does not need many points to obtain good accuracy. Notable exceptions are those with interesting
geometry, such as M11 Moebius, M5b Helix2d, and M13b Spiral, though in the case of the latter two, the over-estimation bias persists even as the number of samples increases.
High dimensional datasets. ESS is remarkably accurate on many high dimensional datasets, such as the M10 Cubics, M9 Affine, and M12 Norm, which usually challenge
other estimators. This accuracy is achieved even at low sample sizes. Variance. Low variance across all datasets. Hyperparameter dependency. ESS is mostly insensitive
to the choice of neighbourhood parameter, with some exceptions, such as M13b Spiral, even with many samples. Nonlinear datasets. Unlike other estimators, ESS
struggles with the lower dimensional non-linear datasets such as M4, M8, Mbeta, and the Mp1,2,3 Paraboloids, on some of which other estimators perform well; yet ESS performs
superbly on the high dimensional Mn1,2 Nonlinear datasets, with which other estimators really struggle.
FisherS n = 625 n = 1250 n = 2500 n = 5000
d n Best med abs med rel Best med abs med rel Best med abs med rel Best med abs med rel
M1Sphere 10 11 10.9 (0.16) 10.9 (0.2) 10.9 (0.2) 11.0 (0.06) 11.0 (0.08) 11.0 (0.08) 11.0 (0.02) 11.0 (0.03) 11.0 (0.03) 11.0 (0.02) 11.0 (0.03) 11.0 (0.03)
M2 Affine 3to5 3 5 2.9 (0.01) 2.5 (0.02) 2.5 (0.02) 2.9 (0.01) 2.5 (0.01) 2.5 (0.01) 2.9 (0.01) 2.5 (0.01) 2.5 (0.01) 2.9 (0.0) 2.5 (0.01) 2.5 (0.01)
M3 Nonlinear 4to6 4 6 3.4 (0.12) 2.0 (0.09) 2.0 (0.09) 3.4 (0.07) 1.9 (0.01) 1.9 (0.01) 3.4 (0.02) 1.9 (0.0) 1.9 (0.0) 3.4 (0.01) 1.9 (0.0) 1.9 (0.0)
M4 Nonlinear 4 8 4.1 (0.09) 4.1 (0.09) 4.1 (0.09) 4.1 (0.05) 4.1 (0.05) 4.1 (0.05) 4.1 (0.03) 4.1 (0.03) 4.1 (0.03) 4.1 (0.03) 4.1 (0.03) 4.1 (0.03)
M5a Helix1d 1 3 2.4 (0.18) 2.4 (0.18) 2.4 (0.18) 2.5 (0.01) 2.5 (0.01) 2.5 (0.01) 2.5 (0.01) 2.5 (0.01) 2.5 (0.01) 2.5 (0.01) 2.5 (0.01) 2.5 (0.01)
M5b Helix2d 2 3 2.0 (0.0) 1.8 (0.01) 1.8 (0.01) 2.0 (0.0) 1.8 (0.0) 1.8 (0.0) 2.0 (0.21) 1.8 (0.0) 1.8 (0.0) 2.0 (0.0) 1.8 (0.0) 1.8 (0.0)
M6 Nonlinear 6 36 5.8 (0.11) 5.8 (0.11) 5.8 (0.11) 5.8 (0.09) 5.8 (0.09) 5.8 (0.09) 5.8 (0.05) 5.8 (0.05) 5.8 (0.05) 5.8 (0.04) 5.8 (0.04) 5.8 (0.04)
M7 Roll 2 3 2.5 (0.02) 2.5 (0.02) 2.5 (0.02) 2.5 (0.02) 2.5 (0.02) 2.5 (0.02) 2.5 (0.01) 2.5 (0.01) 2.5 (0.01) 2.5 (0.01) 2.5 (0.01) 2.5 (0.01)
M8 Nonlinear 12 72 11.0 (0.31) 11.0 (0.31) 11.0 (0.31) 10.9 (0.21) 10.9 (0.21) 10.9 (0.21) 10.8 (0.15) 10.8 (0.15) 10.8 (0.15) 10.8 (0.08) 10.8 (0.08) 10.8 (0.08)
M9 Affine 20 20 19.5 (0.57) 11.3 (0.42) 11.3 (0.42) 19.2 (0.39) 11.1 (0.2) 11.1 (0.2) 18.9 (0.28) 11.0 (0.12) 11.0 (0.12) 18.8 (0.15) 11.1 (0.05) 11.1 (0.05)
M10a Cubic 10 11 10.5 (0.1) 7.3 (0.09) 7.3 (0.09) 10.4 (0.09) 7.2 (0.06) 7.2 (0.06) 10.4 (0.07) 7.2 (0.03) 7.2 (0.03) 10.3 (0.04) 7.2 (0.02) 7.2 (0.02)
M10b Cubic 17 18 17.5 (0.49) 10.9 (0.23) 10.9 (0.23) 17.1 (0.37) 10.8 (0.16) 10.8 (0.16) 17.0 (0.3) 10.8 (0.09) 10.8 (0.09) 16.9 (0.15) 10.8 (0.05) 10.8 (0.05)
M10c Cubic 24 25 24.5 (0.85) 16.2 (1.25) 16.2 (1.25) 24.2 (0.45) 14.9 (1.11) 14.9 (1.11) 23.9 (0.41) 14.5 (0.32) 14.5 (0.32) 23.6 (0.32) 14.4 (0.14) 14.4 (0.14)
M10d Cubic 70 72 nan (nan) nan (nan) nan (nan) nan (nan) nan (nan) nan (nan) nan (nan) nan (nan) nan (nan) nan (nan) nan (nan) nan (nan)
M11 Moebius 2 3 2.0 (0.0) 2.1 (0.01) 2.1 (0.01) 2.0 (0.0) 2.1 (0.0) 2.1 (0.0) 2.0 (0.0) 2.1 (0.0) 2.1 (0.0) 2.0 (0.0) 2.1 (0.0) 2.1 (0.0)
M12 Norm 20 20 19.9 (0.33) 9.2 (0.18) 9.2 (0.18) 19.9 (0.34) 9.1 (0.13) 9.1 (0.13) 19.9 (0.11) 9.0 (0.08) 9.0 (0.08) 20.0 (0.18) 8.9 (0.05) 8.9 (0.05)
M13a Scurve 2 3 1.9 (0.02) 1.8 (0.01) 1.8 (0.01) 1.9 (0.01) 1.8 (0.01) 1.8 (0.01) 1.9 (0.01) 1.8 (0.01) 1.8 (0.01) 1.9 (0.01) 1.8 (0.0) 1.8 (0.0)
M13b Spiral 1 13 2.0 (0.0) 2.0 (0.0) 2.0 (0.0) 2.0 (0.0) 2.0 (0.0) 2.0 (0.0) 2.0 (0.0) 2.0 (0.0) 2.0 (0.0) 2.0 (0.0) 2.0 (0.0) 2.0 (0.0)
Mbeta 10 40 5.3 (0.1) 3.6 (0.03) 3.6 (0.03) 5.3 (0.05) 3.6 (0.02) 3.6 (0.02) 5.3 (0.05) 3.6 (0.02) 3.6 (0.02) 5.3 (0.04) 3.6 (0.01) 3.6 (0.01)
Mn1 Nonlinear 18 72 18.4 (0.78) 9.9 (0.32) 9.9 (0.32) 18.3 (0.53) 9.9 (0.18) 9.9 (0.18) 17.4 (0.43) 9.9 (0.12) 9.9 (0.12) 18.4 (0.22) 9.9 (0.06) 9.9 (0.06)
Mn2 Nonlinear 24 96 23.9 (0.92) 13.6 (1.13) 13.6 (1.13) 24.9 (1.0) 12.9 (0.36) 12.9 (0.36) 22.9 (0.75) 12.9 (0.22) 12.9 (0.22) 23.3 (0.31) 12.9 (0.09) 12.9 (0.09)
Mp1 Paraboloid 3 12 1.4 (0.01) 1.4 (0.01) 1.4 (0.01) 1.4 (0.01) 1.4 (0.01) 1.4 (0.01) 1.4 (0.01) 1.4 (0.01) 1.4 (0.01) 1.4 (0.0) 1.4 (0.0) 1.4 (0.0)
Mp2 Paraboloid 6 21 1.3 (0.01) 1.3 (0.01) 1.3 (0.01) 1.3 (0.01) 1.3 (0.01) 1.3 (0.01) 1.3 (0.01) 1.3 (0.01) 1.3 (0.01) 1.3 (0.0) 1.3 (0.0) 1.3 (0.0)
Mp3 Paraboloid 9 30 1.4 (0.01) 1.4 (0.01) 1.4 (0.01) 1.3 (0.01) 1.3 (0.01) 1.3 (0.01) 1.3 (0.01) 1.3 (0.01) 1.3 (0.01) 1.3 (0.0) 1.3 (0.0) 1.3 (0.0)
Table C13: Number of samples. The performance of FisherS does not significantly improve when more samples are added. Even with many samples, FisherS can be biased
on some low-dimensional datasets, such as M3 Nonlinear 4to6, M5a Helix1d, and M7 Roll. High dimensional datasets. With the correct hyperparameters, FisherS can
perform well on high dimensional datasets, but it is quite dependent on a good choice. As FisherS relies on inverting a non-linear function in its routine, the estimator
returned an error on M10d Cubic. Variance. Variance is mostly low, but for some nonlinear datasets, such as Mn1,2 Nonlinear with few samples, it can be high.
Hyperparameter dependency. On nonlinear datasets such as M12 Norm, Mbeta, Mn1,2 Nonlinear, and M4 Nonlinear, and on high dimensional datasets such as M9 Affine
and the M10 Cubics, there is a significant dependency on the hyperparameter choice. Nonlinear datasets. The performance on nonlinear datasets depends on the hyperparameter
choices, though FisherS grossly underestimates on the Mp1,2,3 Paraboloids.
CDim n = 625 n = 1250 n = 2500 n = 5000
Best Med Abs Med Rel Best Med Abs Med Rel Best Med Abs Med Rel Best Med Abs Med Rel
M1 Sphere 10 11 5.0 5.0 5.0 5.1 5.1 5.1 5.5 5.5 5.5 6.0 6.0 5.7
M2 Affine 3to5 3 5 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
M3 Nonlinear 4to6 4 6 4.0 4.0 4.0 4.0 3.9 3.9 4.0 4.0 4.0 4.0 4.0 4.0
M4 Nonlinear 4 8 4.0 4.0 4.0 4.0 4.2 4.2 4.0 4.2 4.2 4.0 4.0 4.1
M5a Helix1d 1 3 1.0 2.0 2.0 1.0 1.7 1.4 1.0 1.1 1.1 1.0 1.0 1.0
M5b Helix2d 2 3 2.0 3.0 3.0 2.0 3.0 3.0 2.0 3.0 3.0 2.0 3.0 3.0
M6 Nonlinear 6 36 5.0 4.0 5.0 5.0 4.6 4.4 5.0 4.7 4.7 5.0 5.0 4.8
M7 Roll 2 3 2.0 2.0 3.0 2.0 2.5 2.4 2.0 2.1 2.1 2.0 2.0 2.0
M8 Nonlinear 12 72 3.0 3.0 3.0 3.3 3.3 3.0 3.6 3.6 3.6 4.0 4.0 3.8
M9 Affine 20 20 3.3 3.0 3.0 4.0 3.7 3.3 4.0 4.0 4.0 4.2 4.0 4.2
M10a Cubic 10 11 5.0 4.0 5.0 5.0 4.9 4.8 5.1 5.1 5.1 5.2 5.0 5.2
M10b Cubic 17 18 4.0 3.0 4.0 4.0 4.0 3.7 4.3 4.3 4.3 5.0 5.0 4.6
M10c Cubic 24 25 3.0 3.0 3.0 3.0 3.0 2.7 3.4 3.4 3.4 4.0 4.0 3.6
M10d Cubic 70 72 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.1 1.0 1.1
M11 Moebius 2 3 2.0 3.0 3.0 2.0 2.8 2.7 2.0 2.3 2.3 2.0 2.0 2.1
M12 Norm 20 20 3.1 3.0 3.0 3.3 3.3 2.8 3.5 3.5 3.5 4.0 4.0 3.6
M13a Scurve 2 3 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
M13b Spiral 1 13 1.8 2.0 2.0 1.6 2.0 2.0 1.0 2.0 2.0 1.0 2.0 2.0
Mbeta 10 40 4.0 3.0 4.0 4.0 3.8 3.7 4.0 4.0 4.0 4.1 4.0 4.1
Mn1 Nonlinear 18 72 3.4 3.0 3.0 4.0 3.7 3.3 4.0 3.9 3.9 4.2 4.0 4.2
Mn2 Nonlinear 24 96 3.0 2.0 3.0 3.0 2.9 2.5 3.1 3.1 3.1 3.4 3.0 3.4
Mp1 Paraboloid 3 12 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
Mp2 Paraboloid 6 21 4.0 4.0 4.0 4.1 4.1 3.9 4.4 4.4 4.4 5.0 5.0 4.5
Mp3 Paraboloid 9 30 4.0 3.0 4.0 4.0 3.8 3.6 4.1 4.1 4.1 4.4 4.0 4.4
Table C14: Only one run per dataset was performed due to the computation time, so no variance is reported. Overall, the estimator struggles
with higher dimensions but remains accurate on lower dimensions regardless of codimension. The very slight improvements on higher-dimensional
datasets when more points are available suggest that the number of points needed to accurately estimate dimension in these cases is extremely large.
D Hyperparameter choices
This appendix contains the hyperparameters for the benchmarking experiments in Appendix C.
Across all local estimators and all sampling regimes, we varied between ϵ-neighbourhoods and k-nearest-neighbour
neighbourhoods; the k-nearest-neighbour parameter is varied only over k ∈ {10, 20, 40, 80}. We also
varied the ϵ-neighbourhood across four settings. For consistency across datasets, given a point set, the ϵ
parameter is chosen to be the median k-nearest-neighbour distance of that point set, where k ∈ {10, 20, 40, 80}
as well.
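As an illustration of the ϵ selection just described, the sketch below computes the median k-nearest-neighbour distance of a point set for each k ∈ {10, 20, 40, 80}. This is a minimal sketch rather than the exact benchmarking code; the use of scikit-learn and the placeholder point set are assumptions.

```python
# Minimal sketch: choose epsilon as the median distance to the k-th nearest
# neighbour of the point set, for each k in {10, 20, 40, 80}.
import numpy as np
from sklearn.neighbors import NearestNeighbors


def median_knn_distance(X, k):
    """Median distance from each point of X to its k-th nearest neighbour."""
    # k + 1 neighbours are requested because each point is returned as its own
    # nearest neighbour at distance zero.
    distances, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    return np.median(distances[:, k])


rng = np.random.default_rng(0)
X = rng.standard_normal((2500, 5))  # placeholder point set
epsilons = {k: median_knn_distance(X, k) for k in (10, 20, 40, 80)}
print(epsilons)
```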
lPCA
thresholding version FO(α = 0.05)
Fan(α = 10, β = 0.8, P = 0.95)
Maxgap
Ratio (α = 0.05)
Participation ratio
Kaiser
broken stick
Minka
Table D1: lPCA thresholding methods being varied in our benchmark experiments.
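To make the thresholding terminology concrete, the sketch below applies two of the listed rules, maxgap and the participation ratio, to the eigenvalue spectrum of a PCA on a local neighbourhood. It is an illustrative sketch under the assumption of a given neighbourhood, not the implementation used in the benchmark.

```python
# Illustrative sketch of two lPCA thresholding rules applied to a local neighbourhood.
# The neighbourhood X_local is a placeholder; the benchmarked code may differ.
import numpy as np


def maxgap_dimension(eigvals):
    """Dimension = position of the largest gap in the descending eigenvalue spectrum."""
    eigvals = np.sort(eigvals)[::-1]
    gaps = eigvals[:-1] - eigvals[1:]
    return int(np.argmax(gaps)) + 1


def participation_ratio(eigvals):
    """Continuous dimension proxy: (sum of eigenvalues)^2 / (sum of squared eigenvalues)."""
    return float(np.sum(eigvals) ** 2 / np.sum(eigvals**2))


rng = np.random.default_rng(1)
# Placeholder neighbourhood: a noisy 3-dimensional affine patch in R^10.
X_local = rng.standard_normal((200, 3)) @ rng.standard_normal((3, 10))
X_local += 0.01 * rng.standard_normal(X_local.shape)
eigvals = np.linalg.eigvalsh(np.cov(X_local, rowvar=False))
print(maxgap_dimension(eigvals), participation_ratio(eigvals))
```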
PH0
alpha 0.5, 1.0, 1.5, 2.0
n range min 0.75
n range max 1.0
range type fraction
subsamples 10
nsteps 10
n neighbours 30
PH0 625 1250 2500 5000
med abs med rel med abs med rel med abs med rel med abs med rel
alpha 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
KNN
k 1,. . . ,80
n range min 0.75
n range max 1.0
range type fraction
subsamples 10
nsteps 10
Table D5: KNN hyperparameter range in the benchmark experiments. Note that, unlike other estimators with a
nearest-neighbour hyperparameter k, the implementation incurs an almost trivial additional cost to compute the
estimates for all values of k up to a specified maximum.
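This is cheap because a single neighbour query at the maximum k returns sorted neighbour distances whose prefixes give the distances for every smaller k. The sketch below illustrates this general point; it is not the benchmarked KNN estimator itself, and the placeholder point set is an assumption.

```python
# One neighbour query at k_max yields the sorted neighbour distances for every
# k <= k_max, so sweeping over k costs essentially one query.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(2)
X = rng.standard_normal((1000, 8))  # placeholder point set

k_max = 80
distances, _ = NearestNeighbors(n_neighbors=k_max + 1).fit(X).kneighbors(X)
distances = distances[:, 1:]  # drop the zero distance of each point to itself

# Neighbour distances for any k in 1..k_max are simply the first k columns.
for k in (1, 10, 40, 80):
    d_k = distances[:, :k]
    print(k, d_k.shape, d_k[:, -1].mean())
```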
GRIDE
n1 1, 2, 4, 8, 16, 32
multiplier 2, 3, 5
d0 1
d1 150
Table D8: Hyperparameters for the estimates in Table C7. Note that these choices make the estimator
identical to MLE with input from two nearest neighbours.
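For reference, the sketch below shows a maximum-likelihood estimator restricted to the two nearest neighbours (the Levina–Bickel local estimate with k = 2), the estimator to which these GRIDE settings reduce; the aggregation over points mirrors the comb options listed for MLE further below. This is a hedged sketch, and the exact benchmarked code may differ.

```python
# Sketch of the MLE using only the two nearest neighbours of each point:
# the Levina-Bickel local estimate with k = 2 is 1 / log(r2 / r1), aggregated
# over points by mean, harmonic mean, or median.
import numpy as np
from scipy.stats import hmean
from sklearn.neighbors import NearestNeighbors


def two_nn_mle(X, comb="hmean"):
    # Column 0 of the distance matrix is each point's zero distance to itself.
    distances, _ = NearestNeighbors(n_neighbors=3).fit(X).kneighbors(X)
    r1, r2 = distances[:, 1], distances[:, 2]
    local = 1.0 / np.log(r2 / r1)  # local dimension estimate at each point
    if comb == "mean":
        return float(np.mean(local))
    if comb == "hmean":
        return float(hmean(local))
    return float(np.median(local))


rng = np.random.default_rng(3)
X = rng.standard_normal((5000, 7))  # 7-dimensional Gaussian cloud
print(two_nn_mle(X, comb="hmean"))
```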
CorrInt
k1 2, 4, 6, 8, 10, 12
k2 12, 14, 16, 18, 20
DANCo
k 10, 20, 25, 30, 35, 40, 80
ver DANCo
ESS
ver a, b
d 1
FisherS
conditional number 5, 6, 8, 10, 12, 13
project on sphere 0, 1
limit maxdim 0, 1
MiND ML
k 1, 2, 3, 4, 5, 10, 15, 20
ver MLk, MLi
D 10, 20, 30, 40, 50, 60, 70
MiND ML 625 1250 2500 5000
med abs med rel med abs med rel med abs med rel med abs med rel
k 1 1 1 1 1 1 1 1
ver MLk MLi MLk MLk MLk MLk MLi MLk
D 20 20 10 20 30 10 10 10
MLE
k 10, 20, 40, 80
comb mean, hmean, median
TLE
n neighbours 10, 20, 40, 80
epsilon 1e-06, 2e-06, 3e-06, 5e-06
TwoNN
discard fraction 0.05, 0.06, 0.08, 0.09, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6
WODCap
k 10, 20, 40, 80
comb mean, hmean, median
WODCap 625 1250 2500 5000
med abs med rel med abs med rel med abs med rel med abs med rel
comb hmean median median mean median mean median mean
k 20 20 20 20 20 20 20 20
E Noise experiments
Table E1: Mean estimated dimension along with standard deviation for the unit sphere S^6 ⊂ R^11 with
ambient Gaussian noise. The variance of the noise is given at the top of the columns.
Table E2: Mean estimated dimension along with standard deviations for S^6 ⊂ R^11 with uniform background
noise (outliers). The number of outlier points is given at the top of the columns.
Table E3: Mean estimated dimension along with standard deviations for S^10 ⊂ R^11 with ambient Gaussian
noise. The variance of the noise is given at the top of the columns.
Estimator n=0 n=25 n=125 n=250
lPCA 10.00 (0.000) 10.00 (0.000) 10.00 (0.000) 10.00 (0.000)
MLE 9.26 (0.14) 9.33 (0.14) 9.49 (0.13) 9.58 (0.13)
PH0 9.36 (0.22) 9.47 (0.30) 9.74 (0.38) 10.05 (0.45)
KNN 9.91 (1.03) 10.17 (1.31) 9.72 (1.37) 9.18 (1.42)
WODCap 9.15 (0.0) 9.15 (0.0) 9.15 (0.0) 9.15 (0.0)
GRIDE 9.4 (0.200) 9.48 (0.222) 9.53 (0.204) 9.52 (0.245)
TwoNN 9.4 (0.210) 9.49 (0.275) 9.49 (0.263) 9.48 (0.230)
DANCo 10.9 (0.300) 11.00 (0.000) 11.00 (0.000) 11.00 (0.000)
MiND ML 9.4 (0.050) 10.00 (0.458) 10.00 (0.490) 10.00 (0.497)
CorrInt 9.1 (0.060) 9.14 (0.090) 9.11 (0.081) 9.16 (0.136)
ESS 10.0 (0.030) 9.97 (0.051) 9.88 (0.106) 9.66 (0.248)
FisherS 11.0 (0.020) 6.65 (0.052) 4.78 (0.051) 4.07 (0.068)
TLE 10.1 (0.050) 9.72 (0.189) 9.64 (0.201) 9.55 (0.193)
Table E4: Mean estimated dimension along with standard deviations for S^10 ⊂ R^11 with uniform background
noise (outliers). The number of outlier points is given at the top of the columns.
Table E5: Mean estimated dimension along with standard deviation for the SO(4) dataset with ambient
Gaussian noise. The variance of the noise is given at the top of the columns.
Table E6: Mean estimated dimension along with standard deviations for SO(4) with uniform background
noise (outliers). The number of outlier points is given at the top of the columns.
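For reference, the two noise models used in Tables E1–E6 can be reproduced along the following lines. This is a hedged sketch: the sample size and the bounding box used for the uniform background outliers are assumptions made for illustration, not the exact experimental setup.

```python
# Sketch of the two noise models applied to S^10 in R^11:
# (i) ambient Gaussian noise of a given variance added to every sample, and
# (ii) uniform background outliers appended to the clean sample.
# The outlier bounding box and sample size are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(4)
n, D = 2500, 11

# Uniform sample on the unit sphere S^10 in R^11.
X = rng.standard_normal((n, D))
X /= np.linalg.norm(X, axis=1, keepdims=True)

# (i) Ambient Gaussian noise with variance sigma2 in every coordinate.
sigma2 = 0.01
X_gauss = X + rng.normal(scale=np.sqrt(sigma2), size=X.shape)

# (ii) Uniform background outliers, e.g. 250 points in the cube [-1, 1]^11.
n_out = 250
outliers = rng.uniform(-1.0, 1.0, size=(n_out, D))
X_outliers = np.vstack([X, outliers])

print(X_gauss.shape, X_outliers.shape)
```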