Manifold Learning: What, How, and Why
Marina Meila, Hanyu Zhang
November 8, 2023
Abstract
Manifold learning (ML), known also as non-linear dimension reduction, is a set of meth-
ods to find the low dimensional structure of data. Dimension reduction for large, high
dimensional data is not merely a way to reduce the data; the new representations and
descriptors obtained by ML reveal the geometric shape of high dimensional point clouds,
and allow one to visualize, denoise and interpret them. This survey presents the principles
underlying ML, the representative methods, as well as their statistical foundations from
a practicing statistician’s perspective. It describes the trade-offs, and what theory tells us
about the parameter and algorithmic choices we make in order to obtain reliable conclusions.
Contents
1 Introduction
2 Mathematical background
  2.1 Notations
  2.2 Manifold and Embedding
  2.3 (Riemannian) geometry on manifolds and isometric embedding
4 Embedding algorithms
  4.1 Review: Principal Component Analysis (PCA)
  4.2 “One shot” embedding algorithms
    4.2.1 Isomap
    4.2.2 Diffusion Maps/Laplacian Eigenmaps
    4.2.3 Local Tangent Space Alignment (LTSA)
  4.3 “Horseshoe” effects, neighbor embedding algorithms, and selecting independent eigenvectors
    4.3.1 Relaxation-based neighbor embedding algorithms
    4.3.2 Avoiding the REP in spectral embeddings
  4.4 Summary of embedding algorithms
5 Statistical basis of manifold learning
  5.1 Biases in ML. Effects of sampling density and graph construction
  5.2 Choosing the scale of neighborhood
  5.3 Estimating the intrinsic dimension
  5.4 Estimating the Laplace-Beltrami operator
  5.5 Embedding distortions. Is isometric embedding possible?
7 Conclusion
1 Introduction
Modern data analysis tasks often face challenges of high dimension and thus nonlinear dimension
reduction techniques emerge as a way to construct maps from high dimensional data to their
corresponding low dimensional representations. Finding such low dimensional representations
of high dimensional data is beneficial in several aspects. This saves space and processing time.
More importantly, the low dimensional representation often provides a better understanding of
the intrinsic structure of data, which often leads to better features that can be fed into further
data analysis algorithms. This survey paper reviews the mathematical background, methodology,
and recent development of nonlinear dimension reduction techniques. These techniques have been
developed for two decades since two seminal works: Tenenbaum et al. (2000) and Roweis & Saul
(2000), and are widely used in various data analysis jobs, especially in scientific research.
Before nonlinear dimension reduction emerged, Principal Component Analysis (PCA) was already widely accepted (I.T. Jolliffe, 2002). Intuitively, PCA assumes that high dimensional
data living in RD lie around a lower dimensional linear subspace of RD and seeks to find the best
linear subspace such that data points projected onto this subspace have minimal reconstruction
error. Nonlinear dimension reduction algorithms extend this idea by assuming data are supported
on smooth nonlinear low dimensional geometric objects, i.e., manifolds embedded in RD , and find
maps that send the samples into lower dimensional coordinates while preserving some intrinsic
geometric information.
In this survey, we start with a brief introduction to the main differential geometric concepts
underlying ML, elaborating on the geometric information that manifolds carry (Section 2). Then,
in Section 3, we describe the paradigm of manifold learning, with three possible sub-paradigms,
each producing a different representation of the data manifold. The rest of the paper focuses on
one of these, namely on the so-called embedding algorithms. In Section 4, we survey representa-
tive manifold learning algorithms and their variants. We also discuss the parameter choices, as
well as some pitfalls, which leads to the discussion in Section 5, where we present the statistical
aspects and statistical results supporting these methods. This section also includes the esti-
mation of crucial manifold descriptors from data: the Laplace-Beltrami operator, Riemannian
metrics, tangent space, intrinsic dimensions. Section 6 discusses applications, connecting with
related statistics problems, and Section 7 concludes the survey.
2 Mathematical background
2.1 Notations
In this survey, we use the notation RD to represent the D-dimensional Euclidean space. A
manifold is denoted as M. Lowercase Greek letters such as φ, ϕ, ψ, ρ, · · · represent functions
mapping from M to a subset of Euclidean space, while English letters like f, g, h, · · · denote
functions between real spaces. The notation C ℓ refers to the class of functions with continuous
derivatives up to order ℓ, and C ∞ represents the class of indefinitely differentiable functions.
The Kronecker delta is symbolized by δij . Bold lowercase English letters, like v and x, are used
to denote vectors in Euclidean space. A dataset containing n data points is represented as the
set D = {x1 , . . . , xn }. Bold uppercase letters, such as A, B, C, denote matrices. By convention, we
treat a single data point as a column vector and use the matrix X ∈ Rn×D to represent the data
matrix of a dataset with n data points, each being a vector in D-dimensional Euclidean space,
while Y ∈ Rn×m will represent the same data mapped into m dimension by a manifold learning
algorithm. The notation ∥v∥ signifies the ℓ2 norm of vector v. Throughout this survey, we also
assume that all functions are smooth, i.e., continuously differentiable as many times as necessary.
Figure 1: (a) Swiss Roll; (b) Torus.
For a data scientist, the above means that, (1), they can work in the coordinate system of
their choice, and intrinsic quantities like d will remain invariant. But, (2), care must be taken
when the low dimensional data from two different algorithms, or from different samples are being
compared, because these may not be in the same coordinate system.
Figure 2: Left: The ethanol molecule has 9 atoms; a spatial configuration of ethanol has D = 3×9
dimensions. The CH3 group (atoms 2,6,7,8) and the OH group (atoms 3,9) can rotate w.r.t. the
middle group (atoms 1,4,5), and the blue and orange lines represent these angles of rotation.
Right: A 2-manifold estimated from 50,000 configurations of the ethanol molecule. The manifold
has the topology of a torus, and the color represents the rotation of the OH group. The sharp
“corners” are distortions introduced by the embedding algorithm (explained in Section 5.1).
Figure 6 shows the original data. This dataset is from Chmiela et al. (2017).
Just as distances in Rd are independent of the choice of basis, and invariant if Rd is a subspace of a larger Euclidean space, distances along curves in a manifold M can be defined solely based on the coordinate charts (U, ϕ), hence intrinsically, without reference to the ambient space RD; being purely geometric, they are independent of the choice of charts. This is achieved through Riemannian geometry, as follows.
In Rd , the scalar product ⟨v, u⟩ = v⊤u is sufficient to define both distances, by ∥v − u∥² = ⟨v − u, v − u⟩, and angles, by ∠(v, u) = cos^{−1}(⟨v, u⟩/(∥v∥∥u∥)). Moreover, any positive definite matrix A ∈ R^{d×d} can induce an inner product by ⟨v, u⟩_A = v⊤Au; in this context, A is often called a metric on Rd .
Riemann took up this idea and introduced the Riemannian metric, which plays the same role
as A above; however, this metric is allowed to vary from point to point, smoothly.
The Riemannian metric g defines, at each point p ∈ M, an inner product ⟨v, u⟩_{g(p)} for all vectors in Tp M. More importantly, infinitesimal quantities such as the line element dl = √( Σ_{i,j=1}^d g_{ij} dx^i dx^j ) and the volume element dV = √(det(g)) dx^1 · · · dx^d are also expressed through the Riemannian metric, allowing one to define lengths of curves and volumes of subsets of M as integrals. These integrals are invariant to the choice of bases in T M, hence to the choice of coordinate charts on M.
Table 1: Three main paradigms for non-linear dimension reduction

Paradigm                        Representation
Linear local                    xi ∈ U ⊂ M −F→ vi ∈ T̂p M ≅ Rd ;  D → d, local coordinates only
Principal Curves and Surfaces   xi ∈ RD −F→ x′i ∈ M ⊂ RD ;  D → D, global coordinates, noise removal
Embedding                       xi ∈ M −F→ yi ∈ F (M) ⊂ Rm ;  D → m, with m ≥ d, global coordinates (or charts)
A smooth map F from (M, g) to another Riemannian manifold with metric h is an isometry if

    for all p ∈ M and v1 , v2 ∈ Tp M,   ⟨v1 , v2 ⟩g(p) = ⟨dFp (v1 ), dFp (v2 )⟩h(F (p))        (1)
An isometry F preserves geometric quantities such as angles, distances, path lengths, volumes, etc. An embedding that is also an isometry is called an isometric embedding.
Ideally, we would like a manifold learning algorithm to produce an embedding F into Rm that is isometric. Here we face one of the most remarkable gaps between mathematical theory and methodology in manifold learning. Although it has long been proved (the Nash embedding theorem (Lee, 2003)) that isometric embedding is possible, no known practical algorithm capable of isometric embedding exists at this time (details and refinements of this statement are in Section 5.5). However, by estimating auxiliary information, working with a non-isometric embedding as if it were isometric is still possible (see Section 5.5).
The Manifold Assumption itself is testable. For example, Fefferman et al. (2016) test whether, given an i.i.d. sample, there exists a manifold M that can approximate this sample with tolerance ε. These results are currently not practically useful, as parameters of the manifold (d, reach, volume), which are usually unknown, must be known or estimated. However, they, as well as Genovese et al. (2012), give us the confidence to develop and use ML algorithms in practice.
The kernel function here is almost universally the Gaussian kernel, defined as K(u) = exp(−u²) (Belkin et al. (2006), Ting et al. (2010), Coifman & Lafon (2006), Singer & Wu (2012), etc.). In the above, h, the kernel width, is another hyperparameter that must be tuned. Note that, even if Ni trivially contained all the data, the similarity Kij vanishes for far-away data points. Therefore, equation (2) effectively defines a radius-neighbor graph with r ∝ h. Hence, a rule of thumb is to select r to be a small multiple of h (e.g., 3–10h).
It is sometimes also useful to use the constant kernel K(u) = 1. Then the similarity matrix K is the same as the unweighted adjacency matrix of the radius-neighbor graph. By construction, K is usually a sparse matrix, which helps accelerate the computation.
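As a concrete illustration, the following is a minimal sketch (ours, not code from the survey) of the radius-neighbor graph with Gaussian kernel weights described above, using scikit-learn; the function name and the default multiplier r = 3h are assumptions.

```python
import numpy as np
from sklearn.neighbors import radius_neighbors_graph

def gaussian_similarity_graph(X, h, r_factor=3.0):
    """Sparse similarity matrix with K_ij = exp(-(||x_i - x_j|| / h)^2) on a radius-neighbor graph."""
    r = r_factor * h                        # rule of thumb: r a small multiple of h
    A = radius_neighbors_graph(X, radius=r, mode="distance", include_self=False)
    K = A.copy()
    K.data = np.exp(-(A.data / h) ** 2)     # Gaussian kernel K(u) = exp(-u^2), with u = distance / h
    return K                                # sparse (n, n) matrix; zero for non-neighbors

# Example usage on synthetic data: X is an (n, D) data matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
K = gaussian_similarity_graph(X, h=0.5)
```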
If D is large, which is often the case, and n is large too, which is necessary for manifold recovery, constructing the neighborhood graph can be the most computationally demanding step of the algorithm. Fortunately, much work has been devoted to speeding up this task, and approximate algorithms are now available, which run in almost linear time in n (the computational complexity can be reduced to O(n^{1+δ}), where 0 < δ < 1 is a constant) and have very good accuracy (Ram et al., 2009).
Next, we briefly describe two of the paradigms in Table 1; then, from Section 4 on, we focus on the third, embedding algorithms.
The weights can be computed using the kernel K() and bandwidth h used for the neighborhood
graph, for example. This weighted PCA procedure is sometimes termed local PCA (lPCA).
Local PCA is only used to understand the geometric structure near a reference point, usually one of the points in D. One can map all the data by performing lPCA at a subset of the data points chosen so that each xi is sufficiently well approximated by its projection. This way, all the data are represented in d-dimensional coordinates. However, some points can have more than one representation, if they are close to two or more reference points. All the representations (i.e., coordinates) are local, and understanding or making inferences on the entire manifold is tedious at best. This approach can be refined into a multiscale local linear approach, which is more
accurate and parsimonious (Chen et al., 2013).
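To make the lPCA step concrete, here is a hedged numpy sketch of weighted local PCA around one reference point; the function name is ours, and centering at the reference point itself (rather than at a weighted mean) is a simplifying choice.

```python
import numpy as np

def local_pca(X, i, h, d):
    """d-dimensional local coordinates of the data around the reference point X[i]."""
    diffs = X - X[i]                                          # center the data at the reference point x_i
    w = np.exp(-(np.linalg.norm(diffs, axis=1) / h) ** 2)     # Gaussian kernel weights
    C = (diffs * w[:, None]).T @ diffs                        # weighted local covariance, D x D
    _, eigvecs = np.linalg.eigh(C)
    T = eigvecs[:, -d:]                                       # top-d directions: estimated tangent basis
    return diffs @ T                                          # local coordinates; meaningful for nearby points
```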
maximum, but in between, the ridges are manifolds (Chen et al., 2015) if the density fx is smooth
enough. This concept can be extended to principal surfaces, and principal d-manifolds.
The ridge can be estimated with the Subspace Constrained Mean Shift (SCMS) algorithm
(for more details, we refer the reader to Ozertem & Erdogmus (2011)). The SCMS algorithm
maps each xi to a point yi ∈ RD lying on the principal curve (or d-manifold). Hence, as a
manifold estimation algorithm, this method does not reduce dimension. However, unlike local
linear maps, the output is in a global coordinate system, i.e. RD .
Usually, the ridge does not coincide with the mean of the data; the bias depends on the
curvature of the manifold: the density is higher on the “inside” of the curve. However, owing to their smoothing property, principal d-manifolds are remarkably useful in the analysis of manifold estimation in noise (Genovese et al., 2012, Mohammed & Narayanan, 2017).
We have quickly reviewed two simple methods for manifold estimation: local linear approximation reduces the dimension locally, but offers no global representation, while principal curves produce a global representation but do not reduce dimension. It is time to focus on the third class of algorithms, those that produce embeddings: representations that are both global and low dimensional.
4 Embedding algorithms
The term “manifold learning” was proposed in the seminal works of two algorithms, LLE (Roweis & Saul, 2000) and Isomap (Tenenbaum et al., 2000), which together inaugurated the modern era of non-linear dimension reduction. In this section, we introduce classical manifold learning algorithms that (attempt to) find a global embedding Y ∈ Rn×m of the data set D.
We separate the algorithms, roughly, into “one-shot” algorithms, which obtain embedding co-
ordinates from the principal eigenvectors of some matrix derived from the neighborhood graph, or
by solving some other global (usually convex) optimization problem, and “attraction-repulsion”
algorithms, which proceed from an initial embedding Y (often produced by a one-shot algorithm)
and improve it iteratively. While this taxonomy can rightly be called superficial, at present, it
represents a succinct and relatively accurate summary of the state of the art.
No matter what the approach, given the neighborhood information summarized in the weighted
neighborhood graph, an embedding algorithm’s task is to produce a smooth mapping of the in-
puts which distorts the neighborhood information as little as possible. The algorithms that
follow differ in their choice of information to preserve, and in the sometimes implicit constraints
on smoothness.
Then we can write the PCA problem as

    min_{T ∈ R^{D×d}, T⊤T = I_d}  Σ_{i=1}^n ∥xi − TT⊤ xi ∥²  =  min_{T ∈ R^{D×d}, T⊤T = I_d}  ∥X − XTT⊤ ∥²_F        (4)
Consider the singular value decomposition of X truncated to the first d singular values, X ≈ UΣV⊤, where U ∈ R^{n×d} and V ∈ R^{D×d} have orthonormal columns and Σ is a d × d diagonal matrix; the solution of this problem is T = V. The low dimensional representation of the original data is Y = XT = UΣ; these are also called the principal components. In the terminology of PCA, the columns of V are called principal vectors. They characterize the directions that explain the variance in the data.
When the data xi are centered, the (unnormalized) sample covariance matrix of the data is C = X⊤X. The solution to PCA can also be found by the eigendecomposition of C: the first d eigenvectors of C form the matrix V. If the dimension D ≫ n, it is easier to first compute the n × n Gram matrix G = XX⊤ and perform a truncated eigendecomposition G = UΣ²U⊤; the low dimensional representation is then Y = UΣ (which equals XV).
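For completeness, a short numpy sketch (ours) of PCA via the truncated SVD, as in equation (4):

```python
import numpy as np

def pca(X, d):
    """Return the d principal components Y and the principal vectors V of the data matrix X."""
    Xc = X - X.mean(axis=0)                             # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:d].T                                        # principal vectors, D x d
    Y = Xc @ V                                          # principal components; equals U[:, :d] * S[:d]
    return Y, V
```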
Algorithm 1 Isomap
Input: neighborhood distance matrix A, embedding dimension m
1: Compute the shortest path distance matrix Ã, where Ã_ij = A_ij if A_ij < ∞, and Ã_ij = the shortest path distance in the neighborhood graph between i and j if A_ij = ∞.
Intuitively, the shortest graph distance is a good approximation to the geodesic distance in a neighborhood, provided that the data are sufficiently dense in this region and the neighborhood size is appropriately chosen (Bernstein et al., 2000). In the limit of large n, Isomap was shown to produce isometric embeddings for m = d whenever the data manifold is flat, i.e. admits an isometric embedding in Rd , and the data region is convex. Empirically, Isomap embeddings are close to isometric also when m > d and m is sufficient for isometric embedding.
The computational complexity of Isomap is O(n³), with most of the computational burden in computing all pairwise shortest path distances. The space complexity is O(n²). Since Isomap works with dense matrices, this space complexity cannot be improved.
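A hedged sketch of the full Isomap pipeline (k-nearest neighbor graph, graph shortest paths, then classical multidimensional scaling); the library calls, the choice k = 10, and the assumption that the neighborhood graph is connected are ours.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def isomap(X, m, k=10):
    A = kneighbors_graph(X, n_neighbors=k, mode="distance")      # sparse neighborhood distances
    A_tilde = shortest_path(A, method="D", directed=False)       # graph shortest-path distances
    n = A_tilde.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n                          # centering matrix
    B = -0.5 * J @ (A_tilde ** 2) @ J                            # classical MDS Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    idx = np.argsort(eigvals)[::-1][:m]                          # top-m eigenpairs
    return eigvecs[:, idx] * np.sqrt(np.maximum(eigvals[idx], 0.0))
```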
There are variants of Isomap that improve it in different ways: Hessian Eigenmaps (Donoho & Grimes, 2003) handles non-convex data regions by introducing the Hessian operator;
Continuum Isomap (Zha & Zhang, 2007) generalizes Isomap to a continuous version such that
out-of-sample extension of Isomap is possible.
2: Discard v^0 .
3: Represent each xj by yj = (v_j^1 , · · · , v_j^m )⊤
Output: Y
To construct a graph Laplacian matrix, let d_i = Σ_{j∈Ni} Kij represent the degree of node i and D = diag{d1 , · · · , dn }. Then multiple choices of graph Laplacian exist:
Output: L = (I − P)/h²
Why one Laplacian rather than another? The reason is that, even though in many simple examples the difference is hard to spot, one needs to ensure that, as more samples are collected, the limit of these L’s is well defined and the embedding algorithm is unbiased. It is easy to see that Lnorm and Lrw are similar matrices. Moreover, whenever the degrees di are constant, L = Lrw ∝ Lun , hence all Laplacians should produce the same embedding. The difference appears when the data density is non-uniform, making some xi be surrounded more densely by other data points than others. The seminal work of Coifman & Lafon (2006), which introduced renormalization, showed that in this case the eigenvectors of Lnorm , Lrw are biased by the sampling density, and that renormalization removes this bias. Section 5.4 and Figure 6 illustrate this.
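A sketch of the Laplacian variants above, computed from a (dense, for clarity) kernel matrix K; the renormalized Laplacian follows the construction of Coifman & Lafon (2006), and the function name is ours.

```python
import numpy as np

def graph_laplacians(K, h):
    n = K.shape[0]
    d = K.sum(axis=1)                                     # node degrees d_i
    D_inv = np.diag(1.0 / d)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_un = np.diag(d) - K                                 # unnormalized Laplacian
    L_norm = np.eye(n) - D_inv_sqrt @ K @ D_inv_sqrt      # symmetric normalized Laplacian
    L_rw = np.eye(n) - D_inv @ K                          # random-walk Laplacian
    K_tilde = D_inv @ K @ D_inv                           # renormalization: divide out the density estimate
    P = np.diag(1.0 / K_tilde.sum(axis=1)) @ K_tilde      # row-stochastic transition matrix
    L_renorm = (np.eye(n) - P) / h ** 2                   # L = (I - P) / h^2, as output above
    return L_un, L_norm, L_rw, L_renorm
```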
The idea of spectral embedding appeared in Shi & Malik (2000), Belkin & Niyogi (2002) as a trick for clustering, and was then generalized into a data representation method, LE, in Belkin & Niyogi (2003). These works connect the graph Laplacian with the famous Laplace-Beltrami operator ∆M of the manifold, a differential operator that plays an important role in modern differential geometry (Rosenberg, 1997). Estimating the Laplace-Beltrami operator itself is an important geometric estimation problem that will be reviewed in Section 5.4.
Because L is sparse, DM/LE are computationally less challenging when compared with
Isomap.
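A minimal sketch of this spectral embedding step, assuming a symmetric Laplacian (e.g. Lun or Lnorm); for very large problems a shift-invert eigensolver would be preferable to which="SM".

```python
import numpy as np
from scipy.sparse.linalg import eigsh

def spectral_embedding(L, m):
    """Embed using the m nontrivial eigenvectors of L with smallest eigenvalues."""
    eigvals, eigvecs = eigsh(L, k=m + 1, which="SM")   # m+1 smallest eigenpairs of the sparse Laplacian
    order = np.argsort(eigvals)
    return eigvecs[:, order[1:m + 1]]                  # discard the constant eigenvector v^0
```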
Here x and Q are the translation and rotation that parametrize this affine transformation, and θj is the local coordinate of each neighbor xj projected onto this linear subspace.
In the second stage of LTSA, one obtains global embedding coordinates Y that preserve the local geometry information captured by the θj , by minimizing a global reconstruction error

    min_{ {yi }_{i=1}^n , {Pi }_{i=1}^n }  Σ_{i=1}^n Σ_{j∈Ni} ∥ yj − ỹi − Pi θ_j^{(i)} ∥²        (6)
The optimization in both stages can be transformed into eigenvalue problems. The algorithmic procedure of LTSA is displayed in Algorithm 4.
We have seen three embedding algorithms so far in this section: Isomap, LE, and LTSA, which all come with certain performance guarantees. On the other hand, locally linear embedding (LLE, Roweis & Saul (2000)) is a heuristic method that relies on a very similar idea: estimate local representations first and then align them globally. However, vanilla LLE does not perform well empirically and lacks theoretical guarantees (Ting et al., 2010, Ting & Jordan, 2018).
4.3 “Horseshoe” effects, neighbor embedding algorithms, and selecting
independent eigenvectors
Algorithms that use eigenvectors, such as DM, are among the most promising and well-studied
in ML (see Sections 5.1,5.2,5.4). Unfortunately, such algorithms fail when the data manifold has
a large aspect ratio (such as a long, thin strip, or a thin torus). This problem has been called
the Repeated Eigendirection Problem (REP) and has been demonstrated for LLE, LE, LTSA,
HE (Goldberg et al., 2008), and is pervasive in real data sets.
From a differential geometric standpoint, the REP is a drop in the rank of the embedding Jacobian, due to eigenvectors (or eigenfunctions, in the limit) that are harmonics of previous ones, as shown in Figure 3. For example, for a rectangular strip (and a finite sample), the scatterplot of (v_i^1 , v_i^2 )_{i=1,...,n} from step 3 of the DM follows a parabola; hence, it is a 1-dimensional mapping, even though the rectangle is 2-dimensional. This fact is a useful diagnostic for the REP in practice: when an embedding looks like a “horseshoe”, this may not reflect a property of the data, but an artifact signalling that one of the data dimensions is collapsed, or poorly reflected in the embedding (Diaconis et al., 2008).
Uniform manifold approximation and projection (UMAP, McInnes et al. (2018)) is another popular heuristic method. At a high level, UMAP minimizes the mismatch between topological representations of the high-dimensional data set {xi }_{i=1}^n and its low-dimensional embedding {yi }_{i=1}^n . Theoretical understanding of UMAP is still very limited.
t-SNE has the advantage of being sensitive to local structure and to clusters in data (Lin-
derman & Steinerberger, 2019, Kobak et al., 2020) (but does not explicitly preserve the global
structure). We note that this propensity for finding clusters comes partly from the choice of
neighborhood graph (Section 5.1). However, this is not the whole story. Recently, it has been
shown that this property stems from the gradient of the loss function Lt-SNE , which has the form

    ∂Lt-SNE /∂yi  =  − Σ_j Vij Wij (yi − yj )  +  Σ_j (Wij /(ρ wtot )) (yi − yj ).        (8)
In the above, the first term is an attraction between graph neighbors, while the second represents
repulsive forces between the embedded points y1:n (Böhm et al., 2022, Zhang et al., 2022).
Note the additional important parameter ρ, which controls the trade-off between attraction and repulsion (ρ corresponds to a version of the cost with wtot^{1/ρ} in the second term). In Böhm et al. (2022) it is shown that varying ρ from small to large values decreases the cluster separation and makes the embedding more similar to the LE embedding. Moreover, quite surprisingly, Böhm et al. (2022) show that by varying ρ, t-SNE can emulate a variety of other algorithms, most notably UMAP (McInnes et al., 2018) and ForceAtlas (Jacomy et al., 2014). Other works that analyze the attraction-repulsion behavior of t-SNE include Zhang & Steinerberger (2021). One yet unsolved issue with t-SNE is the choice of the number of neighbors k. Most applications use the default k = 90 (Poličar et al., 2019); this choice, as well as other behaviors of this class of algorithms, are discussed in Zhang et al. (2022).
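To illustrate the attraction-repulsion mechanism, here is a schematic, deliberately simplified force-based update in the spirit of equation (8), with Cauchy similarities Wij in the embedding space and input affinities Vij; it omits the many details (early exaggeration, momentum, tree approximations) of a real t-SNE implementation, and all names are ours.

```python
import numpy as np

def attraction_repulsion_step(Y, V, rho=1.0, lr=0.1):
    """One update: pull embedded points toward graph neighbors, push away from all other points."""
    diff = Y[None, :, :] - Y[:, None, :]                   # diff[i, j] = y_j - y_i
    W = 1.0 / (1.0 + (diff ** 2).sum(-1))                  # Cauchy (Student-t) similarities
    np.fill_diagonal(W, 0.0)
    w_tot = W.sum()
    attract = ((V * W)[:, :, None] * diff).sum(axis=1)             # toward graph neighbors, weight V_ij W_ij
    repel = -((W / (rho * w_tot))[:, :, None] * diff).sum(axis=1)  # away from all points, strength 1/rho
    return Y + lr * (attract + repel)
```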
Note also that since the REP can be interpreted as extreme distortion, Riemannian Relaxation (Perrault-Joncas & Meila (2014), Section 5.5) can also be used to improve the conditioning of an embedding in an iterative manner.
Finally, in Maximum Variance Unfolding (MVU), proposed in Weinberger & Saul (2006) and analyzed in Arias-Castro & Pelletier (2013), repulsion is implemented via a semidefinite program, hence the embedding Y is obtained by solving a convex optimization. This algorithm can be seen both as a one-shot and as an attraction-repulsion algorithm; Diaconis et al. (2008) show that MVU is related to the fastest mixing Markov chain on the neighborhood graph.
Figure 3: Embedding algorithms failing to find a full rank mapping when they greedily select the first m = 2 eigenvectors, and correction by a more refined choice of eigenvectors. Top row: embeddings of galaxy spectra from the SDSS (Section 6) by DM; middle: the “horseshoe” when the first 2 eigenvectors are used; right: the same data, with selection of the second eigenvector (in this case by Chen & Meila (2021)). Bottom row: embeddings of a swiss roll with length 7 times the width. Left: first 2 eigenvectors from DM/LE; middle: after UMAP. Note that UMAP by itself is not able to produce a full-rank embedding everywhere; the horseshoe, the two clusters, and the 1-dimensional “filament” between them are all artifacts. Right: UMAP with selection of the second eigenvector by Chen & Meila (2021). Plots by Yu-Chia Chen.
avoids REP, but it is a first step towards the algorithmic use of charts and atlases to complement
global embeddings.
In summary, attraction-repulsion algorithms such as t-SNE, which are heuristic, enjoy large
popularity due in part to their immunity to the REP, while eigenvector based methods, although
better grounded in theory, are less useful in practice without post-processing by an IES method.
On the other hand, unlike global search in eigenvector space, a local relaxation algorithm cannot
resolve the rank deficiency globally, and it may become trapped in a local optimum (Figure 3).
Figure 5: Embeddings obtained by (a) Isomap, (b) LE, and (c) LLE (the algorithms in this section) on a chopped torus data set with n = 14,519 points. This manifold cannot be embedded isometrically in d = 2 dimensions.
Here we discuss in more general terms what is known about graph construction methods (Section 5.1), the neighborhood scale (Section 5.2), and the intrinsic dimension (Section 5.3). We revisit the estimation of the Laplacian (a normalized version of the neighborhood graph), as the natural representation of the manifold geometry and the basis for the Diffusion Maps embedding, which can be seen as the archetypal embedding (Section 5.4). Finally, in Section 5.5, we turn to mitigating the distortions that embedding algorithms currently induce.
Theoretical results/Asymptotic results and what they mean  The asymptotic results of Giné & Koltchinskii (2006), Hein et al. (2007), Ting et al. (2010) and Singer (2006) provide the necessary rates of change for h with respect to n to guarantee convergence of the respective estimate. For instance, Singer (2006) proves that the optimal bandwidth parameter for Laplacian estimation is h ∼ n^{−1/(d+6)} when using a random-walk Laplacian. For the k-nearest neighbor graph, Calder & Trillos (2019) show that, again for Laplacian estimation, the number of neighbors k must grow slowly with n, and a recommended rate is k ∼ n^{4/(d+4)} (log n)^{d/(d+4)}. The hidden constant factors in these results are not completely known, but they depend on the manifold volume, the curvature, and the injectivity radius τ (typically not known in practice).
With these rate-wise optimal selections of k or r, convergence rates for the estimation of various objects on the manifold can be established. However, all are non-parametric rates. More specifically, they imply that the sample size n must grow exponentially with the intrinsic dimension d.
Figure 6: Effects of graph construction and renormalization, when the sampling density is highly
non-uniform, exemplified on the configurations of the ethanol molecule. Left: original data, after
preprocessing, is a noisy torus, with three regions of high density, around local minima of the
potential energy. Center: Embeddings by DM (purple), and by the same algorithm with L
constructed from the k-nearest neighbor graph (yellow). The low density (sparse) regions are stretched,
while the dense regions appear like “corners” of the embedding. Note that DM should remove
the effects of the density; in this case, the variations in density are so extreme that the effect
persists. The effect is somewhat stronger for the k-nearest neighbor graph. Right: Embedding
by DM (purple) and by LE (yellow), which uses the singly normalized Lrw .
at some point xi . The method is specific to the estimation of the Laplace-Beltrami operator,
but in this context, it can be extended to optimization over other parameters, such as kernel
smoothness.
Finally, we mention the dimension estimation algorithm proposed in Chen et al. (2013); a by-product of this algorithm is a range of scales where the tangent space at a data point is well aligned with the principal subspace obtained by a local singular value decomposition. As these are scales at which the manifold looks locally linear, one can reasonably expect that they are also the correct scales at which to approximate the manifold.
Principles and methods for estimating d  An idea that appears in various forms throughout the dimension estimation literature is to find a local statistic that scales with d by a known law. For example, the volume of a ball Br of radius r contained in a manifold M is proportional to r^d . If we take n samples uniformly from M, the number of samples contained in Br , denoted #Br , is proportional to n r^d , or equivalently log #Br ≈ d log r + constant. This suggests that if we fit a line to (log r, log #Br ), the slope of the line estimates d.
Recall that k_{i,r} represents the number of radius-r neighbors of data point xi . Hence log k_{i,r} ≈ d log r + constant. This is the idea of Grassberger & Procaccia (1983), who introduced the correlation dimension estimator given by

    d̂_C = lim_{r→0}  log( (1/n) Σ_{i≠i′} 1_{∥xi −xi′ ∥≤r} ) / log r        (10)

In the above, (1/n) Σ_{i≠i′} 1_{∥xi −xi′ ∥≤r} , where the sum is taken over unordered pairs, is nothing else but (1/(2n)) Σ_{i=1}^n (k_{i,r} − 1); hence, the correlation dimension uses an average number of neighbors. This estimator is easily computed for the radius neighborhood graph. To sidestep the inconvenient assumption that the sample is uniform over M, other methods consider statistics such as k_{i,2r}/k_{i,r} ≈ 2^d , which lead to so-called doubling dimensions (Assouad, 1983).
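A sketch of the correlation-dimension idea in practice: rather than taking the limit in (10), one fits the slope of the log average neighbor count against log r over a range of radii; the function name and the least-squares fit are our choices.

```python
import numpy as np
from scipy.spatial.distance import pdist

def correlation_dimension(X, radii):
    dists = pdist(X)                                   # distances over unordered pairs
    counts = np.array([(dists <= r).sum() / len(X) for r in radii])
    # radii should be large enough that every count is positive
    slope, _ = np.polyfit(np.log(radii), np.log(counts), 1)
    return slope                                       # estimated intrinsic dimension d
```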
Similarly, the covering number ν(r), representing the minimum number of boxes (in this instance) of size r needed to cover a manifold, scales like r^{−d} for r small. The Box Counting dimension (Falconer, 2003a) of an object is defined as

    d̂_BC = lim_{r→0}  ln ν(r) / ln(1/r).        (11)
If ν(r) is defined by way of balls, the above becomes the well-known Hausdorff dimension (Falconer, 2003b). When d̂_BC is estimated from data, the covering number represents the number
of boxes (balls) to cover the data set. Note that for finite n, r cannot become too small, as in
this case, every ball or box will contain a single point. The finite radius r is a scale parameter
trading off bias (which increases with r), and variance (which decreases with r).
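In the same spirit, a hedged sketch of a box-counting estimate: count occupied grid cells of side r at several scales and fit the slope in (11); the grid-based counting is our simplification.

```python
import numpy as np

def box_counting_dimension(X, scales):
    counts = []
    for r in scales:
        occupied = np.unique(np.floor(X / r), axis=0)   # occupied boxes of side r
        counts.append(len(occupied))
    slope, _ = np.polyfit(np.log(1.0 / np.asarray(scales)), np.log(counts), 1)
    return slope                                        # estimate of the box counting dimension
```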
All the above estimates converge to the intrinsic dimension d when the data is sampled
from a d-manifold. In practice, and with noise, their properties differ, as well as the amount of
computation they need.
Modern estimators consider other statistics, such as distance to k-th nearest neighbor (Pettis
et al., 1979, Costa et al., 2005), the volume of a spherical cap (Kleindessner & von Luxburg,
2015) (both statistics can be computed without knowing actual distances, just comparisons
between them), or Wasserstein distance between two samples of size n on M, which scales like
n−1/d (Block et al., 2022); the algorithm of Levina & Bickel (2004), analyzed in Farahmand et al.
(2007), proposes a Maximum Likelihood method based on k-nearest neighbor graphs.
An algorithm for dimension estimation in noise is proposed by Chen et al. (2013). This
algorithm is based on local PCA at multiple scales; here, d̂_L is the most frequent index of
the maximum eigengap of the local covariance matrix. The main challenge is to establish the
appropriate range of scales r at which the d principal values of the local covariances separate
from the remaining eigenvalues, in noise. The algorithm can be simplified by plugging in the
neighborhood radius selected to optimize the Laplacian estimator, by e.g. Joncas et al. (2017),
see Section 5.2.
• Pointwise convergence: E[(D^{−1} Lun f )_i ] → c L∞ f (xi ) + O(h²) as n → ∞.
• Spectral convergence: E[ f⊤Lun f / f⊤Df ] → c ∫_{x∈M} f (x)(L∞ f )(x) q(x) dµ(x) / ∫_{x∈M} f²(x) q(x) dµ(x) + O(h²) as n → ∞; this type of convergence matters for spectral embedding algorithms.
When the sampling density q is uniform, Belkin & Niyogi (2007) showed that pointwise convergence of the random-walk Laplacian holds with L∞ = ∆M for a neighborhood graph with bandwidth h. Coifman & Lafon (2006) showed that pointwise convergence holds with L∞ = ∆M − ∆M q/q when q is not uniform. Through renormalization, L as in Algorithm 3 eliminates the bias term ∆q/q and converges to the Laplace-Beltrami operator regardless of the sampling density. Ting et al. (2010) further showed that for the k-nearest neighbor graph, the random-walk graph Laplacian converges pointwise to ∆M rescaled by q^{2/d} . For spectral convergence, readers are encouraged to consult Belkin & Niyogi (2007), Berry & Sauer (2019), Garcı́a Trillos & Slepčev (2018), Garcı́a Trillos et al. (2020).
Recently, the limits of a class of manifold learning algorithms, viewed as differential operators, have been studied. For a specific type, called linear smoothing algorithms, these ML algorithms are proven to converge to second-order differential operators on M. For example, LE and DM converge to the Laplacian operator, while LTSA and Hessian Eigenmaps both converge to the Frobenius norm of the Hessian. Unregularized LLE, on the other hand, fails to converge to any differential operator. Details can be found in Ting & Jordan (2018).
(2013) approach global isometry by means of constructing normal coordinates recursively from a
point p ∈ M, or, respectively, by mutually orthogonal parallel vector fields, and Verma (2011) is
the first attempt to implement Nash’s construction. We note that, with the exception of Verma
(2011), these methods do not guarantee an isometric embedding except in limited special cases.
In the above, dF_p^† is the pseudoinverse of dFp . In matrix notation, (12) implies that

    g̃(F (p)) ≡ ((dFp )⊤ )† g(p) (dFp )†        (13)

with dFp , g(p), g̃(p) matrices of size d × m, d × d and m × m respectively, and g(p), g̃(p) positive semidefinite matrices of rank d. When M ⊂ RD , with the metric inherited from the ambient space, g(p) = I_d is the unit matrix and g̃(p) = ((dFp )⊤ )† (dFp )† . Comparing (12) with (1), it is easy to see that (F (M), g̃) is isometric with the original (M, g).
Hence, if one computes for each embedding point yi the respective pushforward metric g̃i ∈
Rm×m , then all geometric quantities computed with the points y1 , . . . yn w.r.t. g̃ would preserve
their values in the original data, subject only to sampling noise.
It remains to see how to estimate g̃. A direct way is via (13), using an estimator of dF (p). Another method, of Perraul-Joncas & Meila (2013), is via the Laplace-Beltrami operator ∆M , namely using the Diffusion Maps Laplacian, whose properties and consistency are well studied, as seen in Section 5.4. To extract g̃, Perraul-Joncas & Meila (2013) apply ∆M to a suitably chosen set of test functions f_{kl} , with 1 ≤ k ≤ l ≤ m, where f_{kl,p} = (F^k − F^k (p))(F^l − F^l (p)) are pairwise products of coordinate functions, centered at point p. They show that (1/2) ∆M f_{kl,p} |_p = h̃_{kl} (p), the k, l entry of the inverse pushforward metric at p (algorithmically, on a sample, this operation can be easily vectorized). To obtain g̃_i , the metric at data point i, one computes the rank-d pseudo-inverse (Adi Ben-Israel, 2003) of h̃_i by SVD. Note that h̃_i itself measures the local
distortion at data point i. The embedding metric g̃ and its SVD offer other insights into
the embedding. For instance, the singular values of g̃ may offer a window into estimating d
by looking for a “singular value gap”. The d singular vectors form an orthonormal basis of the
tangent space Tp F (M) at point p = yi , providing a natural framework for constructing a normal
coordinate chart around p. The non-zero singular values of h̃i yield a measure of the distortion
induced by the embedding around the data point xi (indeed, if the embedding were isometric
to M with the metric inherited from Rm , then the embedding metric g̃ would have exactly d
singular values equal to 1).
This last remark can be used in many ways, such as computing a global distortion for the embedding, and hence as a tool to compare various embeddings. It can also be used to define an objective function to minimize in order to obtain a more isometric embedding, such as in the Riemannian Relaxation of McQueen et al. (2016).
Figure 7: The embeddings from Figure 5, with the distortion h̃ estimated at a random subset of
points.
will be K − 1 eigenvectors indicating the clustering, and for each cluster, additional eigenvectors
for a low-dimensional mapping of the data in the respective cluster. If fewer eigenvectors are
used, then usually the clusters will be recovered but not the intrinsic geometry inside each cluster.
Chemistry  The accurate simulation of atomic and molecular systems plays a major role in modern chemistry. Molecular Dynamics (MD) simulations from carefully designed, complex quantum models can take millions of computer hours; however, simulations can still be less expensive than conducting experiments, and they return data at a level of detail not achievable in most experiments. Manifold learning is used to discover collective coordinates, i.e. low dimensional descriptors that approximate well the larger scale behavior of atomic, molecular, and other large particle systems (Boninsegna et al., 2015, A. et al., 2012, Noé & Clementi, 2017). In these examples, the systems can be in equilibrium, or evolving in time; in the latter case, the collective coordinates describe the saddle points in the trajectory, or the folding mechanism of a large molecule (Rohrdanz et al., 2011, Das et al., 2006).
Manifold embedding is also used to create low dimensional maps of families of molecules and
materials by the similarity of their properties (Ceriotti et al., 2013, Isayev et al., 2015).
Biological sciences  In neuroscience and the biological sciences, manifold embeddings are widely used to summarize neural recordings (Connor & Rozell, 2016, Cunningham & Yu, 2014) and to describe cell evolution (Herring et al., 2018).
7 Conclusion
In practice, ML is overwhelmingly used for visualization (Section 6) and with small data sets. But ML can do much more. Efficient software now exists (McQueen et al. (2016), Poličar et al. (2019), etc.) which can embed truly large, high-dimensional data (for example the SDSS). In these cases, ML helps practitioners understand the data, e.g. by its intrinsic dimension, or by interpreting the manifold coordinates (Koelle et al., 2022, Boninsegna et al., 2015, Vanderplas & Connolly, 2009). For real data, a manifold learning algorithm has the effect of smoothing the data and suppressing/removing variation orthogonal to the manifold, which can be regarded as noise, just like in PCA. Finally, again similarly to PCA, ML can effectively reduce the data to
m ≪ D dimensions, while preserving features predictive for future statistical inferences. Some
inferences, such as regression, can be performed on manifold data without manifold estimation, by
for example, local linear regression (Aswani et al., 2011), or via Gaussian Processes (Borovitskiy
et al., 2020). A GP on a manifold can be naturally defined via the Laplacian ∆M .
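As a sketch of this idea, a Gaussian process kernel on the sampled manifold can be assembled from graph Laplacian eigenpairs with a Matérn-type spectral density, in the spirit of Borovitskiy et al. (2020); the particular spectral form and all parameter names here are our assumptions.

```python
import numpy as np

def manifold_gp_kernel(L, num_eig=50, kappa=1.0, s=2.0):
    """Covariance between the sample points, built from the spectrum of a graph Laplacian L."""
    eigvals, eigvecs = np.linalg.eigh(L)              # dense eigendecomposition, small-n illustration
    lam, phi = eigvals[:num_eig], eigvecs[:, :num_eig]
    spectral = (kappa ** 2 + lam) ** (-s)             # Matern-like spectral density on the eigenvalues
    return (phi * spectral) @ phi.T                   # K[i, j] = sum_k spectral_k phi_k(x_i) phi_k(x_j)
```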
Even when only visualization is desired, care must be taken to ensure the reproducibility of the results. The implicit assumption that m = 2 is sufficient for embedding the data should be validated. Attention must also be paid when the embedding is interpreted: are the features observed really in the data, or are they artifacts of the algorithm?
What we omitted We surveyed the state-of-the-art knowledge on the main problems and
methods of manifold learning, focusing on the algorithms that are proven to recover the manifold
structure through learning a smooth embedding.
Among the topics we had to leave out, manifold learning in noise is perhaps the most important one. Noise makes ML significantly more difficult, by introducing biases and slowing the convergence of estimators. This is an active area of research, but the estimation of geometric quantities like the tangent space and the reach in the presence of noise has been studied (Aamari & Levrard (2018, 2019), etc.); the theoretical results on manifold recovery in noise were mentioned in Section 3.
The reach, or injectivity radius τ (M), of a manifold measures how close to itself M can be. In other words, τ (M) is the largest radius a ball can have, so that, for any p ∈ M, if the ball is tangent to the manifold at p, it does not intersect M in any other point. Large τ implies smaller curvature (a subspace has infinite τ ) and easier estimation of M (Genovese et al., 2012, Fefferman et al.,
2016, Aamari & Levrard, 2018, 2019). A manifold can have a boundary; ML for manifolds with boundary is studied for example in Singer & Wu (2012); different convergence rates appear when data are sampled close to the boundary.
Another useful task is embedding a new data point x ∈ RD into an existing embedding F (M); this is often called Nyström embedding (e.g. Chatalic et al. (2022)). Conversely, if y ∈ Rm is a new point in the embedding F (M), obtained e.g. by following a curve in the low dimensional representation of M, how do we map it back to RD ? This is usually done by interpolation.
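A hedged sketch of a Nyström-style out-of-sample extension for a spectral embedding: the new point's coordinates are kernel-weighted combinations of the training eigenvectors, divided by the corresponding eigenvalues of the transition matrix P; the function name and the Gaussian kernel choice are ours.

```python
import numpy as np

def nystrom_extend(x_new, X, eigvecs, eigvals, h):
    """Extend an embedding given by eigenpairs of the row-stochastic P to a new point x_new."""
    w = np.exp(-(np.linalg.norm(X - x_new, axis=1) / h) ** 2)   # kernel similarities to training points
    w = w / w.sum()                                             # normalize: a new row of P
    return (w @ eigvecs) / eigvals                              # y_new^k = (1/lambda_k) sum_j p(x_new, x_j) v_j^k
```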
References
A. TG, Ceriotti M, Parrinello M. 2012. Using sketch-map coordinates to analyze and bias molec-
ular dynamics simulations. Proceedings of the National Academy of Sciences, USA 109:5196–5201
Aamari E, Levrard C. 2018. Stability and minimax optimality of tangential delaunay complexes
for manifold reconstruction. Discrete & Computational Geometry 59(4):923–971
Aamari E, Levrard C. 2019. Nonasymptotic rates for manifold, tangent space and curvature
estimation. Ann. Stat. 47(1):177–204
Adi Ben-Israel TNEG. 2003. Generalized inverses: Theory and applications. New York: Springer
New York, NY
Altan E, Solla SA, Miller LE, Perreault EJ. 2020. Estimating the dimensionality of the manifold
underlying multi-electrode neural recordings. bioRxiv
Arias-Castro E, Pelletier B. 2013. On the convergence of maximum variance unfolding. Journal
of Machine Learning Research 14:1747–1770
Baraniuk RG, Wakin MB. 2009. Random projections of smooth manifolds. Foundations of Com-
putational Mathematics 9(1):51–77
Belkin M, Niyogi P. 2002. Laplacian eigenmaps and spectral techniques for embedding and clus-
tering, In Advances in Neural Information Processing Systems 14. Cambridge, MA: MIT Press
Belkin M, Niyogi P. 2003. Laplacian eigenmaps for dimensionality reduction and data represen-
tation. Neural Computation 15(6):1373–1396
Belkin M, Niyogi P. 2007. Convergence of laplacian eigenmaps. In Advances in Neural Information
Processing Systems 19, eds. B Schölkopf, JC Platt, T Hoffman. MIT Press, 129–136
Belkin M, Niyogi P, Sindhwani V. 2006. Manifold regularization: A geometric framework
for learning from labeled and unlabeled examples. Journal of Machine Learning Research
7(85):2399–2434
Bérard P, Besson G, Gallot S. 1994. Embedding Riemannian manifolds by their heat kernel.
Geometric Functional Analysis 4(4):373–398
Bernstein M, de Silva V, Langford JC, Tenenbaum J. 2000. Graph approximations to geodesics
on embedded manifolds. http://web.mit.edu/cocosci/isomap/BdSLT.pdf
Berry T, Harlim J. 2016. Variable bandwidth diffusion kernels. Applied and Computational Har-
monic Analysis 40(1):68–96
Berry T, Sauer T. 2019. Consistent manifold representation for topological data analysis
Block A, Jia Z, Polyanskiy Y, Rakhlin A. 2022. Intrinsic dimension estimation using wasserstein
distance. Journal of Machine Learning Research 23(313):1–37
Chmiela S, Tkatchenko A, Sauceda H, Poltavsky I, Schütt KT, Müller KR. 2017. Machine learn-
ing of accurate energy-conserving molecular force fields. Science Advances
Coifman RR, Lafon S. 2006. Diffusion maps. Applied and Computational Harmonic Analysis
30(1):5–30
Coifman RR, Lafon S, Lee A, Maggioni, Warner, Zucker. 2005. Geometric diffusions as a tool
for harmonic analysis and structure definition of data: Diffusion maps, In Proceedings of the
National Academy of Sciences, pp. 7426–7431
Connor M, Rozell C. 2016. Unsupervised learning of manifold models for neural coding of phys-
ical transformations in the ventral visual pathway, In Neural Information Processing Systems
(NIPS) Workshop, Brains and Bits: Neuroscience Meets Machine Learning. Barcelona, Spain
Costa J, Girotra A, Hero A. 2005. Estimating local intrinsic dimension with k-nearest neighbor
graphs, In IEEE/SP 13th Workshop on Statistical Signal Processing, 2005, pp. 417–422
Cunningham JP, Yu BM. 2014. Dimensionality reduction for large-scale neural recordings. Nature
Neuroscience 16:1500–1509
Dsilva CJ, Talmon R, Coifman RR, Kevrekidis IG. 2018. Parsimonious representation of non-
linear dynamical systems through manifold learning: A chemotaxis case study. Appl. Comput.
Harmon. Anal. 44(3):759–773
Dsilva CJ, Talmon R, Gear CW, Coifman RR, Kevrekidis IG. 2016. Data-driven reduction
for a class of multiscale fast-slow stochastic dynamical systems. SIAM J. Appl. Dyn. Syst.
15(3):1327–1351
Falconer K. 2003a. Alternative definitions of dimension, chap. 3. John Wiley & Sons, Ltd, 39–58
Falconer K. 2003b. Hausdorff measure and dimension, chap. 2. John Wiley & Sons, Ltd, 27–38
Farahmand Am, Szepesvári C, Audibert JY. 2007. Manifold-adaptive dimension estimation, In
Proceedings of the 24th International Conference on Machine Learning, ICML ’07, p. 265–272,
New York, NY, USA: Association for Computing Machinery
Fefferman C, Mitter S, Narayanan H. 2016. Testing the manifold hypothesis. J. Amer. Math.
Soc. 29(4):983–1049
Garcı́a Trillos N, Gerlach M, Hein M, Slepčev D. 2020. Error estimates for spectral convergence
of the graph laplacian on random geometric graphs toward the laplace–beltrami operator.
Foundations of Computational Mathematics 20(4):827–887
Garcı́a Trillos N, Slepčev D. 2018. A variational approach to the consistency of spectral clustering.
Applied and Computational Harmonic Analysis 45(2):239–281
Genovese CR, Perone-Pacifico M, Verdinelli I, Wasserman LA. 2012. Minimax manifold estima-
tion. Journal of Machine Learning Research 13:1263–1291
Giné E, Koltchinskii V. 2006. Concentration inequalities and asymptotic results for ratio type
empirical processes. The Annals of Probability 34(3):1143 – 1216
Goldberg Y, Zakai A, Kushnir D, Ritov Y. 2008. Manifold learning: The price of normalization.
Journal of Machine Learning Research 9(63):1909–1939
Grassberger P, Procaccia I. 1983. Measuring the strangeness of strange attractors. Physica D:
Nonlinear Phenomena 9(1):189–208
Hastie T, Stuetzle W. 1989. Principal curves. Journal of the American Statistical Association
84(406):502–516
Hegde C, Wakin M, Baraniuk R. 2007. Random projections for manifold learning, In Advances
in Neural Information Processing Systems, eds. J Platt, D Koller, Y Singer, S Roweis, vol. 20.
Curran Associates, Inc.
Hein M, Audibert J, von Luxburg U. 2007. Graph laplacians and their convergence on random
neighborhood graphs. Journal of Machine Learning Research 8:1325–1368
Herring CA, Banerjee A, McKinley ET, Simmons AJ, Ping J, et al. 2018. Unsupervised trajectory
analysis of Single-Cell RNA-Seq and imaging data reveals alternative tuft cell origins in the
gut. Cell Syst 6(1):37–51.e9
Hinton GE, Roweis S. 2002. Stochastic neighbor embedding, In Advances in Neural Information
Processing Systems, eds. S Becker, S Thrun, K Obermayer, vol. 15. MIT Press
Joncas D, Meila M, McQueen J. 2017. Improved graph laplacian via geometric Self-Consistency.
In Advances in Neural Information Processing Systems 30, eds. I Guyon, UV Luxburg, S Ben-
gio, H Wallach, R Fergus, S Vishwanathan, R Garnett. Curran Associates, Inc., 4457–4466
Kim J, Rinaldo A, Wasserman LA. 2019. Minimax rates for estimating the dimension of a
manifold. J. Comput. Geom. 10(1):42–95
Kirichenko A, van Zanten H. 2017. Estimating a smooth function on a large graph by Bayesian
Laplacian regularisation. Electronic Journal of Statistics 11(1):891 – 915
Kleindessner M, von Luxburg U. 2015. Dimensionality estimation without distances, In AISTATS
Kobak D, Linderman G, Steinerberger S, Kluger Y, Berens P. 2020. Heavy-tailed kernels reveal
a finer cluster structure in t-sne visualisations, In Machine Learning and Knowledge Discovery
in Databases, eds. U Brefeld, E Fromont, A Hotho, A Knobbe, M Maathuis, C Robardet, pp.
124–139, Cham: Springer International Publishing
Koelle SJ, Zhang H, Meila M, Chen YC. 2022. Manifold coordinates with physical meaning.
Journal of Machine Learning Research 23(133):1–57
Kohli D, Cloninger A, Mishne G. 2021. Ldle: Low distortion local eigenmaps. Journal of Machine
Learning Research 22(282):1–64
Koltchinskii VI. 2000. Empirical geometry of multivariate data: a deconvolution approach. The
Annals of Statistics 28(2):591 – 629
Kruskal JB. 1964. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypoth-
esis. Psychometrika 29(1):1–27
Nadler B, Lafon S, Coifman R, Kevrekidis I. 2006. Diffusion maps, spectral clustering and eigen-
functions of Fokker-Planck operators, In Advances in Neural Information Processing Systems
18, eds. Y Weiss, B Schölkopf, J Platt, pp. 955–962, Cambridge, MA: MIT Press
Ng A, Jordan M, Weiss Y. 2001. On spectral clustering: Analysis and an algorithm, In Advances
in Neural Information Processing Systems, eds. T Dietterich, S Becker, Z Ghahramani, vol. 14.
MIT Press
Noé F, Clementi C. 2017. Collective variables for the study of long-time kinetics from molecular
trajectories: theory and methods. Current Opinion in Structural Biology 43:141–147
Ozertem U, Erdogmus D. 2011. Locally defined principal curves and surfaces. Journal of Machine
Learning Research 12(34):1249–1286
Perraul-Joncas D, Meila M. 2013. Non-linear dimensionality reduction: Riemannian metric esti-
mation and the problem of geometric discovery. ArXiv e-prints
Perrault-Joncas D, Meila M. 2014. Improved graph laplacian via geometric self-consistency. ArXiv
e-prints
Pettis KW, Bailey TA, Jain AK, Dubes RC. 1979. An intrinsic dimensionality estimator from
near-neighbor information. IEEE Transactions on Pattern Analysis and Machine Intelligence
PAMI-1(1):25–37
Poličar PG, Stražar M, Zupan B. 2019. opentsne: a modular python library for t-sne dimension-
ality reduction and embedding. bioRxiv
Portegies JW. 2016. Embeddings of Riemannian manifolds with heat kernels and eigenfunctions.
Communications on Pure and Applied Mathematics 69(3):478–518
Ram P, Lee D, March W, Gray A. 2009. Linear-time algorithms for pairwise statistical prob-
lems, In Advances in Neural Information Processing Systems, eds. Y Bengio, D Schuurmans,
J Lafferty, C Williams, A Culotta, vol. 22. Curran Associates, Inc.
Rohrdanz MA, Zheng W, Maggioni M, Clementi C. 2011. Determination of reaction coordinates
via locally scaled diffusion map. The Journal of chemical physics 134(12)
Rosenberg S. 1997. The laplacian on a riemannian manifold: An introduction to analysis on
manifolds. London Mathematical Society Student Texts. Cambridge University Press
Roweis S, Saul L. 2000. Nonlinear dimensionality reduction by locally linear embedding. Science
290(5500):2323–2326
Sha F, Saul LK. 2005. Analysis and extension of spectral methods for nonlinear dimensionality
reduction, ICML ’05. New York, NY, USA: Association for Computing Machinery
Shi J, Malik J. 2000. Normalized cuts and image segmentation. IEEE Transactions on Pattern
Analysis and Machine Intelligence 22(8):888–905
Singer A. 2006. From graph to manifold laplacian: The convergence rate. Applied and Compu-
tational Harmonic Analysis 21(1):128–134. Special Issue: Diffusion Maps and Wavelets
Singer A, Wu HT. 2012. Vector diffusion maps and the connection laplacian. Communications
on Pure and Applied Mathematics 65(8):1067–1144
Slepčev D, Thorpe M. 2019. Analysis of p-laplacian regularization in semisupervised learning.
SIAM Journal on Mathematical Analysis 51(3):2085–2120
Tenenbaum JB, de Silva V, Langford JC. 2000. A global geometric framework for nonlinear
dimensionality reduction. Science 290(5500):2319–2323
Ting D, Huang L, Jordan MI. 2010. An analysis of the convergence of graph laplacians, In
Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 1079–
1086
Ting D, Jordan MI. 2018. On nonlinear dimensionality reduction, linear smoothing and autoen-
coding. arXiv: Machine Learning
von Luxburg U. 2007. A tutorial on spectral clustering. Statistics and Computing 17(4):395–416
Weinberger KQ, Saul LK. 2006. An introduction to nonlinear dimensionality reduction by max-
imum variance unfolding, In Proceedings, The Twenty-First National Conference on Artificial
Intelligence and the Eighteenth Innovative Applications of Artificial Intelligence Conference,
July 16-20, 2006, Boston, Massachusetts, USA, pp. 1683–1686, AAAI Press
Yu K, Zhang T. 2010. Improved local coordinate coding using local tangents, In Proceedings of the
27th International Conference on International Conference on Machine Learning, ICML’10,
p. 1215–1222, Madison, WI, USA: Omnipress
Zha H, Zhang Z. 2007. Continuum isomap for manifold learnings. Computational Statistics &
Data Analysis 52(1):184–200
Zhang Y, Gilbert AC, Steinerberger S. 2022. May the force be with you, In 58th Annual Allerton
Conference on Communication, Control, and Computing, Allerton 2022, Monticello, IL, USA,
September 27-30, 2022, pp. 1–8, IEEE
Zhang Y, Steinerberger S. 2021. t-sne, forceful colorings and mean field limits. CoRR
abs/2102.13009
Zhang Z, Zha H. 2004. Principal manifolds and nonlinear dimensionality reduction via tangent
space alignment. SIAM J. Scientific Computing 26(1):313–338