
Manifold learning: what, how, and why

Marina Meila,1 Hanyu Zhang,2


November 8, 2023

Abstract
Manifold learning (ML), known also as non-linear dimension reduction, is a set of methods
to find the low dimensional structure of data. Dimension reduction for large, high
dimensional data is not merely a way to reduce the data; the new representations and
descriptors obtained by ML reveal the geometric shape of high dimensional point clouds,
and allow one to visualize, denoise and interpret them. This survey presents the principles
underlying ML, the representative methods, as well as their statistical foundations from
a practicing statistician’s perspective. It describes the trade-offs, and what theory tells us
about the parameter and algorithmic choices we make in order to obtain reliable conclusions.

Contents
1 Introduction 2

2 Mathematical background 3
2.1 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Manifold and Embedding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.3 (Riemannian) geometry on manifolds and isometric embedding . . . . . . . . . . 4

3 Premises and paradigms in manifold learning 6


3.1 Neighborhood graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.2 Linear local approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.3 Principal curves and principal d-manifolds . . . . . . . . . . . . . . . . . . . . . . 8

4 Embedding algorithms 9
4.1 Review: Principal Component Analysis (PCA) . . . . . . . . . . . . . . . . . . . 9
4.2 “One shot” embedding algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 10
4.2.1 Isomap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
4.2.2 Diffusion Maps/Laplacian Eigenmaps . . . . . . . . . . . . . . . . . . . . 11
4.2.3 Local Tangent Space Alignment (LTSA) . . . . . . . . . . . . . . . . . . . 12
4.3 “Horseshoe” effects, neighbor embedding algorithms, and selecting independent
eigenvectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4.3.1 Relaxation-based neighbor embedding algorithms . . . . . . . . . . . . . . 13
4.3.2 Avoiding the REP in spectral embeddings . . . . . . . . . . . . . . . . . . 14
4.4 Summary of embedding algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 16

5 Statistical basis of manifold learning 16
5.1 Biases in ML. Effects of sampling density and graph construction . . . . . . . . . 18
5.2 Choosing the scale of neighborhood . . . . . . . . . . . . . . . . . . . . . . . . . . 18
5.3 Estimating the intrinsic dimension . . . . . . . . . . . . . . . . . . . . . . . . . . 20
5.4 Estimating the Laplace-Beltrami operator . . . . . . . . . . . . . . . . . . . . . . 21
5.5 Embedding distortions. Is isometric embedding possible? . . . . . . . . . . . . . . 22

6 Applications of manifold learning 24


6.1 Manifold learning in statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
6.2 Manifold learning for visualization . . . . . . . . . . . . . . . . . . . . . . . . . . 25
6.3 Manifold learning in the sciences . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

7 Conclusion 26

1 Introduction
Modern data analysis tasks often face challenges of high dimension and thus nonlinear dimension
reduction techniques emerge as a way to construct maps from high dimensional data to their
corresponding low dimensional representations. Finding such low dimensional representations
of high dimensional data is beneficial in several respects. It saves space and processing time.
More importantly, the low dimensional representation often provides a better understanding of
the intrinsic structure of the data, which often leads to better features for downstream
data analysis algorithms. This survey reviews the mathematical background, methodology,
and recent developments of nonlinear dimension reduction techniques. These techniques have been
developed over the past two decades, since the two seminal works of Tenenbaum et al. (2000) and Roweis & Saul
(2000), and are widely used in data analysis, especially in scientific research.
Before nonlinear dimension reduction emerged, Principal Component Analysis (PCA) was
already widely accepted (Jolliffe, 2002). Intuitively, PCA assumes that high dimensional
data living in RD lie around a lower dimensional linear subspace of RD and seeks to find the best
linear subspace such that data points projected onto this subspace have minimal reconstruction
error. Nonlinear dimension reduction algorithms extend this idea by assuming data are supported
on smooth nonlinear low dimensional geometric objects, i.e., manifolds embedded in RD , and find
maps that send the samples into lower dimensional coordinates while preserving some intrinsic
geometric information.
In this survey, we start with a brief introduction to the main differential geometric concepts
underlying ML, elaborating on the geometric information that manifolds carry (Section 2). Then,
in Section 3, we describe the paradigm of manifold learning, with three possible sub-paradigms,
each producing a different representation of the data manifold. The rest of the paper focuses on
one of these, namely on the so-called embedding algorithms. In Section 4, we survey representa-
tive manifold learning algorithms and their variants. We also discuss the parameter choices, as
well as some pitfalls, which leads to the discussion in Section 5, where we present the statistical
aspects and statistical results supporting these methods. This section also includes the estimation
of crucial manifold descriptors from data: the Laplace-Beltrami operator, the Riemannian
metric, tangent spaces, and the intrinsic dimension. Section 6 discusses applications, connecting with
related statistics problems, and Section 7 concludes the survey.

2 Mathematical background
2.1 Notations
In this survey, we use the notation RD to represent the D-dimensional Euclidean space. A
manifold is denoted as M. Lowercase Greek letters such as φ, ϕ, ψ, ρ, · · · represent functions
mapping from M to a subset of Euclidean space, while English letters like f, g, h, · · · denote
functions between real spaces. The notation C ℓ refers to the class of functions with continuous
derivatives up to order ℓ, and C ∞ represents the class of indefinitely differentiable functions.
The Kronecker delta is symbolized by δij . Bold lowercase English letters, like v and x, are used
to denote vectors in Euclidean space. A dataset containing n data points is represented as the
set D = {xi }ni=1 . Bold uppercase letters, such as A, B, C, denote matrices. By convention, we
treat a single data point as a column vector and use the matrix X ∈ Rn×D to represent the data
matrix of a dataset with n data points, each being a vector in D-dimensional Euclidean space,
while Y ∈ Rn×m will represent the same data mapped into m dimension by a manifold learning
algorithm. The notation ∥v∥ signifies the ℓ2 norm of vector v. Throughout this survey, we also
assume that all functions are smooth, i.e., continuously differentiable as many times as necessary.

2.2 Manifold and Embedding


Manifolds and coordinate charts Readers are referred to Lee (2003), do Carmo (1992)
for a rigorous introduction to manifolds and differential geometry. Intuitively, the notion of a
manifold is a generalization of curves and surfaces. Mathematically, M is a smooth manifold
of dimension d (also called a d-manifold) if it is a topological space such that
• For each p ∈ M, there exists a mapping φ and an open neighborhood U ⊂ M of p such
that φ : U → φ(U ) ⊂ Rd is bijective and both φ, φ−1 are smooth. Such a pair (U, φ) is called
a chart, and φ−1 : φ(U ) → M is called a local coordinate map. Hence, local coordinates map
tuples in Rd to points on M.
• For any two points p, p′ ∈ M and two charts (U, φ), (V, ϕ) containing them, if U ∩ V ̸= ∅,
then the map φ ◦ ϕ−1 is smooth on ϕ(U ∩ V ) and has a smooth inverse.
Hence, a smooth manifold is a set that locally around every point resembles an open set in
Euclidean space and for which transitions between charts are seamless. Moreover, with the help
of smooth coordinate charts, one can define differentiable functions on a manifold, or between
two manifolds, in a natural way. Technically, M is also required to be Hausdorff and second-
countable, but most objects statisticians work with satisfy these conditions. Details about this
mathematical definition of smooth manifolds can be found in differential geometry textbooks,
such as do Carmo (1992).
The simplest example of a manifold is Rd itself, which has a single, global coordinate chart.
The “swiss roll” in Figure 1a is a 2-manifold that also admits a global coordinate chart (into R2 ).
A sphere and the torus in Figure 1b are also 2-manifolds, but they cannot be covered by a single
chart (they each require at least two), as cartographers well know.
Note also that coordinate charts are not unique; (U, ϕ̃) with ϕ̃ = τ ◦ ϕ is also a coordinate
chart whenever τ : ϕ(U ) → Rd is smoothly invertible (τ in this case is a change of variables).
While the multiplicity of charts, atlases and coordinate functions can be daunting at first sight,
the framework of differential geometry is set up so that calculus, geometric, and topological
quantities related to a manifold M are independent of the coordinates chosen. For example, by
compatibility, it follows that the dimension d must be the same for all charts and atlases. Hence,
d is called the intrinsic dimension of the manifold M.

Figure 1: (a) Swiss roll. (b) Torus.

For a data scientist, the above means that, (1), they can work in the coordinate system of
their choice, and intrinsic quantities like d will remain invariant. But, (2), care must be taken
when the low dimensional data from two different algorithms, or from different samples are being
compared, because these may not be in the same coordinate system.

Embeddings In differential geometry, an embedding is a smooth map between two manifolds


F : M → N whose inverse F −1 : F(M) ⊂ N → M exists and is also smooth. Of special interest
is the case M ⊂ N ; then M is said to be a submanifold of N . Commonly in statistics, the high
dimensional data lie originally in N = RD , and we model them by M a submanifold of RD to
be estimated. Then D is called the ambient dimension (of the data). The ML algorithms that
we will focus on can be seen as finding an embedding F : M → Rm , with m ≥ d and m ≪ D; in
particular, if m = d, the embedding F is a (global) coordinate chart.
An advantage of embeddings is that one can avoid using multiple charts to describe a man-
ifold. Instead, one can find a global mapping F : M ⊂ RD → N ⊂ Rm , where N is easier
to understand. Whitney’s embedding Theorem (Lee, 2003) states that every d−dimensional
manifold can be embedded into R2d . Therefore, if one can find a valid embedding, a significant
dimension reduction can be achieved (from D to O(d)). This is one of the major targets of
manifold learning algorithms.
This section has introduced manifolds as spaces that are “like Rd ” locally around a point
p. Next, we show how concepts such as distances and angles, that is, Euclidean geometry, are
transferred from Rd to d-manifolds.

2.3 (Riemannian) geometry on manifolds and isometric embedding


For the data in Figure 2, a scientist may be interested in the distance between two molecular
configurations x1 , x2 , seen as points of M ⊂ RD . Their Euclidean distance ∥x1 − x2 ∥ is readily
available without requiring any additional statistics. However, this value may not be of physical
interest, since most of the putative configurations along the segment x1 to x2 in RD are not
physically possible. To deform from state x1 to x2 , the ethanol molecule must follow a path
contained in (or near) the manifold M of possible configurations, and the distance dM (x1 , x2 )
shall naturally be defined as the shortest possible length of such a path (and is called the geodesic
distance). Just like in Rd the distance between two points is independent of the choice of

Figure 2: Left: The ethanol molecule has 9 atoms; a spatial configuration of ethanol has D = 3×9
dimensions. The CH3 group (atoms 2,6,7,8) and the OH group (atoms 3,9) can rotate w.r.t. the
middle group (atoms 1,4,5), and the blue and orange lines represent these angles of rotation.
Right: A 2-manifold estimated from 50,000 configurations of the ethanol molecule. The manifold
has the topology of a torus, and the color represents the rotation of the OH group. The sharp
“corners” are distortions introduced by the embedding algorithm (explained in Section 5.1).
Figure 6 shows the original data. This dataset is from Chmiela et al. (2017).

basis, and invariant if Rd is a subspace of a larger Euclidean space, distances along curves in
a manifold M can be defined solely based on the coordinate charts (U, ϕ), hence intrinsically,
without reference to the ambient space RD , and are purely geometric, hence are independent of
the choices of charts. This is achieved through Riemannian geometry, as follows.
In Rd , the scalar product ⟨v, u⟩ = v T u is sufficient to define both distances, by ∥v − u∥2 =
⟨v − u, v − u⟩, and angles, by ∠(v, u) = cos−1 (⟨v, u⟩/(∥v∥∥u∥)). Moreover, any positive definite
matrix A ∈ Rd×d can induce an inner product by ⟨v, u⟩A = v T Au; in this context, A is often
called a metric on Rd .
Riemann took up this idea and introduced the Riemannian metric, which plays the same role
as A above; however, this metric is allowed to vary from point to point, smoothly.

Tangent space and Riemannian metric The tangent space Tp M at a point p ∈ M is a


d-dimensional vector space of tangent vectors to M. The canonical basis of Tp M is given by the
tangents to the coordinate functions seen as curves on M, while the tangent vectors can be seen
as tangents (or velocity vectors) at p to smooth curves on M passing through p. The collection
of tangent spaces Tp M for all points p ∈ M, is called the tangent bundle of M, denoted by
T M.
A Riemannian metric g of a manifold M associates to each point p ∈ M an inner product
⟨·, ·⟩g(p) on the tangent space Tp M, which varies smoothly on M. The inner product g defines on
each tangent space the norm ∥v∥g = √⟨v, v⟩g , the distance ∥v1 − v2 ∥g , and the angle
cos−1 (⟨v1 , v2 ⟩g /(∥v1 ∥g ∥v2 ∥g )), for all vectors in Tp M. More importantly, infinitesimal quantities
such as the line element dl = √( Σ_{i,j=1}^d gij dxi dxj ) and the volume element dV = √(det g) dx1 · · · dxd
are also expressed through the Riemannian metric, allowing one to define lengths of curves and
volumes of subsets of M as integrals. These integrals are invariant to the choice of bases in T M,
hence to the choice of coordinate charts on M.

Table 1: Three main paradigms for non-linear dimension reduction

Paradigm                         Representation (map F)                       Output
Linear local                     xi ∈ U ⊂ M → vi ∈ T̂p M ≅ Rd                 D → d, local coordinates only
Principal Curves and Surfaces    xi ∈ RD → x′i ∈ M ⊂ RD                       D → D, global coordinates, noise removal
Embedding                        xi ∈ M → yi ∈ F (M) ⊂ Rm                     D → m, with m ≥ d, global coordinates (or charts)

Isometry and isometric embedding A smooth map F : M → N induces linear maps


dFp : Tp M → TF (p) N called the differential of F at p. If we fix the coordinate systems on M
and N , dFp becomes a dim N × dim M matrix which maps v ∈ Tp M to dFp v ∈ TF (p) N (i.e.,
the Jacobian of F in the given coordinates).
A smooth map F : M → N between Riemannian manifolds (M, g), (N , h), is an isometry if
the Riemannian metric g at each point p is preserved by F , i.e. iff

for all p ∈ M and v1 , v2 ∈ Tp M, ⟨v1 , v2 ⟩g(p) = ⟨dFp (v1 ), dFp (v2 )⟩h(F (p)) (1)

An isometry F preserves geometric quantities such as angles, distances, path lengths, volumes,
etc. An embedding that is also an isometry is called an isometric embedding.
Ideally, we would like a manifold learning algorithm to produce an embedding F into Rm that
is isometric. Here we face one of the most remarkable gaps between mathematical theory and
methodology in manifold learning. Although the Nash embedding theorem (Lee, 2003) long ago established
that an isometric embedding always exists, no practical algorithm capable of isometric
embedding is known at this time (details and refinements of this statement are in Section 5.5).
However, by estimating auxiliary information, one can still work with a non-isometric embedding
as if it were isometric (see Section 5.5).
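To make condition (1) concrete, here is a minimal sketch (not from the paper) that checks local isometry at a single point, assuming one is given the Jacobian J of F at p in fixed coordinates and the metric matrices g at p and h at F (p):

```python
import numpy as np

def is_local_isometry(J, g, h, tol=1e-8):
    """Condition (1) at a single point p: the pullback of the metric h on N by the
    differential dF_p (the Jacobian J, shape dim N x dim M) must equal the metric g
    on M, i.e. J.T @ h @ J == g."""
    return np.allclose(J.T @ h @ J, g, atol=tol)

# e.g. the identity map of R^2 with Euclidean metrics:
# is_local_isometry(np.eye(2), np.eye(2), np.eye(2))  -> True
```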

3 Premises and paradigms in manifold learning


The Manifold Assumption Suppose we are given data {xi }ni=1 where each data entry xi ∈
RD . It is assumed that data are sampled from a distribution P that is supported on, or close to
a d dimensional manifold M embedded in RD . This is the Manifold Assumption. Throughout
this survey, with a few noted exceptions, we will discuss the no noise case, when the data lie
exactly on M.

Manifold learning A manifold learning algorithm can be thought of as a mapping F of xi ∈ RD


to yi ∈ Rm . The embedding dimension m is usually much smaller than D but could be higher
than the intrinsic dimension d. In the regime that P is supported exactly on M, and sample size
n → ∞, a valid manifold learning algorithm F should converge to a smooth embedding function
F . This implies that the algorithm should be guaranteed to recover the manifold M , regardless
of the shape of M.
Once the manifold assumption is believed to be true, most manifold learning and non-linear
dimension reduction methods can be grouped into three paradigms, which differ in the way they
represent the recovered manifold. They are local linear approximations (Section 3.2), Principal
Curves and Surfaces (3.3), and embedding algorithms, which will be the focus of Section 4.

The Manifold Assumption itself is testable. For example, Fefferman et al. (2016) test
whether, given an i.i.d. sample, there exists a manifold M that can approximate this sample
with tolerance ε. These results are currently not practically useful, as they require usually
unknown parameters of the manifold (d, reach, volume) to be known or estimated. However,
they, as well as Genovese et al. (2012), give us the confidence to develop and use ML algorithms
in practice.

3.1 Neighborhood graphs


Practically all manifold learning algorithms start with finding the neighbors of each data point
xi . This leads to the construction of a neighborhood graph; this graph, with suitable weights,
summarizing the local geometric and topological information in the data, is the typical input
to a non-linear dimension reduction algorithm. Every data point xi represents a node in this
graph, and two nodes are connected by an edge if their corresponding data points are neighbors.
Throughout the survey, we use Ni to denote the set of neighbors of xi and ki = |Ni | the number
of neighbors of xi (including xi itself). The matrix Ni ∈ Rki ×D is the matrix whose rows are
the neighbors of xi .
There are two usual ways to define neighbors. In a radius-neighbor graph, xj is a neighbor
of xi iff ||xi − xj ||≤ r. Here r is a parameter that controls the neighborhood scale, similar to a
bandwidth parameter in kernel density estimation. Consistency of manifold learning algorithms
is usually established assuming an appropriately selected neighborhood size, that decreases slowly
with n (see Section 5.2). In the k-nearest neighbor (k-NN) graph, xj is a neighbor of xi iff xj is
among the closest k points to xi . Since this relation is not symmetric, usually the neighborhoods
are symmetrized to obtain an undirected neighborhood graph.
The k−NN graph has many computational advantages w.r.t. the radius neighbor graph; it
is more regular and often it is connected when the latter is not. More software is available to
construct (approximate) k-NN graphs fast for large data. But theoretically, it is much more
difficult to analyze, and fewer consistency results are known for k−NN graphs (Sections 5.1,
5.4). Intuitively, ki , the number of neighbors in the radius graph, is proportional to the local data
density, and manifold estimation can be analyzed through the prism of kernel regression, while
the k-NN graph is either asymmetric or, if symmetrized, becomes more complicated to analyze.
The distances between neighbors are stored in the distance matrix A, with Aij being the
distance ||xi − xj || if xj ∈ Ni , and infinity if xj is not a neighbor of xi .
Some algorithms weight the neighborhood graph by weights that are non-increasing with
distances; the resulting n × n matrix is called the similarity matrix (or sometimes kernel matrix).
The weights are given by a kernel function,

    Kij := K( ∥xi − xj ∥ / h )  if xj ∈ Ni ,    and    Kij := 0  otherwise.        (2)


The kernel function here is almost universally the Gaussian kernel, defined as K(u) =
exp(−u2 ) (Belkin et al. (2006), Ting et al. (2010), Coifman & Lafon (2006), Singer & Wu
(2012), etc.). In the above, h, the kernel width, is another hyperparameter that must be tuned.
Note that, even if Ni trivially contained all the data, the similarity Kij would vanish for far-
away data points. Therefore, equation (2) effectively defines a radius-neighbor graph with r ∝ h.
Hence, a rule of thumb is to select r to be a small multiple of h (e.g., 3–10h).
It is sometimes also useful to take the constant kernel K(u) = 1. Then the similarity matrix K
is the same as the unweighted adjacency matrix of the radius neighbor graph. By construction,
K is usually a sparse matrix, which helps accelerate the computation.

When D is large, which is often the case, and n is large too, which is necessary
for manifold recovery, constructing the neighborhood graph can be the most computationally
demanding step of the algorithm. Fortunately, much work has been devoted to speeding up this
task, and approximate algorithms are now available, which run in almost linear time in n
(the computational complexity can be reduced to O(n1+δ ), where δ < 1 is a positive constant) and
have very good accuracy (Ram et al. (2009)).
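As an illustration of the constructions above, the following sketch builds the radius-neighbor graph and the Gaussian similarity matrix of equation (2) with scikit-learn's neighbor search; the bandwidth h and radius r (a small multiple of h, per the rule of thumb above) are user-supplied, and the function name is ours, not a library routine.

```python
import numpy as np
from sklearn.neighbors import radius_neighbors_graph

def gaussian_similarity(X, h, r):
    """Sparse similarity matrix of eq. (2): radius-r neighbors weighted by the
    Gaussian kernel K(u) = exp(-u^2) applied to distance/h; non-neighbors get 0."""
    A = radius_neighbors_graph(X, r, mode='distance')   # sparse distance matrix
    K = A.copy()
    K.data = np.exp(-(A.data / h) ** 2)
    return K    # CSR matrix; row i holds the weights K_ij of x_i's neighbors

# e.g. (synthetic data): X = np.random.rand(2000, 3); K = gaussian_similarity(X, h=0.1, r=0.3)
```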
Next, we briefly describe two of the paradigms in Table 1, then, from Section 4 on focus on
the third, embedding algorithms.

3.2 Linear local approximation


This idea stems from the classical Principal Component Analysis, and it seeks to adapt it
to data sampled from a curved manifold, instead of a linear subspace. Random projections
(Baraniuk & Wakin, 2009, Hegde et al., 2007) were first proposed to learn the structure of a
low dimensional manifold. Let F : RD → Rm be a random orthogonal projector1 ; it is sufficient
to have m ≥ O(d log n/ε2 ) to preserve all distances approximately, in the sense that
(1 − ε)√(m/D) ≤ dist(F (xi ), F (xj ))/dist(xi , xj ) ≤ (1 + ε)√(m/D) holds for any xi , xj from M.
Here dist can be either the geodesic distance on M or the Euclidean distance on RD . For large data,
however, this approach leads to a large m, not useful for dimension reduction.
Both PCA and random projections seek global linear representations, and they do not utilize
the geometric structure of the manifold in a finer scale at a reference point x. One improvement
is to perform PCA on a weighted covariance matrix, with weights decaying away from x, i.e., let

    C = (1/n) Σ_{i=1}^n wi (xi − x)(xi − x)⊤ .        (3)


The weights can be computed using the kernel K() and bandwidth h used for the neighborhood
graph, for example. This weighted PCA procedure is sometimes termed local PCA (lPCA).
Local PCA is only used to understand the geometric structure near a reference point, usually
one of the points in D. One can map all the data by performing lPCA with a subset of the
data points as reference points, so that each xi is sufficiently well approximated by its projection.
This way, all the data are represented in d dimensional coordinates. However, some points can
have more than one representation, if they are close to two or more reference points. All the
representations (i.e., coordinates) are local, and understanding or making inferences on the entire
manifold is tedious at best. This approach can be refined into a multiscale local linear approach,
which is more accurate and parsimonious (Chen et al., 2013).
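A minimal sketch of local PCA around one reference point, following equation (3) with Gaussian weights; the bandwidth h and the choice of reference point x are left to the user.

```python
import numpy as np

def local_pca(X, x, h, d):
    """Local PCA at a reference point x: Gaussian weights with bandwidth h, the
    weighted covariance of eq. (3), and projection of the centered data on its
    top-d eigenvectors (an estimate of the tangent space at x)."""
    diff = X - x
    w = np.exp(-np.sum(diff ** 2, axis=1) / h ** 2)    # kernel weights
    C = (diff * w[:, None]).T @ diff / len(X)          # weighted covariance, eq. (3)
    _, vecs = np.linalg.eigh(C)                        # ascending eigenvalues
    T = vecs[:, -d:]                                   # top-d principal directions
    return diff @ T                                    # local d-dimensional coordinates
```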

3.3 Principal curves and principal d-manifolds


This paradigm is the only one of the three where noise is assumed. Consider data of the form
xi = x∗i +ϵi , where ϵi represents 0-mean noise, and the x∗i are sampled from a curve, for instance.
Then, to recover the curve, one would need to estimate the local mean of the data, and this
was proposed in the seminal work of Hastie & Stuetzle (1989); unfortunately, the estimation is
difficult, and this definition does not lead to a unique curve. A more recent proposal that removed
the previous difficulties was to define the principal curve as the ridge of the data density. A point
x is on a ridge if it is a local maximum of the density fx in D − 1 directions, and the remaining
direction coincides with the gradient ∇fx . Several ridges may meet at a peak, i.e., at a local
1 Obtained by orthogonalizing a matrix with i.i.d. normal or Bernoulli entries

maximum, but in between, the ridges are manifolds (Chen et al., 2015) if the density fx is smooth
enough. This concept can be extended to principal surfaces, and principal d-manifolds.
The ridge can be estimated with the Subspace Constrained Mean Shift (SCMS) algorithm
(for more details, we refer the reader to Ozertem & Erdogmus (2011)). The SCMS algorithm
maps each xi to a point yi ∈ RD lying on the principal curve (or d-manifold). Hence, as a
manifold estimation algorithm, this method does not reduce dimension. However, unlike local
linear maps, the output is in a global coordinate system, i.e. RD .
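For illustration, here is a rough sketch of one SCMS-style update at a query point, under the simplifying assumption that the small-eigenvalue directions of the local weighted covariance stand in for the corresponding eigendirections of the Hessian of the log-density used by Ozertem & Erdogmus (2011); iterating the update moves the point toward the estimated ridge.

```python
import numpy as np

def scms_step(y, X, h, d):
    """One subspace-constrained mean-shift update of a query point y toward the
    d-dimensional density ridge of the data X, with Gaussian weights of bandwidth h.
    The mean-shift move is projected on the D - d low-variance directions of the
    local weighted covariance (a stand-in for the Hessian-based projector of the
    original SCMS algorithm)."""
    diff = X - y
    w = np.exp(-np.sum(diff ** 2, axis=1) / (2 * h ** 2))
    w = w / w.sum()
    m = w @ X                                        # mean-shift target
    C = (diff * w[:, None]).T @ diff                 # local weighted covariance
    _, vecs = np.linalg.eigh(C)                      # ascending eigenvalues
    V = vecs[:, : X.shape[1] - d]                    # directions orthogonal to the ridge
    return y + V @ V.T @ (m - y)
```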
Usually, the ridge does not coincide with the mean of the data; the bias depends on the
curvature of the manifold: the density is higher on the “inside” of the curve. However, owing to
their smoothing property, principal d-manifolds are remarkably useful in the analysis of manifold
estimation in noise (Genovese et al., 2012, Mohammed & Narayanan, 2017).
We have quickly reviewed two simple methods for manifold estimation: local linear approxi-
mation reduces the dimension locally, but offers no global representation, while principal curves
produce a global representation but do not reduce dimension. It is time to focus on the third class
of algorithms, those that produce embeddings: representations that are both global and low dimensional.

4 Embedding algorithms
The term “manifold learning” was proposed in the seminal works introducing two algorithms, LLE (Roweis
& Saul (2000)) and Isomap (Tenenbaum et al. (2000)), which together inaugurated the modern era
of non-linear dimension reduction. In this section, we introduce classical manifold learning
algorithms that (attempt to) find a global embedding Y ∈ Rn×m of the data set D.
We separate the algorithms, roughly, into “one-shot” algorithms, which obtain embedding co-
ordinates from the principal eigenvectors of some matrix derived from the neighborhood graph, or
by solving some other global (usually convex) optimization problem, and “attraction-repulsion”
algorithms, which proceed from an initial embedding Y (often produced by a one-shot algorithm)
and improve it iteratively. While this taxonomy can rightly be called superficial, at present, it
represents a succinct and relatively accurate summary of the state of the art.
No matter what the approach, given the neighborhood information summarized in the weighted
neighborhood graph, an embedding algorithm’s task is to produce a smooth mapping of the in-
puts which distorts the neighborhood information as little as possible. The algorithms that
follow differ in their choice of information to preserve, and in the sometimes implicit constraints
on smoothness.

4.1 Review: Principal Component Analysis (PCA)


Before we dive into various non-linear dimension reduction algorithms, we briefly review
linear dimension reduction methods.
Linear dimension reduction methods find a global embedding of the data in a low dimensional
linear subspace. One way to understand principal component analysis is as finding the d dimensional
linear subspace V such that the data {xi } projected onto it have the smallest reconstruction
error. Let V have an orthonormal basis T ∈ RD×d , so that T⊤ T = Id . Then xi projected onto V
has the low dimensional representation yi = T⊤ xi in the basis T, while in RD the projection of xi
onto V is given by TT⊤ xi . If we introduce the data matrix X ∈ Rn×D , with i-th row xi⊤ , then
the low dimensional representation matrix Y ∈ Rn×d is given by XT and in RD the projected
data matrix is XTT⊤ .

Then we can write the PCA problem as

    min_{T∈RD×d , T⊤ T=Id}  Σ_{i=1}^n ∥xi − TT⊤ xi ∥2  =  min_{T∈RD×d , T⊤ T=Id}  ∥X − XTT⊤ ∥2_F .        (4)


Consider the singular value decomposition preserving only the first d singular values, X =
UΣV⊤ , where U ∈ Rn×d , V ∈ RD×d have orthonormal columns and Σ is a d × d diagonal matrix;
then the solution of this problem is T = V. The low dimensional representation of the original
data is Y = XT = UΣ; its columns are also called the principal components. In the terminology
of PCA, the columns of V are called principal vectors. They characterize the directions that
explain the variance in the data.
When the data xi are centered, the (unnormalized) sample covariance matrix of the data is C =
X⊤ X. The solution to PCA can also be found by eigendecomposition of C: the first d eigen-
vectors of C form the matrix V. If the dimension D ≫ n, it is easier to instead compute the
n × n Gram matrix G = XX⊤ and perform a truncated eigendecomposition G = UΣ2 U⊤ . The low
dimensional representation is still Y = XV = UΣ.
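For reference, a minimal PCA sketch via the truncated SVD of the centered data matrix, matching the solution of (4):

```python
import numpy as np

def pca(X, d):
    """PCA by truncated SVD of the centered data matrix, solving (4): returns the
    d-dimensional representation Y = U S (principal components) and the principal
    vectors V (columns = directions of largest variance)."""
    Xc = X - X.mean(axis=0)                           # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :d] * S[:d], Vt[:d].T
```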

4.2 “One shot” embedding algorithms


4.2.1 Isomap
Classical Multidimensional Scaling (MDS, Kruskal (1964)) takes as input a pairwise distance
matrix A and outputs coordinates Y ∈ Rd that best preserve the distances. Isomap is a
generalization of multidimensional scaling that preserves distances between data points while
finding low dimensional coordinates. Instead of the Euclidean distances used in classical MDS,
Isomap uses shortest path distances in the neighborhood distance graph to approximate geodesic
distances on the manifold.

Algorithm 1 Isomap

Input: Neighborhood distance matrix A, embedding dimension m
1: Compute the shortest path distance matrix Ã, where Ãij = Aij if Aij < ∞, and Ãij is the
   shortest path distance between i and j in the neighborhood graph if Aij = ∞.
2: Multidimensional Scaling: Y = MDS(M, m) with M = [Ã2ij ]
Output: m dimensional coordinates Y for D

Intuitively, the shortest graph distance is a good approximation to the geodesic distance in
a neighborhood provided that data are sufficiently dense in this region and neighborhood size
is appropriately chosen (Bernstein et al., 2000). In the limit of large n, Isomap was shown
to produce isometric embeddings for m = d, whenever the data manifold is flat, i.e. admits an
isometric embedding in Rd , and data space is convex. Empirically, Isomap embeddings are
close to isometric also when m > d and m is sufficient for isometric embedding.
The computational complexity of Isomap is O(n3 ), with most of the computational burden in
computing all pairs of shortest path distances. The space complexity is O(n2 ). Since Isomap
works with dense matrices, this space complexity cannot be improved.
There are variants of Isomap that improve it in different ways: Hessian Eigenmaps (Donoho
& Grimes, 2003) handles non-convex data by introducing the use of the Hessian operator;

Continuum Isomap (Zha & Zhang, 2007) generalizes Isomap to a continuous version such that
out-of-sample extension of Isomap is possible.
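A minimal sketch of Algorithm 1 (assuming the neighborhood graph is connected); in practice one would use an off-the-shelf implementation such as sklearn.manifold.Isomap.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def isomap(A, m):
    """A: sparse matrix of distances between neighbors (missing entry = not neighbors);
    m: embedding dimension. Graph shortest paths approximate geodesic distances,
    which are then embedded by classical MDS. Assumes the graph is connected."""
    G = shortest_path(A, directed=False)               # all-pairs graph distances
    n = G.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n                # centering matrix
    B = -0.5 * J @ (G ** 2) @ J                        # double-centered squared distances
    vals, vecs = np.linalg.eigh(B)                     # ascending eigenvalues
    top = np.argsort(vals)[::-1][:m]
    return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0.0))
```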

4.2.2 Diffusion Maps/Laplacian Eigenmaps


Unlike Isomap, DM, as well as most embedding methods, work with a sparse matrix derived
from the similarity K; namely, they embed the data by the eigenvectors of the graph Laplacian
L. The construction of L, also called the Diffusion Maps Laplacian or renormalized Laplacian,
is described in Algorithm 3 (and it consists of a column normalization of K, followed by a row
normalization). The LE algorithm differs from DM only in the use of a different Laplacian,
Lnorm below.

Algorithm 2 Diffusion Maps/Laplacian Eigenmaps

Input: Graph Laplacian L (or Lnorm ), embedding dimension m.
1: Compute the eigenvectors {v i }m_{i=0} corresponding to the smallest m + 1 eigenvalues of L, each v i ∈ Rn .
2: Discard v 0 .
3: Represent each xj by yj = (vj1 , · · · , vjm )⊤
Output: Y

To construct a graph Laplacian matrix, let di = Σ_{j∈Ni} Kij represent the degree of node i
and D = diag{d1 , · · · , dn }. Then multiple choices of graph Laplacian exist:

• Unnormalized Laplacian: Lun = D − K


• Normalized Laplacian: Lnorm = I − D−1/2 KD−1/2
• Random-walk Laplacian: Lrw = I − D−1 K

• Renormalized Laplacian L described below.

Algorithm 3 Renormalized Laplacian

Input: Neighborhood distance matrix A, kernel function k() and kernel bandwidth h
Compute the similarity matrix Kij = k(Aij /h)
Normalize columns: dj = Σ_{i=1}^n Kij , K̃ij = Kij /dj for all i, j = 1, . . . , n
Normalize rows: d′i = Σ_{j=1}^n K̃ij , Pij = K̃ij /d′i for all i, j = 1, . . . , n
Output: L = (I − P)/h2

Why one Laplacian rather than another? The reason is that, even though in many simple
examples the difference is hard to spot, one needs to ensure that, as more samples are
collected, the limit of these L’s is well defined, and the embedding algorithm is unbiased. It
is easy to see that Lnorm and Lrw are similar matrices. Moreover, whenever the degrees di
are constant, L = Lrw ∝ Lun , hence all Laplacians should produce the same embedding. The
difference appears when the data density is non-uniform, making some xi be surrounded more
densely by other data points than others. The seminal work of Coifman & Lafon (2006), which
introduced renormalization, showed that in this case the eigenvectors of Lnorm , Lrw are biased
by the sampling density and that renormalization removes this bias. Section 5.4 and Figure 6
illustrate this.
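A minimal sketch combining Algorithms 2 and 3, starting from a sparse symmetric similarity matrix K; since the eigenvectors of L = (I − P)/h2 coincide with those of P, the sketch works with the top eigenvectors of P directly. Production implementations treat the eigensolver and the asymmetry of P more carefully.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigs

def diffusion_maps(K, m):
    """K: sparse symmetric similarity matrix; m: embedding dimension.
    Renormalize as in Algorithm 3, then keep the eigenvectors of the m+1 largest
    eigenvalues of P (= smallest of L, which shares eigenvectors with P) and drop
    the constant one, as in Algorithm 2."""
    K = sp.csr_matrix(K)
    d = np.asarray(K.sum(axis=0)).ravel()
    K_tilde = K @ sp.diags(1.0 / d)                   # column normalization
    d_prime = np.asarray(K_tilde.sum(axis=1)).ravel()
    P = sp.diags(1.0 / d_prime) @ K_tilde             # row normalization (row-stochastic)
    vals, vecs = eigs(P, k=m + 1, which='LR')         # top eigenvalues of P
    order = np.argsort(-vals.real)
    return vecs[:, order].real[:, 1:]                 # coordinates y_i (drop constant v^0)
```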

The idea of spectral embedding appeared in (Shi & Malik, 2000, Belkin & Niyogi, 2002)
as a trick for clustering, and was then generalized as a data representation method in Belkin
& Niyogi (2003) as LE. They connect the graph Laplacian with the famous Laplace-Beltrami
operator ∆M of the manifold, a differential operator that plays an important role in modern
differential geometry (Rosenberg, 1997). Estimating the Laplace-Beltrami operator itself is an
important geometric estimation problem that will be reviewed in Section 5.4.
Because L is sparse, DM/LE are computationally less challenging when compared with
Isomap.

4.2.3 Local Tangent Space Alignment (LTSA)


This algorithm, proposed in Zhang & Zha (2004), seeks to find a local representation in the tangent
space at each point xi , and then aligns these to obtain global coordinates.
The first stage of LTSA finds the local representation of the neighboring points j ∈ Ni via projec-
tions on the tangent space Txi M; by a Taylor expansion, yj − yi can locally be approximated by an
affine transformation of the orthogonal projections of xj onto the tangent space at xi . The optimal
affine transformation is obtained by minimizing the reconstruction error near each xi ,

    min_{x̃i , Θ, Q}  Σ_{j∈Ni} ∥xj − (x̃i + Q θj^{(i)} )∥2 ,        (5)

where x̃i , Q are the translation and rotation that parametrize this affine transformation, and θj^{(i)}
is the local coordinate of each neighbor xj projected on this linear subspace.
In the second stage of LTSA, one obtains global embedding coordinates Y from the θj^{(i)} ,
preserving the local geometry information, by minimizing a global reconstruction error

    min_{{yi }^n_{i=1} , {Pi }^n_{i=1}}  Σ_{i=1}^n Σ_{j∈Ni} ∥yj − ỹi − Pi θj^{(i)} ∥2 .        (6)


The optimization in both steps can be transformed into eigenvalue problems. The resulting
algorithmic procedure of LTSA is displayed in Algorithm 4.

Algorithm 4 Local tangent space alignment

Input: Dataset D, embedding dimension m.
B=0
for i = 1, 2, · · · , n do
   Find the k nearest neighbors of xi : xj , j ∈ Ni .
   Form the local data matrix Ξi = [xj − x̃i ]j∈Ni , where x̃i is the average of the neighbors of xi .
   Compute the m largest eigenvectors ṽ 1 , · · · , ṽ m of Ξ⊤i Ξi , and set Gi = [1/√k · 1, ṽ 1 , · · · , ṽ m ]
   Update the Ni × Ni block of B: B(Ni , Ni ) = B(Ni , Ni ) + I − Gi G⊤i
end for
Compute the 2nd through (m + 1)-th smallest eigenvectors of B, {v j }m_{j=1} , each eigenvector v j ∈ Rn .
Output: m dimensional embeddings yi = (vi1 , · · · , vim )

We have seen three embedding algorithms so far in this section: Isomap, LE, and LTSA, which
all come with certain performance guarantees. On the other hand, locally linear embedding
(LLE, Roweis & Saul (2000)) is a heuristic method that utilizes a very similar idea: estimate
local representations first and then align them globally. However, vanilla LLE does not perform
well empirically and lacks theoretical guarantees (Ting et al., 2010, Ting & Jordan, 2018).
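In practice, LTSA is available off the shelf; for instance, scikit-learn exposes it as a variant of its locally linear embedding estimator (the neighborhood size below is an arbitrary illustration value):

```python
from sklearn.manifold import LocallyLinearEmbedding

ltsa = LocallyLinearEmbedding(n_neighbors=20, n_components=2, method='ltsa')
# Y = ltsa.fit_transform(X)   # X: (n, D) data matrix, Y: (n, 2) LTSA embedding
```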

4.3 “Horseshoe” effects, neighbor embedding algorithms, and selecting
independent eigenvectors
Algorithms that use eigenvectors, such as DM, are among the most promising and well-studied
in ML (see Sections 5.1,5.2,5.4). Unfortunately, such algorithms fail when the data manifold has
a large aspect ratio (such as a long, thin strip, or a thin torus). This problem has been called
the Repeated Eigendirection Problem (REP) and has been demonstrated for LLE, LE, LTSA,
HE (Goldberg et al., 2008), and is pervasive in real data sets.
From a differential geometric standpoint, the REP is a drop in the rank of the embedding
Jacobian, due to eigenvectors (or eigenfunctions, in the limit) that are harmonics of previous ones,
as shown in Figure 3. For example, for a rectangular strip (and a finite sample), the scatterplot of
(vi1 , vi2 )i=1,...,n from step 3 of DM follows a parabola; hence, it is a 1-dimensional mapping,
even though the rectangle is 2-dimensional. This fact is a useful diagnostic for the REP in practice:
when an embedding looks like a “horseshoe”, this may not represent a property of the data,
but an artifact signalling that one of the data dimensions is collapsed, or poorly reflected in the
embedding (Diaconis et al., 2008).

4.3.1 Relaxation-based neighbor embedding algorithms


The pervasiveness of the REP stimulated the development of algorithms that balance attraction
between neighbors in the original space, with repulsion between neighbors in the embedding space
(van der Maaten & Hinton, 2008, McInnes et al., 2018, Jacomy et al., 2014, Carreira-Perpiñan,
2010, Im et al., 2018). Usually, the embedding coordinates Y are optimized iteratively until
equilibrium is reached.
The t-SNE algorithm (van der Maaten & Hinton, 2008), of which we briefly describe here the
variant of Böhm et al. (2022), exemplifies this approach.
Stochastic neighbor embedding (SNE), proposed in Hinton & Roweis (2002), changes the
neighborhood relationship from a hard 0-1 coding to conditional probabilities. The algorithm
computes two sets of conditional probabilities: pij which models the probability of xi , xj being
neighbors (and is the algorithm input), and qij that models the probability of output points
yi , yj being neighbors. In van der Maaten & Hinton (2008), the authors proposed to use a
Student-t distribution to model these conditional probabilities and, as t-SNE, this algorithm
became widely used.
In more detail, from pij and qij , t-SNE constructs two similarity matrices; V is the sim-
ilarity between data points, calculated as V = (D−1 K + KD−1 )/(2n), where K denotes the
k-nearest neighbor similarity matrix. The matrix W = [Wij ]ni,j=1 represents similarities in the
embedding space; Wij = 1/(1 + Aout_ij ), where Aout_ij = ∥yi − yj ∥2 is the squared distance
matrix in the embedding space, a dense matrix.
The t-SNE algorithm starts with arbitrary coordinates Y ∈ Rn×m , and iteratively updates
them by gradient descent to minimize the following loss function, which is akin to a cross-entropy
(Hinton & Roweis, 2002, van der Maaten & Hinton, 2008):

    Lt−SNE = −(1/n) Σ_{i,j} Vij ln Wij + ln Σ_{ij} Wij .        (7)

In the above, Σ_{ij} Wij = wtot normalizes the entries of W to 1. Thus, the original aim of t-
SNE is to match the (normalized) data weights by the (normalized) embedding weights around
each point, which motivates the name Stochastic Neighbor Embedding (SNE, Hinton & Roweis
(2002)).

Uniform manifold approximation and projection (UMAP, McInnes et al. (2018)) is another
popular heuristic method. On a high level, UMAP minimizes the mismatches between topological
representations of high-dimensional data set {xi }ni=1 and its low-dimensional embeddings yi .
Theories of UMAP are still very limited.
t-SNE has the advantage of being sensitive to local structure and to clusters in data (Lin-
derman & Steinerberger, 2019, Kobak et al., 2020) (but does not explicitly preserve the global
structure). We note that this propensity for finding clusters comes partly from the choice of
neighborhood graph (Section 5.1). However, this is not the whole story. Recently, it has been
shown that this property stems from the gradient of the loss function Lt−SNE , which has the
form

    ∂Lt−SNE /∂yi  =  Σ_j Vij Wij (yi − yj )  −  (n/ρ) Σ_j (Wij /wtot ) (yi − yj ).        (8)


In the above, the first term is an attraction between graph neighbors, while the second represents
repulsive forces between the embedded points y1:n (Böhm et al., 2022, Zhang et al., 2022).
Note the additional important parameter ρ, which controls the trade-off between attraction and
repulsion (ρ corresponds to a version of the cost with wtot^{1/ρ} in the second term). In Böhm et al.
(2022) it is shown that varying ρ from small to large values decreases the cluster separation and
makes the embedding more similar to the LE embedding. Moreover, quite surprisingly, Böhm
et al. (2022) show that by varying ρ, t-SNE can emulate a variety of other algorithms, most
notably UMAP (McInnes et al., 2018) and ForceAtlas (Jacomy et al., 2014). Other works that
analyze the attraction-repulsion behavior of t-SNE are Zhang & Steinerberger (2021). One yet
unsolved issue with t-SNE is the choice of the number of neighbors k. Most applications use
the default k = 90 (Poličar et al., 2019); this choice, as well as other behaviors of this class of
algorithms, are discussed in Zhang et al. (2022).
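As an illustration of the attraction-repulsion mechanics, here is a sketch of one plain gradient step based on the form of (8), assuming a dense symmetric affinity matrix V; constants, momentum, early exaggeration, and the Barnes-Hut acceleration used by real t-SNE implementations are omitted, and memory is O(n2).

```python
import numpy as np

def attraction_repulsion_step(Y, V, rho=1.0, lr=0.1):
    """One plain gradient step on the embedding Y (n, m) using the form of (8).
    V is a dense symmetric (n, n) affinity matrix from the input space."""
    n = Y.shape[0]
    diff = Y[:, None, :] - Y[None, :, :]            # y_i - y_j, shape (n, n, m)
    W = 1.0 / (1.0 + np.sum(diff ** 2, axis=2))     # Student-t similarities
    np.fill_diagonal(W, 0.0)
    w_tot = W.sum()
    attract = np.einsum('ij,ij,ijk->ik', V, W, diff)
    repulse = (n / rho) * np.einsum('ij,ijk->ik', W / w_tot, diff)
    return Y - lr * (attract - repulse)
```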
Note also that, since the REP can be interpreted as extreme distortion, Riemannian
Relaxation (Perrault-Joncas & Meila (2014), discussed in Section 5.5) can also be used to improve
the conditioning of an embedding in an iterative manner.
Finally, in Maximum Variance Unfolding (MVU), proposed in Weinberger & Saul (2006),
Arias-Castro & Pelletier (2013), repulsion is implemented via a Semidefinite Program, hence the
embedding Y is obtained by solving a convex optimization. This algorithm can be seen both as
a one-shot and as an attraction-repulsion algorithm; Diaconis et al. (2008) show that MVU is
related to the fastest mixing Markov chain on the neighborhood graph.

4.3.2 Avoiding the REP in spectral embeddings


For algorithms like DM and LTSA, the REP has a theoretically straightforward solution. Given
a sequence of eigenfunctions F 1 , . . . , F m′ , . . . on M (or eigenvectors v 1 , . . . , v m′ in the finite sam-
ple case), with m′ > m, sorted by their corresponding eigenvalues, one needs to select F j1 = F 1 ,
then (recursively) F j2 , . . . , F jm so that rank[(dF 1 )p , . . . , (dF jm )p ] = d for all p ∈ M. This is
called Independent Eigendirection Selection (IES). In a finite sample, the rank condition must
be replaced with the well-conditioning of dF at the data points. Dsilva et al. (2018) proposed
to measure dependence by regressing v jk+1 on the previously selected v j1 , . . . , v jk ; in Chen & Meila
(2021), a condition number derived from the embedding metric (Section 5.5) is used to evaluate
entire sets of m eigenvectors. The manifold deflation method of Ting & Jordan (2020) proposes
to bypass eigenvector selection by choosing a linear combination of all eigenvectors that is op-
timized w.r.t. rank. Finally, the Low Distortion Local Eigenmaps (LDLE) of Kohli et al. (2021)
solves the REP by essentially covering the data manifold with contiguous patches (discrete ver-
sions of the U neighborhoods) and performing IES on each patch separately. LDLE not only

Figure 3: Embedding algorithms failing to find a full rank mapping, if they greedily select the
first m = 2 eigenvectors, and correction by a more refined choice of eigenvectors. Top row:
Embeddings of galaxy spectra from the SDSS (Section 6) by DM ; middle “horseshoe” when
first 2 eigenvectors are used; right the same data, with selection of the second eigenvector (in
this case by Chen & Meila (2021)). Bottom row: embeddings of a swiss roll with length 7 times
the width. Left: first 2 eigenvectors from DM/LE; middle after UMAP. Note that UMAP by
itself is not able to produce a full-rank embedding everywhere; the horseshoe, the two clusters,
and the 1 dimensional “filament” between are all artifacts. Right: UMAP with selection of the
second eigenvector by Chen & Meila (2021). Plots by Yu-Chia Chen.

avoids REP, but it is a first step towards the algorithmic use of charts and atlases to complement
global embeddings.
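To illustrate the IES idea, here is a simplified sketch of the dependence criterion of Dsilva et al. (2018) (without the leave-one-out refinement used by the original method): a candidate eigenvector is locally regressed on the already selected ones, and a small normalized residual flags it as a repeated direction. The argument names and the kernel scale eps are ours.

```python
import numpy as np

def dependence_residual(prev, candidate, eps):
    """prev: (n, k) already-selected eigenvectors (as columns); candidate: (n,) next
    eigenvector; eps: kernel scale in the selected coordinates. Returns a normalized
    local-linear-regression residual: near 0 means the candidate is a harmonic of
    the previous ones (repeated direction), near 1 means it adds a new direction."""
    n = prev.shape[0]
    d2 = np.sum((prev[:, None, :] - prev[None, :, :]) ** 2, axis=2)
    K = np.exp(-d2 / eps ** 2)
    fitted = np.empty(n)
    for i in range(n):
        sw = np.sqrt(K[i])                                   # weighted least squares
        A = np.column_stack([np.ones(n), prev - prev[i]])    # local linear model at x_i
        coef, *_ = np.linalg.lstsq(A * sw[:, None], candidate * sw, rcond=None)
        fitted[i] = coef[0]                                  # fit evaluated at x_i
    return np.linalg.norm(candidate - fitted) / np.linalg.norm(candidate)
```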
In summary, attraction-repulsion algorithms such as t-SNE, which are heuristic, enjoy large
popularity due in part to their immunity to the REP, while eigenvector based methods, although
better grounded in theory, are less useful in practice without post-processing by an IES method.
On the other hand, unlike global search in eigenvector space, a local relaxation algorithm cannot
resolve the rank deficiency globally, and it may become trapped in a local optimum (Figure 3).

4.4 Summary of embedding algorithms


A variety of embedding algorithms have been developed. Here we presented representative al-
gorithms of two types. One-shot algorithms (typically) embed the data by eigenvectors;
among them, Isomap, DM and LTSA are the best understood as well as computationally scalable.
The main drawback of this class of algorithms is the Repeated Eigendirections Problem, which
requires post-processing of the eigenvectors. Neighbor embedding algorithms are (typically) it-
erative, starting with the output of a one-shot algorithm (LE for UMAP) or even PCA. The
presence of repulsion makes these algorithms robust to the REP which affects one-shot algorithms.
The quantification of the repulsion, as well as the smoothness, large sample limits, and other prop-
erties of neighbor embedding algorithms, is less developed at this time. Hence, for the
moment, neighbor embedding algorithms remain heuristic for ML, while they remain useful for
visualization and clustering (for which guarantees exist, Linderman & Steinerberger (2019)).
Neither type of algorithm guarantees against local singularities, such as the “crossing” in
Figure 3. Currently, it is not known how these can be reliably detected or avoided. Additionally,
all algorithms distort distances except in special cases (as discussed in Section 5.5).
All algorithms depend on hyperparameters: intrinsic dimension d (Section 5.3) or embedding
dimension m, and k or r for the neighborhood scale (Section 5.2). Iterative algorithms often
depend on additional parameters that control the repulsion (such as ρ in t-SNE), or the descent
algorithm.
With respect to computation, constructing the neighborhood graph is the most expensive step
typically for n large. To compound this problem, finding k or r in a principled way often requires
constructing multiple graphs, one for each scale. One-shot algorithms that compute eigenvectors
are quite efficient for n up to 106 when the matrix has a sparsity pattern that corresponds to
the neighborhood graph. Neighbor embedding algorithms work, in theory, with dense matrices
(e.g. W); however, accelerated approximate versions of these algorithms have been developed,
such as the Barnes-Hut tree approximation of van der Maaten (2014), and the negative sampling
heuristic for UMAP (McInnes et al., 2018, Böhm et al., 2022).
Figure 4: Data sampled from the chopped torus.

5 Statistical basis of manifold learning

The output or result of manifold learning algorithms depends critically on algorithm parameters
such as the type of neighborhood graph (k-nearest neighbor or radius neighbor), the neighborhood
scale (k or r), and embedding dimension m (and intrinsic dimension d, in some cases).
This section is concerned with making these choices in a way that ensures some type of
statistical consistency, whenever possible. Neglecting statistical consistency, and statistics in
general, is risky. In the worst case, it can lead to methods that have no limit when n → ∞ (e.g.
LLE without any regularization), and in milder cases to biases (e.g. due to variations in data
density) and artifacts, i.e., features of the embedding such as clusters, arms, and horseshoes that
have no correspondence in the data.

Figure 5: Embeddings obtained from the algorithms in this section on a chopped torus data set
with n = 14,519 points: (a) Isomap, (b) LE, (c) LLE, (d) LTSA, (e) t-SNE, (f) UMAP. This
manifold cannot be embedded isometrically in d = 2 dimensions.

Here we discuss in more general terms what is known about graph construction methods
(Section 5.1), the neighborhood scale (Section 5.2), and the intrinsic dimension (Section 5.3).
We revisit the estimation of the Laplacian (a normalized version of the neighborhood graph),
as the natural representation of the manifold geometry, and the basis for the Diffusion Maps
embedding, which can be seen as the archetypal embedding (Section 5.4). Finally, in Section
5.5, we turn to mitigating the distortions that embedding algorithms currently induce.

5.1 Biases in ML. Effects of sampling density and graph construction


Biases due to non-uniform sampling Many embedding algorithms tend to contract re-
gions of M where the data are densely sampled and to stretch the sparsely sampled regions.
In attraction-repulsion algorithms, such as t-SNE, this is explained by the repulsive forces be-
tween every pair of embedding points yi , yj , while the attractive forces act only along graph
edges. If two dense regions are connected by fewer graph edges, repulsion will push them apart,
exaggerating clusters.
The effect is similar, albeit less intuitive to explain, for one-shot algorithms, as shown in
Figure 6. For DM and the graph Laplacian, the effect was calculated in Coifman & Lafon
(2006); they also showed that renormalization removes the bias due to non-uniform sampling
(asymptotically). Moreover, the degree values d′i obtained in the renormalized Laplacian
(Algorithm 3) are estimators of the sampling density around data point xi . A simpler method of
renormalization, applicable to low dimensional data, is to use a simple estimator of the local
density and to use it to renormalize Lrw (Luo et al., 2009).
If enough samples are available, one can simply resample the data to obtain an approximately
uniform distribution. For example, the farthest point heuristic chooses samples sequentially, with
the next point being the farthest away from the already chosen points.
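A minimal sketch of the farthest point heuristic mentioned above; the function and its arguments are illustrative, not from a library.

```python
import numpy as np

def farthest_point_sample(X, n_keep, seed=0):
    """Greedy farthest-point subsampling: repeatedly add the point farthest (in
    Euclidean distance) from the points already chosen, which yields an
    approximately uniform subsample of the data."""
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(X)))]
    d = np.linalg.norm(X - X[chosen[0]], axis=1)      # distance to the chosen set
    for _ in range(n_keep - 1):
        nxt = int(np.argmax(d))
        chosen.append(nxt)
        d = np.minimum(d, np.linalg.norm(X - X[nxt], axis=1))
    return np.array(chosen)                            # indices of the subsample
```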

Effect of the neighborhood graph (Figure 6) Radius neighbor graph or k-nearest neighbors?
Ting et al. (2010) and later Calder & Trillos (2019) show that the k-nearest neighbor graph,
with the similarity matrix given by the constant kernel K(u) = 1, exhibits qualitatively similar
biases from non-uniform sampling as the singly normalized radius-neighbor graphs.

5.2 Choosing the scale of neighborhood


Whatever the task, a manifold learning method requires the user to provide an external param-
eter, be it the number of neighbors k or the kernel bandwidth h, that sets the scale of the local
neighborhood.

Theoretical results/Asymptotic results and what they mean The asymptotic results
of Giné & Koltchinskii (2006), Hein et al. (2007), Ting et al. (2010) and Singer (2006) provide
the necessary rates of change for h with respect to n to guarantee convergence of the respective
estimate. For instance, Singer (2006) proves that the optimal bandwidth parameter for Laplacian
estimation is given by h ∼ n^{-1/(d+6)} using a random-walk Laplacian. For the k-nearest neighbor
graph, Calder & Trillos (2019) show that, again for Laplacian estimation, the number of neighbors
k must grow slowly with n, and a recommended rate is k ∼ n^{4/(d+4)} (log n)^{d/(d+4)} . The hidden
constant factors in these results are not completely known, but they depend on the manifold volume,
the curvature, and the injectivity radius τ (typically not known in practice).
With these rate-wise optimal selections of k or r, convergence rates for the estimation of
various objects on the manifold can be established. However, all are non-parametric rates.
More specifically, they point to the fact that the sample size n must grow exponentially with

Figure 6: Effects of graph construction and renormalization, when the sampling density is highly
non-uniform, exemplified on the configurations of the ethanol molecule. Left: original data, after
preprocessing, is a noisy torus, with three regions of high density, around local minima of the
potential energy. Center: Embeddings by DM (purple), and by the same algorithm with L
constructed from the k-nearest neighbor graph (yellow). The sparsely sampled regions are stretched,
while the dense regions appear like “corners” of the embedding. Note that DM should remove
the effects of the density; in this case, the variations in density are so extreme that the effect
persists. The effect is somewhat stronger for the k-nearest neighbor graph. Right: Embedding
by DM (purple) and by LE (yellow), which uses the singly normalized Lrw .

the dimension. For example, using the previously mentioned rate of k, together with the rate
of convergence ≈ √(log n / k) · (k/n)^{1/d} , one can calculate that, for a 10-fold decrease in error, n must
increase by ≈ 10^{(d+4)/3} . While the actual constants are not known, the statistical results suggest
that, in practice, for one-shot algorithms, values of k should be sufficiently large, in order to be
close to the maximum accuracy supported by the sample.
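The rates above translate into order-of-magnitude guidance only, since the constants are unknown; for example:

```python
import numpy as np

def suggested_k(n, d):
    """Order-of-magnitude guide only: the rate k ~ n^{4/(d+4)} (log n)^{d/(d+4)}
    of Calder & Trillos (2019), with the unknown constant factor set to 1."""
    return int(np.ceil(n ** (4 / (d + 4)) * np.log(n) ** (d / (d + 4))))

# e.g. suggested_k(50_000, 2) is about 3000, but only up to the unknown constant
```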
For neighbor embedding algorithms, such as t-SNE, less is known theoretically; however,
practically, the defaults are larger values of k, e.g. k = 90 (Poličar et al., 2019), and some
research suggests k ∼ n, which would create very dense graphs.

Practical methods Unfortunately, cross-validation (CV), a widely useful model selection


method in, e.g., density estimation, is not applicable in manifold learning, for the lack of a
criterion to cross-validate. (However, CV is still applicable in semi-supervised learning on mani-
folds Belkin et al. (2006).) The ideas we describe below each mimic CV by choosing a criterion
that measures the “self-consistency” of an embedding method at a certain scale.
For the k-nearest neighbor graph, Chen & Buja (2009) evaluates a given k with respect to
the preservation of k ′ neighborhoods in the original graph. The method is designed to optimize
for a specific embedding, so the values obtained for k depend on the embedding algorithm used.
A problem to be aware of with this approach is that (see Section 5.5) most embeddings distort
the data geometry, hence Euclidean neighborhoods will not be preserved, even at the optimal k.
For the radius-neighbor graph, Perraul-Joncas & Meila (2013) seeks to exploit the connection
between manifold geometry, represented by the Riemannian metric, and the Laplace-Beltrami
operator. The radius neighbor graph width h affects the Laplacian’s ability to recognize local
isometry. Recall that local isometry is easily obtained by projecting the data on the tangent space at some point x_i. The method is specific to the estimation of the Laplace-Beltrami operator,
but in this context, it can be extended to optimization over other parameters, such as kernel
smoothness.
Finally, we mention a dimension estimation algorithm proposed in Chen et al. (2013); a by-product of this algorithm is a range of scales where the tangent space at a data point is well aligned with the principal subspace obtained by a local singular value decomposition. As these are the scales at which the manifold looks locally linear, one can reasonably expect that they are also the correct scales at which to approximate the manifold.

5.3 Estimating the intrinsic dimension


Knowing the intrinsic dimension of data is important in itself. Additionally, some embedding
algorithms (Isomap, LTSA), as well as all local PCA and Principal d-manifolds algorithms
require the intrinsic dimension d as input.

How hard is dimension estimation? The dimension of a manifold is a non-negative integer; therefore, intuitively, it should require fewer samples to estimate than a real-valued geometric parameter. Indeed, it is known (Kim et al. 2019, Genovese et al. 2012) that the minimax rate for dimension estimation is between n^{-2n} and n^{-n/(D+1)} for a well-behaved manifold M. This is an information-theoretic result, delimiting what is possible: the minimax rate is the best possible rate for any dimension estimator, on the worst possible distribution for this estimator. In the case when i.i.d. noise is added to the sample, Koltchinskii (2000) shows that the minimax rate is exponential, i.e. of order q^n for some q < 1. This rate is the probability that d̂ ≠ d; Koltchinskii (2000) also proposes an estimator.
Unfortunately, empirical experience belies the optimistic theoretical results. Due primarily to the presence of noise, which does not conform to the above assumptions, and secondarily to non-uniform sampling, estimating d is a hard problem, for which no satisfactorily robust solution has been found yet (see Altan et al. (2020) for some empirical results).

Principles and methods for estimating d An idea that appears in various forms throughout the dimension estimation literature is to find a local statistic that scales with d by a known law.
For example, the volume of a ball B_r of radius r contained in a manifold M is proportional to r^d. If we take n samples uniformly from M, the number of samples contained in B_r, denoted #B_r, is proportional to n r^d, or equivalently

    log #B_r = d log r + log n + constant.        (9)

This suggests that if we fit a line to (log r, log #B_r), the slope of the line would represent d.
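A minimal sketch of this slope-fitting idea, assuming pairwise Euclidean distances are affordable (for large n one would use a neighbor-search structure instead), and assuming the radii lie in the scaling regime so that all neighbor counts are positive:

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    def dimension_by_slope(X, radii):
        """Estimate d as the slope of log(average number of neighbors
        within radius r) versus log r, following Eq. (9). The radii must
        not be so small that balls are empty, nor so large that they feel
        the curvature or the extent of the manifold."""
        D = squareform(pdist(X))                               # n x n distances
        counts = [((D <= r).sum(axis=1) - 1).mean() for r in radii]  # exclude self
        slope, _ = np.polyfit(np.log(radii), np.log(counts), 1)
        return slope

    # X = data matrix (n x D); radii chosen, e.g., between the 1% and 10%
    # quantiles of the pairwise distances:
    # d_hat = dimension_by_slope(X, radii=np.geomspace(r_min, r_max, 8))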
Recall that k_{i,r} represents the number of radius-r neighbors of data point x_i. Hence log k_{i,r} ≈ d log r + constant. This is the idea of Grassberger & Procaccia (1983), who introduced the correlation dimension estimator given by

    \hat{d}_C = \lim_{r \to 0} \frac{\log \frac{1}{n} \sum_{i \neq i'} \mathbf{1}_{\|x_i - x_{i'}\| \leq r}}{\log r}        (10)

In the above, (1/n) Σ_{i≠i'} 1{∥x_i − x_{i'}∥ ≤ r}, where the sum is taken over unordered pairs, is nothing else but (1/(2n)) Σ_{i=1}^n (k_{i,r} − 1); hence, the correlation dimension uses an average number of neighbors. This estimator is easily computed for the radius neighborhood graph. To sidestep the inconvenient assumption that the sample is uniform over M, other methods consider statistics such as k_{i,2r}/k_{i,r} ≈ 2^d, which lead to the so-called doubling dimension (Assouad, 1983).

Similarly, the covering number ν(r), representing the minimum number of boxes (in this instance) of size r needed to cover a manifold, scales like r^{-d} for r small. The Box Counting dimension (Falconer, 2003a) of an object is defined as

    \hat{d}_{BC} = \lim_{r \to 0} \frac{\ln \nu(r)}{\ln \frac{1}{r}}.        (11)

If ν(r) is defined by way of balls, the above becomes the well-known Hausdorff dimension (Falconer, 2003b). When d̂_{BC} is estimated from data, the covering number represents the number of boxes (balls) needed to cover the data set. Note that for finite n, r cannot become too small, as in this case every ball or box will contain a single point. The finite radius r is a scale parameter trading off bias (which increases with r) and variance (which decreases with r).
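A minimal sketch of a box-counting estimate, obtained by hashing the points to a grid of cells of side r and fitting the slope of log ν(r) against log(1/r); the choice of the range of radii is again the bias-variance trade-off described above:

    import numpy as np

    def box_counting_dimension(X, sizes):
        """Estimate d as the slope of log(covering number) vs. log(1/r),
        where the covering number is the count of occupied grid cells of
        side r, following Eq. (11)."""
        counts = []
        for r in sizes:
            cells = set(map(tuple, np.floor(X / r).astype(int)))
            counts.append(len(cells))
        slope, _ = np.polyfit(np.log(1.0 / np.asarray(sizes)), np.log(counts), 1)
        return slope

    # d_hat = box_counting_dimension(X, sizes=np.geomspace(r_min, r_max, 8))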
All the above estimates converge to the intrinsic dimension d when the data is sampled from a d-manifold. In practice, and with noise, their properties differ, as does the amount of computation they require.
Modern estimators consider other statistics, such as the distance to the k-th nearest neighbor (Pettis et al., 1979, Costa et al., 2005), or the volume of a spherical cap (Kleindessner & von Luxburg, 2015) (both statistics can be computed without knowing actual distances, only comparisons between them), or the Wasserstein distance between two samples of size n on M, which scales like n^{-1/d} (Block et al., 2022); the algorithm of Levina & Bickel (2004), analyzed in Farahmand et al. (2007), is a Maximum Likelihood method based on k-nearest neighbor graphs.
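As an illustration, a minimal sketch of a k-nearest-neighbor maximum likelihood estimator in the spirit of Levina & Bickel (2004); the original method also discusses averaging over a range of k and bias corrections, which are omitted here:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def mle_dimension(X, k=20):
        """Per-point MLE of the intrinsic dimension from the sorted k
        nearest neighbor distances T_1 <= ... <= T_k, averaged over the
        sample: d_hat(x) = [ mean_j log(T_k(x) / T_j(x)) ]^{-1}."""
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
        dist, _ = nn.kneighbors(X)            # column 0 is the point itself
        T = dist[:, 1:]                       # shape (n, k), sorted distances
        logs = np.log(T[:, -1:] / T[:, :-1])  # log(T_k / T_j), j = 1..k-1
        d_local = 1.0 / logs.mean(axis=1)
        return float(d_local.mean())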
An algorithm for dimension estimation in noise is proposed by Chen et al. (2013). This
algorithm is based on local PCA at multiple scales; here, dˆL is the most frequent index of
the maximum eigengap of the local covariance matrix. The main challenge is to establish the
appropriate range of scales r at which the d principal values of the local covariances separate
from the remaining eigenvalues, in noise. The algorithm can be simplified by plugging in the neighborhood radius selected to optimize the Laplacian estimator by, e.g., Joncas et al. (2017); see Section 5.2.
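A minimal sketch of the local PCA step at a single point and scale (the multiscale analysis and the selection of the scale range in Chen et al. (2013) are more involved; here the radius r is assumed to be given):

    import numpy as np

    def local_pca_dimension(X, i, r):
        """Estimate d at point X[i] as the index of the largest eigengap of
        the local covariance over the radius-r neighborhood; r must be large
        enough for the neighborhood to contain several points."""
        nbrs = X[np.linalg.norm(X - X[i], axis=1) <= r]
        C = np.cov(nbrs, rowvar=False)
        eigvals = np.sort(np.linalg.eigvalsh(C))[::-1]   # decreasing order
        gaps = eigvals[:-1] - eigvals[1:]
        return int(np.argmax(gaps)) + 1                  # number of "large" eigenvalues

    # d_hat = most frequent value of local_pca_dimension(X, i, r) over sampled i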

5.4 Estimating the Laplace-Beltrami operator


As we discussed in section 4, the Laplace-Beltrami operator ∆M serves as an important tool
to understand the geometry of a manifold M. We have seen that the eigenvectors of ∆M can
be used to embed the data in low dimensions by the DM algorithm. Furthermore, if enough
eigenvectors are computed, the embedding becomes closer to an isometry (Coifman et al., 2005).
Additionally, graph Laplacian estimators of ∆M are used to measure the smoothness of a function (by ½ f^T L f), to provide regularization in supervised and semi-supervised learning
on manifolds (Belkin et al., 2006, Slepčev & Thorpe, 2019), Bayesian priors (Kirichenko & van
Zanten, 2017), or to define Gaussian Processes on a manifold (Borovitskiy et al., 2020).
Using the graph Laplacian to estimate the Laplace-Beltrami operator, as in Coifman & Lafon (2006), has long been established; more recently, theoretical results have appeared on how this estimation behaves. The Laplace-Beltrami operator ∆M acting on a twice differentiable function f : M → R is defined as ∆M f ≡ div grad(f).
In general, two types of convergence have been studied: pointwise and spectral convergence, under the formulation of Berry & Harlim (2016). Let µ be the Riemannian measure corresponding to the metric of a d-dimensional manifold M, f ∈ C³(M) be a real-valued function, and q(x) be the sampling density on M. Further, let f = (f(x_i))_{i=1}^n ∈ R^n. Then, ideally, the two convergence paradigms of a random-walk Laplacian L_rw = D^{-1} L^{un}, defined on a neighborhood graph with bandwidth h, to its limit L∞ are given by

• Pointwise convergence: $E[(D^{-1} L^{un} \mathbf{f})_i] \xrightarrow{n\to\infty} c\, L_\infty f(x_i) + O(h^2)$

• Spectral convergence: $E\left[\frac{\mathbf{f}^T L^{un} \mathbf{f}}{\mathbf{f}^T D \mathbf{f}}\right] \xrightarrow{n\to\infty} c\, \frac{\int_{x\in M} f(x)\,(L_\infty f)(x)\, q(x)\, d\mu(x)}{\int_{x\in M} f^2(x)\, q(x)\, d\mu(x)} + O(h^2)$; this type of convergence matters for spectral embedding algorithms
When the sampling density q is uniform, Belkin & Niyogi (2007) showed that pointwise convergence of the random-walk Laplacian, built from a neighborhood graph with bandwidth h, holds with L∞ = ∆M. Coifman & Lafon (2006) showed that, when q is not uniform, pointwise convergence holds with L∞ = ∆M − ∆M q/q. Through renormalization, L as in Algorithm 3 can eliminate the bias term ∆M q/q and converge to the Laplace-Beltrami operator regardless of the sampling density. Ting et al. (2010) further showed that, for the k-nearest neighbor graph, the random-walk graph Laplacian converges pointwise to ∆M rescaled by q^{2/d}. For spectral convergence, readers are encouraged to consult Belkin & Niyogi (2007), Berry & Sauer (2019), García Trillos & Slepčev (2018), and García Trillos et al. (2020).
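A minimal sketch of this renormalization on a Gaussian kernel matrix, in the style of the Diffusion Maps construction (the bandwidth h and the exact normalization conventions vary across implementations; this illustrates the principle rather than reproducing the paper's Algorithm 3 verbatim):

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    def renormalized_laplacian(X, h):
        """Gaussian kernel -> divide out the density estimate (alpha = 1)
        -> random-walk normalize. The result approximates the
        Laplace-Beltrami operator regardless of the sampling density."""
        K = np.exp(-squareform(pdist(X)) ** 2 / h ** 2)
        q = K.sum(axis=1)                  # kernel density estimate (up to constants)
        K1 = K / np.outer(q, q)            # remove the density bias (alpha = 1)
        d1 = K1.sum(axis=1)
        P = K1 / d1[:, None]               # row-stochastic transition matrix
        L = (np.eye(len(X)) - P) / h ** 2  # discrete Laplacian; sign and constant
        return L                           # conventions vary in the literature

    # On data sampled from a circle, L @ f should approximate the second
    # derivative of f (up to sign and a constant), however non-uniform the sampling.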
Recently, the limits of a class of manifold learning algorithms, viewed as differential operators, have been studied. For a specific class, called linear smoothing algorithms, these ML algorithms are proved to converge to a second-order differential operator on M. For example, LE and DM converge to the Laplacian operator, while LTSA and Hessian Eigenmaps both converge to the Frobenius norm of the Hessian. Unregularized LLE, on the other hand, fails to converge to any differential operator. Details can be found in Ting & Jordan (2018).

5.5 Embedding distortions. Is isometric embedding possible?


Figure 5 shows the outputs of various embedding algorithms on a simple 2-manifold M ⊂ R³. It is easily seen that the results depend on the algorithm (and parameter choices), as well as on the input (the manifold and the sampling density on M). While all are smooth embeddings, the algorithm-dependent distortions, amounting to different coordinate systems, make these outputs irreproducible and incomparable.
The presence of distortion is commonly observed empirically. Note that the distortions do
not disappear when the sample size n increases or the sampling density is uniform, or even when
the consistent graph and Laplacian are used. They are also not an effect of sampling noise. This
section is concerned with recovering reproducibility, by preserving the intrinsic geometry of the
data.

Attempts at isometric embedding Mathematically, the presence of distortions means that an embedding F is not isometric. Distortionless, i.e. isometric, embedding is possible, as proved
by a famous result of John Nash (Nash embedding theorem, Lee (2003)). Note that for a
smooth embedding, the number of dimensions required is m ≥ d(d + 1)/2, the number of degrees
of freedom of g. The proof of Nash’s theorem is constructive, but not easily amenable to
consistent, numerically stable implementation.
A more recent seminal result is that the DM embedding is isometric for large m (e.g. m → ∞) (Bérard et al., 1994, Portegies, 2016). While these results are important mathematically, the fact that the embedding dimension m is required to be large makes them less interesting for data scientists, as it defeats the goal of dimension reduction.
Many ML methods focus on promoting isometry in local neighborhoods. Apart from the previously mentioned Hessian Eigenmaps (Donoho & Grimes, 2003) and LTSA (Zhang & Zha, 2004), the method of Weinberger & Saul (2006) preserves local distances in a Semidefinite Programming (SDP) framework. Conformal Eigenmaps (Sha & Saul, 2005) maps triangles in each neighborhood, thus succeeding in preserving angles. The works of Yu & Zhang (2010) and Lin et al.
(2013) approach global isometry by means of constructing normal coordinates recursively from a
point p ∈ M, or, respectively, by mutually orthogonal parallel vector fields, and Verma (2011) is
the first attempt to implement Nash’s construction. We note that, with the exception of Verma
(2011), these methods do not guarantee an isometric embedding except in limited special cases.

Preserving isometry by estimating local distortion While the search for a practical isometric embedding algorithm has so far been unsuccessful, removing the distortions is possible for any
well-behaved embedding algorithm by a post-processing approach (McQueen et al., 2016). The
idea is simple and general: given the (distorted) output y1 , . . . yn of an embedding algorithm on
data x1 , . . . xn , one can estimate the distortion incurred at each point. Once the distortions are
known, whenever a distance, angle, or volume is calculated, one applies local corrections that
amount to obtaining the same result as if the embedding was isometric.
This is always possible via the push-forward metric. Let (M, g) be a Riemannian manifold,
and F : M → N = F (M) ⊂ Rm a smooth map, representing, e.g., the limit case of an embedding
algorithm. We can endow N with the push-forward Riemannian metric g̃ of F at point p ∈ M.
Let u, v ∈ TF (p) N be vectors in the tangent space of N at point F (p). Then the push-forward
of g at p is defined by

⟨u, v⟩g̃(Fp ) ≡ ⟨dFp† (u) , dFp† (v)⟩g(p) . (12)

In the above, dF_p^† is the pseudoinverse of dF_p. In matrix notation, (12) implies that

    g̃(F_p) ≡ ((dF_p)^T)^† g(p) (dF_p)^†        (13)

with dF_p, g(p), g̃(p) being matrices of size d × m, d × d and m × m respectively, and g(p), g̃(p) positive semidefinite matrices of rank d. When M ⊂ R^D, with the metric inherited from the ambient space, g(p) = I_d is the unit matrix and g̃(p) = ((dF_p)^T)^† (dF_p)^†.
easy to see that (F (M), g̃) is isometric with the original (M, g).
Hence, if one computes for each embedding point yi the respective pushforward metric g̃i ∈
Rm×m , then all geometric quantities computed with the points y1 , . . . yn w.r.t. g̃ would preserve
their values in the original data, subject only to sampling noise.
It remains to see how to estimate g̃. A direct way is via (13), using an estimator of dF (p).
Another method, Perraul-Joncas & Meila (2013), is via the Laplace-Beltrami operator ∆M, namely using the Diffusion Maps Laplacian, whose properties and consistency are well studied, as seen in Section 5.4. To extract g̃, Perraul-Joncas & Meila (2013) applies ∆M to a suitably chosen set of test functions f_kl, with 1 ≤ k ≤ l ≤ m, where f_{kl,p} = (F^k − F^k(p))(F^l − F^l(p)) are pairwise products of coordinate functions, centered at point p. They show that ½ ∆M f_{kl,p}|_p = h̃_{kl}(p), the k, l entry of the inverse pushforward metric at p (algorithmically, on a sample, this operation can be easily vectorized). To obtain g̃_i, the metric at data point i, one computes the rank-d pseudo-inverse (Ben-Israel & Greville, 2003) of h̃_i by SVD. Note that h̃_i itself measures the local distortion at data point i. The embedding metric g̃ and its SVD offer other insights into
the embedding. For instance, the singular values of g̃ may offer a window into estimating d
by looking for a “singular value gap”. The d singular vectors form an orthonormal basis of the
tangent space Tp F (M) at point p = yi , providing a natural framework for constructing a normal
coordinate chart around p. The non-zero singular values of h̃i yield a measure of the distortion
induced by the embedding around the data point xi (indeed, if the embedding were isometric
to M with the metric inherited from Rm , then the embedding metric g̃ would have exactly d
singular values equal to 1).
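A minimal sketch of this estimation of h̃_i and g̃_i from an embedding Y and a graph Laplacian L, assuming L has already been constructed (e.g. with the renormalized kernel of Section 5.4); sign conventions for L vary, so h̃_i may need its sign flipped to be positive semidefinite:

    import numpy as np

    def embedding_metric(Y, L, i, d):
        """Estimate the dual metric h_tilde at point i from the embedding
        coordinates Y (n x m) and a graph Laplacian L (n x n), then return
        its rank-d pseudoinverse g_tilde (the pushforward metric)."""
        m = Y.shape[1]
        Yc = Y - Y[i]                       # coordinates centered at point i
        H = np.empty((m, m))
        for k in range(m):
            for l in range(m):
                f_kl = Yc[:, k] * Yc[:, l]  # test function f_{kl,p}
                H[k, l] = 0.5 * (L @ f_kl)[i]
        H = (H + H.T) / 2                   # symmetrize before the SVD
        U, s, Vt = np.linalg.svd(H)
        g = (U[:, :d] / s[:d]) @ Vt[:d, :]  # rank-d pseudoinverse
        return H, g

    # Singular values of g far from 1 indicate stretching or compression
    # of the embedding around point i.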
This last remark can be used in many ways, such as getting a global distortion for the
embedding, and hence as a tool to compare various embeddings. It can also be used to define an objective function to minimize in order to obtain a more isometric embedding, such as the Riemannian Relaxation of McQueen et al. (2016).

6 Applications of manifold learning


6.1 Manifold learning in statistics
In Section 5.4 we mentioned that graph Laplacians, such as L and L_norm, can generate smoothness functionals; given a function f : M → R and its values on the data points f = [f(x_i)]_{i=1}^n, the value ½⟨Lf, f⟩ approximates the squared L₂ norm ∥∇f∥₂² on the manifold. This can be used as a regularizer in supervised or semi-supervised learning. If L_norm is used instead of L, then the smoothness is measured w.r.t. the sampling distribution on M.
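A minimal sketch of this smoothness penalty and of its use as a regularizer in a ridge-style regression on the data points; L is any graph Laplacian estimated as in Section 5.4, Phi is a hypothetical feature (or basis) matrix evaluated at the data points, and lam is a hypothetical regularization weight:

    import numpy as np

    def smoothness(L, f):
        """Graph smoothness functional 0.5 * <L f, f>, approximating the
        squared L2 norm of the gradient of f on the manifold."""
        return 0.5 * f @ (L @ f)

    def laplacian_regularized_fit(Phi, y, L, lam=1e-2):
        """Fit coefficients w for predictions f = Phi @ w by minimizing
        ||y - Phi w||^2 + lam * (Phi w)^T L (Phi w)  (manifold regularization)."""
        A = Phi.T @ Phi + lam * Phi.T @ L @ Phi
        return np.linalg.solve(A, Phi.T @ y)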
Manifold learning by DM is closely related to spectral clustering (Shi & Malik, 2000, Meilă
& Shi, 2001, Ng et al., 2001, von Luxburg, 2007, Meilă, 2016), as both map the data to low
dimensions by the eigenvectors of a Laplacian. For clustering, it is preferable to use Lrw, the random-walk Laplacian, which takes into account the data density and will exaggerate the
clusters. In fact, by mapping the data to lower dimension with Lrw , one can observe a continuum
between separating clusters (if the data is clustered) and smooth embedding (for the data regions
where data lie on a manifold), and even perform simultaneous embedding and clustering. In
such cases, it is important to calculate a sufficient number of eigenvectors: for K clusters, there will be K − 1 eigenvectors indicating the clustering, and for each cluster, additional eigenvectors for a low-dimensional mapping of the data in the respective cluster. If fewer eigenvectors are used, then usually the clusters will be recovered but not the intrinsic geometry inside each cluster.

Figure 7: The embeddings from Figure 5, with the distortion h̃ estimated at a random subset of points. Panels: (a) Isomap, (b) LE, (c) LLE, (d) LTSA, (e) t-SNE, (f) UMAP.

6.2 Manifold learning for visualization


Embedding algorithms are often used in the sciences for data visualization. The scientist, as
well as the statistician, need to distinguish between an embedding as defined in Section 2, which
preserves the geometric and topological data properties, and other mappings (occasionally also
called "embeddings") into low dimensions using embedding algorithms. The latter kind of dimension reduction is hugely popular, and its value for the sciences cannot be overstated.
However, the users of dimension reduction for visualization should be cautioned that the scientific
conclusions drawn from these visualizations must be subject to careful additional scrutiny, or to
a more rigorous statistical and geometric analysis. One pitfall is that when data are mapped into
m = 2 or 3 dimensions, for visualization, without an estimation of the intrinsic dimension d, the
mapping may collapse together regions of the data that are not close in the original manifold.
When clusters are present, because separating the clusters usually requires at least 2 dimensions, most of the clusters' geometric structure is collapsed. Hence, once the data is separated into
clusters, the cluster structure needs to be studied by additional dimension reduction. A second
pitfall is the presence of artifacts – interesting geometric features caused by the embedding algo-
rithm but not supported by the data. These can be clusters (Figure 3), arms, holes or circles, and so on.
Before assigning scientific meaning to these features, a researcher should examine whether
they are stable, by repeating the embedding with different initial points, algorithms, and algo-
rithm parameters, as well as by perturbing or resampling the original data. To assess whether the features are merely large distortions, visualizing the distortion (Figure 7) can provide a valuable diagnostic. For example, when a "filament" is produced by stretching a low-density region, a very common effect (see Section 5.1), the estimated distortion will show the stretching (Figure 7), while for a true filament, the distortion will be moderate.

6.3 Manifold learning in the sciences


Astronomy and astrophysics Manifold learning has been used to study data from large
astronomical surveys, like the Sloan Digital Sky Survey (SDSS)2 . The mass distribution in the
universe reveals filaments, i.e. one-dimensional manifolds, and dimension reduction methods, most often Principal Curves (Section 3), have been used to estimate them (Chen et al., 2015).
Spectra of galaxies are measured in thousands of frequency bands; they contain rich data
about galaxies’ chemical and physical composition. By embedding these spectra in low dimen-
sions, as in Figure 3, one can analyze the main constraints and pathways in the evolution of
galaxies (Vanderplas & Connolly, 2009).

Dynamical systems Dynamical systems described by Ordinary or Partial Differential Equations are intimately related to manifolds, while they also exhibit multiscale behavior. Extensions of manifold learning can be used to understand PDEs with geometric structure (Nadler et al., 2006), and to study the long-term behavior of the system or the ensemble of its solutions (Dsilva et al., 2016, 2018).
2 www.sdss.org

Chemistry The accurate simulation of atomic and molecular systems plays a major role in modern chemistry. Molecular Dynamics (MD) simulations from carefully designed, complex quantum models can take millions of computer hours; however, simulations can still be less ex-
pensive than conducting experiments, and they return data at a level of detail not achievable
in most experiments. Manifold learning is used to discover collective coordinates, i.e. low di-
mensional descriptors that approximate well the larger scale behavior of atomic, molecular, and
other large particle systems (Boninsegna et al., 2015, A. et al., 2012, Noé & Clementi, 2017). In
these examples, the systems can be in equilibrium, or evolving in time, and in the latter case,
the collective coordinates describe the saddle points in the trajectory, or the folding mechanism
of a large molecule (Rohrdanz et al., 2011, Das et al., 2006).
Manifold embedding is also used to create low dimensional maps of families of molecules and
materials by the similarity of their properties (Ceriotti et al., 2013, Isayev et al., 2015).

Biological sciences In neuroscience and the biological sciences, manifold embeddings are
widely used to summarize neural recordings (Connor & Rozell, 2016, Cunningham & Yu, 2014),
to describe cell evolution (Herring et al., 2018).

7 Conclusion
In practice, ML is overwhelmingly used for visualization (Section 6) and with small data sets. But ML can do much more. Efficient software now exists (McQueen et al. 2016, Poličar et al. 2019, etc.) which can embed truly large, high-dimensional data (for example SDSS). In
these cases, ML helps practitioners understand the data, by e.g. its intrinsic dimension, or by
interpreting the manifold coordinates (Koelle et al., 2022, Boninsegna et al., 2015, Vanderplas
& Connolly, 2009). For real data, a manifold learning algorithm has the effect of smoothing the
data and suppressing or removing variation orthogonal to the manifold, which can be regarded as
noise, just like in PCA. Finally, again similarly to PCA, ML can effectively reduce the data to
m ≪ D dimensions, while preserving features predictive for future statistical inferences. Some
inferences, such as regression, can be performed on manifold data without manifold estimation, by
for example, local linear regression (Aswani et al., 2011), or via Gaussian Processes (Borovitskiy
et al., 2020). A GP on a manifold can be naturally defined via the Laplacian ∆M .
Even when only visualization is desired, care must be taken to ensure the reproducibility of
the results. The implicit assumption that m = 2 is sufficient for embedding the data should be
validated. Attention must be paid when the embedding is interpreted: are the observed features really in the data, or are they artifacts of the algorithm?

What we omitted We surveyed the state-of-the-art knowledge on the main problems and
methods of manifold learning, focusing on the algorithms that are proven to recover the manifold
structure through learning a smooth embedding.
Among the topics we had to leave out, manifold learning in noise is perhaps the most im-
portant one. Noise makes ML significantly more difficult, by introducing biases and slowing the
convergence of estimators. This is an active area of research, but the estimation of geometric quantities like the tangent space and the reach in the presence of noise has been studied (Aamari & Levrard 2018, 2019, etc.); the theoretical results on manifold recovery in noise were mentioned in Section 3.
The reach, or injectivity radius, τ(M) of a manifold measures how close to itself M can be. In other words, τ(M) is the largest radius a ball can have so that, for any p ∈ M, if the ball is tangent to the manifold at p, it does not intersect M in any other point. Large τ implies smaller curvature (a subspace has infinite τ) and easier estimation of M (Genovese et al., 2012, Fefferman et al., 2016, Aamari & Levrard, 2018, 2019). A manifold can have borders; ML with borders is studied for example in Singer & Wu (2012); different convergence rates appear when data are sampled close to the border.
Another useful task is embedding a new data point x ∈ RD onto an existing embedding
F (M); this is often called Nyström embedding (e.g. Chatalic et al. (2022)). Conversely, if
y ∈ Rm is a new point on the embedding F (M), obtained e.g. by following a curve in the
low dimensional representation of M, how do we map it back to RD ? This is usually done by
interpolation.
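A minimal sketch of a Nyström-style out-of-sample extension for a spectral embedding; here X and Y are the training data and their embedding coordinates, lambdas are the corresponding eigenvalues, and h is the kernel bandwidth. Normalization conventions differ between algorithms, so this only illustrates the principle:

    import numpy as np

    def nystrom_extend(x_new, X, Y, lambdas, h):
        """Embed a new point by averaging the training embedding
        coordinates with row-normalized kernel weights and rescaling by
        the eigenvalues (the Nystrom extension principle)."""
        w = np.exp(-np.sum((X - x_new) ** 2, axis=1) / h ** 2)
        w = w / w.sum()                    # normalized kernel weights
        return (w @ Y) / lambdas           # one coordinate per eigenvector

    # y_new = nystrom_extend(x_new, X, Y, lambdas, h)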

References
A. TG, Ceriotti M, Parrinello M. 2012. Using sketch-map coordinates to analyze and bias molecular dynamics simulations. Proceedings of the National Academy of Sciences, USA 109:5196–5201
Aamari E, Levrard C. 2018. Stability and minimax optimality of tangential delaunay complexes
for manifold reconstruction. Discrete & Computational Geometry 59(4):923–971
Aamari E, Levrard C. 2019. Nonasymptotic rates for manifold, tangent space and curvature
estimation. Ann. Stat. 47(1):177–204
Ben-Israel A, Greville TNE. 2003. Generalized inverses: Theory and applications. New York, NY: Springer
Altan E, Solla SA, Miller LE, Perreault EJ. 2020. Estimating the dimensionality of the manifold
underlying multi-electrode neural recordings. bioRxiv
Arias-Castro E, Pelletier B. 2013. On the convergence of maximum variance unfolding. Journal
of Machine Learning Research 14:1747–1770

Assouad P. 1983. Plongements lipschitziens dans R^n. Bulletin de la Société Mathématique de France 111:429–448
Aswani A, Bickel P, Tomlin C. 2011. Regression on manifolds: Estimation of the exterior deriva-
tive. The Annals of Statistics 39(1):48–81

Baraniuk RG, Wakin MB. 2009. Random projections of smooth manifolds. Foundations of Com-
putational Mathematics 9(1):51–77
Belkin M, Niyogi P. 2002. Laplacian eigenmaps and spectral techniques for embedding and clus-
tering, In Advances in Neural Information Processing Systems 14. Cambridge, MA: MIT Press

Belkin M, Niyogi P. 2003. Laplacian eigenmaps for dimensionality reduction and data represen-
tation. Neural Computation 15(6):1373–1396
Belkin M, Niyogi P. 2007. Convergence of laplacian eigenmaps. In Advances in Neural Information
Processing Systems 19, eds. B Schölkopf, JC Platt, T Hoffman. MIT Press, 129–136
Belkin M, Niyogi P, Sindhwani V. 2006. Manifold regularization: A geometric framework
for learning from labeled and unlabeled examples. Journal of Machine Learning Research
7(85):2399–2434

Bérard P, Besson G, Gallot S. 1994. Embedding Riemannian manifolds by their heat kernel.
Geometric Functional Analysis 4(4):373–398
Bernstein M, de Silva V, Langford JC, Tennenbaum J. 2000. Graph approximations to geodesics
on embedded manifolds. http://web.mit.edu/cocosci/isomap/BdSLT.pdf

Berry T, Harlim J. 2016. Variable bandwidth diffusion kernels. Applied and Computational Har-
monic Analysis 40(1):68–96
Berry T, Sauer T. 2019. Consistent manifold representation for topological data analysis
Block A, Jia Z, Polyanskiy Y, Rakhlin A. 2022. Intrinsic dimension estimation using wasserstein
distance. Journal of Machine Learning Research 23(313):1–37

Boninsegna L, Gobbo G, Noé F, Clementi C. 2015. Investigating molecular kinetics by variationally optimized diffusion maps. Journal of chemical theory and computation 11(12):5947–5960
Borovitskiy V, Terenin A, Mostowsky P, Deisenroth M. 2020. Matérn gaussian processes on riemannian manifolds, In Advances in Neural Information Processing Systems, eds. H Larochelle, M Ranzato, R Hadsell, M Balcan, H Lin, vol. 33, pp. 12426–12437, Curran Associates, Inc.
Böhm JN, Berens P, Kobak D. 2022. Attraction-repulsion spectrum in neighbor embeddings.
Journal of Machine Learning Research 23(95):1–32
Calder J, Trillos NG. 2019. Improved spectral convergence rates for graph laplacians on epsilon-
graphs and k-nn graphs. ArXiv abs/1910.13476
Carreira-Perpiñan MA. 2010. The elastic embedding algorithm for dimensionality reduction,
ICML’10, p. 167–174, Madison, WI, USA: Omnipress
Ceriotti M, Tribello GA, Parrinello M. 2013. Demonstrating the transferability and the descrip-
tive power of sketch-map. Journal of Chemical Theory and Computation 9(3):1521–1532. PMID: 26587614
Chatalic A, Schreuder N, Rosasco L, Rudi A. 2022. Nyström kernel mean embeddings, In Proceed-
ings of the 39th International Conference on Machine Learning, eds. K Chaudhuri, S Jegelka,
L Song, C Szepesvari, G Niu, S Sabato, vol. 162 of Proceedings of Machine Learning Research,
pp. 3006–3024, PMLR
Chen G, Little AV, Maggioni M. 2013. Multi-resolution geometric analysis for data in high
dimensions. Boston: Birkhäuser Boston, 259–285
Chen L, Buja A. 2009. Local Multidimensional Scaling for nonlinear dimension reduction, graph
drawing and proximity analysis. Journal of the American Statistical Association 104(485):209–
219
Chen YC, Genovese CR, Wasserman L. 2015. Asymptotic theory for density ridges. The Annals
of Statistics 43(5):1896–1928
Chen YC, Meila M. 2021. The decomposition of the higher-order homology embedding con-
structed from the k-laplacian, In Advances in Neural Information Processing Systems, eds.
M Ranzato, A Beygelzimer, Y Dauphin, P Liang, JW Vaughan, vol. 34, pp. 15695–15709,
Curran Associates, Inc.

Chmiela S, Tkatchenko A, Sauceda H, Poltavsky I, Schütt KT, Müller KR. 2017. Machine learn-
ing of accurate energy-conserving molecular force fields. Science Advances
Coifman RR, Lafon S. 2006. Diffusion maps. Applied and Computational Harmonic Analysis
30(1):5–30

Coifman RR, Lafon S, Lee A, Maggioni, Warner, Zucker. 2005. Geometric diffusions as a tool
for harmonic analysis and structure definition of data: Diffusion maps, In Proceedings of the
National Academy of Sciences, pp. 7426–7431
Connor M, Rozell C. 2016. Unsupervised learning of manifold models for neural coding of phys-
ical transformations in the ventral visual pathway, In Neural Information Processing Systems
(NIPS) Workshop, Brains and Bits: Neuroscience Meets Machine Learning. Barcelona, Spain
Costa J, Girotra A, Hero A. 2005. Estimating local intrinsic dimension with k-nearest neighbor
graphs, In IEEE/SP 13th Workshop on Statistical Signal Processing, 2005, pp. 417–422
Cunningham JP, Yu BM. 2014. Dimensionality reduction for large-scale neural recordings. Nature
Neuroscience 16:1500–1509

Das P, Moll M, Stamati H, Kavraki L, Clementi C. 2006. Low-dimensional, free-energy landscapes of protein-folding reactions by nonlinear dimensionality reduction. Proceedings of the National
Academy of Sciences 103(26):9885–9890
Diaconis P, Goel S, Holmes S. 2008. Horseshoes in multidimensional scaling and local kernel
methods. The Annals of Applied Statistics 2(3):777 – 807
do Carmo M. 1992. Riemannian geometry. Springer
Donoho DL, Grimes C. 2003. Hessian eigenmaps: Locally linear embedding techniques for high-
dimensional data. Proceedings of the National Academy of Sciences 100(10):5591–5596

Dsilva CJ, Talmon R, Coifman RR, Kevrekidis IG. 2018. Parsimonious representation of non-
linear dynamical systems through manifold learning: A chemotaxis case study. Appl. Comput.
Harmon. Anal. 44(3):759–773
Dsilva CJ, Talmon R, Gear CW, Coifman RR, Kevrekidis IG. 2016. Data-driven reduction
for a class of multiscale fast-slow stochastic dynamical systems. SIAM J. Appl. Dyn. Syst.
15(3):1327–1351
Falconer K. 2003a. Alternative definitions of dimension, chap. 3. John Wiley & Sons, Ltd, 39–58
Falconer K. 2003b. Hausdorff measure and dimension, chap. 2. John Wiley & Sons, Ltd, 27–38
Farahmand Am, Szepesvári C, Audibert JY. 2007. Manifold-adaptive dimension estimation, In
Proceedings of the 24th International Conference on Machine Learning, ICML ’07, p. 265–272,
New York, NY, USA: Association for Computing Machinery
Fefferman C, Mitter S, Narayanan H. 2016. Testing the manifold hypothesis. J. Amer. Math.
Soc. 29(4):983–1049
García Trillos N, Gerlach M, Hein M, Slepčev D. 2020. Error estimates for spectral convergence
of the graph laplacian on random geometric graphs toward the laplace–beltrami operator.
Foundations of Computational Mathematics 20(4):827–887

García Trillos N, Slepčev D. 2018. A variational approach to the consistency of spectral clustering.
Applied and Computational Harmonic Analysis 45(2):239–281
Genovese CR, Perone-Pacifico M, Verdinelli I, Wasserman LA. 2012. Minimax manifold estima-
tion. Journal of Machine Learning Research 13:1263–1291

Giné E, Koltchinskii V. 2006. Concentration inequalities and asymptotic results for ratio type
empirical processes. The Annals of Probability 34(3):1143 – 1216
Goldberg Y, Zakai A, Kushnir D, Ritov Y. 2008. Manifold learning: The price of normalization.
Journal of Machine Learning Research 9(63):1909–1939
Grassberger P, Procaccia I. 1983. Measuring the strangeness of strange attractors. Physica D:
Nonlinear Phenomena 9(1):189–208
Hastie T, Stuetzle W. 1989. Principal curves. Journal of the American Statistical Association
84(406):502–516
Hegde C, Wakin M, Baraniuk R. 2007. Random projections for manifold learning, In Advances
in Neural Information Processing Systems, eds. J Platt, D Koller, Y Singer, S Roweis, vol. 20.
Curran Associates, Inc.
Hein M, Audibert J, von Luxburg U. 2007. Graph laplacians and their convergence on random
neighborhood graphs. Journal of Machine Learning Research 8:1325–1368
Herring CA, Banerjee A, McKinley ET, Simmons AJ, Ping J, et al. 2018. Unsupervised trajectory
analysis of Single-Cell RNA-Seq and imaging data reveals alternative tuft cell origins in the
gut. Cell Syst 6(1):37–51.e9
Hinton GE, Roweis S. 2002. Stochastic neighbor embedding, In Advances in Neural Information
Processing Systems, eds. S Becker, S Thrun, K Obermayer, vol. 15. MIT Press

Im DJ, Verma N, Branson K. 2018. Stochastic neighbor embedding under f-divergences


Isayev O, Fourches D, Muratov EN, Oses C, Rasch K, et al. 2015. Materials cartography: Rep-
resenting and mining materials space using structural and electronic fingerprints. Chemistry
of Materials (27):735–743
Jolliffe IT. 2002. Principal component analysis. Springer Series in Statistics. Springer New York,
NY
Jacomy M, Venturini T, Heymann S, Bastian M. 2014. Forceatlas2, a continuous graph layout
algorithm for handy network visualization designed for the gephi software. PLOS ONE 9(6):1–
12

Joncas D, Meila M, McQueen J. 2017. Improved graph laplacian via geometric Self-Consistency.
In Advances in Neural Information Processing Systems 30, eds. I Guyon, UV Luxburg, S Ben-
gio, H Wallach, R Fergus, S Vishwanathan, R Garnett. Curran Associates, Inc., 4457–4466
Kim J, Rinaldo A, Wasserman LA. 2019. Minimax rates for estimating the dimension of a
manifold. J. Comput. Geom. 10(1):42–95

Kirichenko A, van Zanten H. 2017. Estimating a smooth function on a large graph by Bayesian
Laplacian regularisation. Electronic Journal of Statistics 11(1):891 – 915

Kleindessner M, von Luxburg U. 2015. Dimensionality estimation without distances, In AISTATS
Kobak D, Linderman G, Steinerberger S, Kluger Y, Berens P. 2020. Heavy-tailed kernels reveal
a finer cluster structure in t-sne visualisations, In Machine Learning and Knowledge Discovery
in Databases, eds. U Brefeld, E Fromont, A Hotho, A Knobbe, M Maathuis, C Robardet, pp.
124–139, Cham: Springer International Publishing

Koelle SJ, Zhang H, Meila M, Chen YC. 2022. Manifold coordinates with physical meaning.
Journal of Machine Learning Research 23(133):1–57
Kohli D, Cloninger A, Mishne G. 2021. Ldle: Low distortion local eigenmaps. Journal of Machine
Learning Research 22(282):1–64

Koltchinskii VI. 2000. Empirical geometry of multivariate data: a deconvolution approach. The
Annals of Statistics 28(2):591 – 629
Kruskal JB. 1964. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypoth-
esis. Psychometrika 29(1):1–27

Lee JM. 2003. Introduction to smooth manifolds. Springer-Verlag New York


Levina E, Bickel PJ. 2004. Maximum likelihood estimation of intrinsic dimension, In Advances
in Neural Information Processing Systems 17 NIPS 2004, December 13-18, 2004, Vancouver,
British Columbia, Canada], pp. 777–784
Lin B, He X, Zhang C, Ji M. 2013. Parallel vector field embedding. Journal of Machine Learning
Research 14(90):2945–2977
Linderman GC, Steinerberger S. 2019. Clustering with t-sne, provably. SIAM Journal on Math-
ematics of Data Science 1(2):313–332
Luo C, Safa I, Wang Y. 2009. Approximating gradients for meshes and point clouds via diffusion
metric. Computer Graphics Forum 28(5):1497–1508
McInnes L, Healy J, Melville J. 2018. Umap: Uniform manifold approximation and projection
for dimension reduction. arXiv preprint arXiv:1802.03426
McQueen J, Meila M, Joncas D. 2016. Nearly isometric embedding by relaxation, In Advances
in Neural Information Processing Systems, eds. D Lee, M Sugiyama, U Luxburg, I Guyon,
R Garnett, vol. 29. Curran Associates, Inc.
McQueen J, Meila M, VanderPlas J, Zhang Z. 2016. Megaman: Scalable manifold learning in
python. Journal of Machine Learning Research 17
Meilă M, Shi J. 2001. A random walks view of spectral segmentation, In Proceedings of the
Eighth International Workshop on Artificial Intelligence and Statistics, eds. TS Richardson,
TS Jaakkola, vol. R3 of Proceedings of Machine Learning Research, pp. 203–208, PMLR.
Reissued by PMLR on 31 March 2021.
Meilă M. 2016. Spectral clustering: a tutorial for the 2010's
Mohammed K, Narayanan H. 2017. Manifold learning using kernel density estimation and local
principal components analysis. arxiv 1709.03615

Nadler B, Lafon S, Coifman R, Kevrekidis I. 2006. Diffusion maps, spectral clustering and eigen-
functions of Fokker-Planck operators, In Advances in Neural Information Processing Systems
18, eds. Y Weiss, B Schölkopf, J Platt, pp. 955–962, Cambridge, MA: MIT Press
Ng A, Jordan M, Weiss Y. 2001. On spectral clustering: Analysis and an algorithm, In Advances
in Neural Information Processing Systems, eds. T Dietterich, S Becker, Z Ghahramani, vol. 14.
MIT Press
Noé F, Clementi C. 2017. Collective variables for the study of long-time kinetics from molecular
trajectories: theory and methods. Current Opinion in Structural Biology 43:141–147
Ozertem U, Erdogmus D. 2011. Locally defined principal curves and surfaces. Journal of Machine
Learning Research 12(34):1249–1286
Perraul-Joncas D, Meila M. 2013. Non-linear dimensionality reduction: Riemannian metric esti-
mation and the problem of geometric discovery. ArXiv e-prints
Perrault-Joncas D, Meila M. 2014. Improved graph laplacian via geometric self-consistency. ArXiv
e-prints

Pettis KW, Bailey TA, Jain AK, Dubes RC. 1979. An intrinsic dimensionality estimator from
near-neighbor information. IEEE Transactions on Pattern Analysis and Machine Intelligence
PAMI-1(1):25–37
Poličar PG, Stražar M, Zupan B. 2019. opentsne: a modular python library for t-sne dimension-
ality reduction and embedding. bioRxiv
Portegies JW. 2016. Embeddings of Riemannian manifolds with heat kernels and eigenfunctions.
Communications on Pure and Applied Mathematics 69(3):478–518
Ram P, Lee D, March W, Gray A. 2009. Linear-time algorithms for pairwise statistical prob-
lems, In Advances in Neural Information Processing Systems, eds. Y Bengio, D Schuurmans,
J Lafferty, C Williams, A Culotta, vol. 22. Curran Associates, Inc.
Rohrdanz MA, Zheng W, Maggioni M, Clementi C. 2011. Determination of reaction coordinates
via locally scaled diffusion map. The Journal of chemical physics 134(12)
Rosenberg S. 1997. The laplacian on a riemannian manifold: An introduction to analysis on
manifolds. London Mathematical Society Student Texts. Cambridge University Press
Roweis S, Saul L. 2000. Nonlinear dimensionality reduction by locally linear embedding. Science
290(5500):2323–2326
Sha F, Saul LK. 2005. Analysis and extension of spectral methods for nonlinear dimensionality
reduction, ICML ’05. New York, NY, USA: Association for Computing Machinery

Shi J, Malik J. 2000. Normalized cuts and image segmentation. IEEE Transactions on Pattern
Analysis and Machine Intelligence 22(8):888–905
Singer A. 2006. From graph to manifold laplacian: The convergence rate. Applied and Compu-
tational Harmonic Analysis 21(1):128–134. Special Issue: Diffusion Maps and Wavelets

Singer A, Wu HT. 2012. Vector diffusion maps and the connection laplacian. Communications
on Pure and Applied Mathematics 65(8):1067–1144

Slepčev D, Thorpe M. 2019. Analysis of $p$-laplacian regularization in semisupervised learning.
SIAM Journal on Mathematical Analysis 51(3):2085–2120
Tenenbaum JB, de Silva V, Langford JC. 2000. A global geometric framework for nonlinear
dimensionality reduction. Science 290(5500):2319–2323

Ting D, Huang L, Jordan MI. 2010. An analysis of the convergence of graph laplacians, In
Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 1079–
1086
Ting D, Jordan MI. 2018. On nonlinear dimensionality reduction, linear smoothing and autoen-
coding. arXiv: Machine Learning

Ting D, Jordan MI. 2020. Manifold learning via manifold deflation


van der Maaten L. 2014. Accelerating t-sne using tree-based algorithms. Journal of Machine
Learning Research 15(93):3221–3245
van der Maaten L, Hinton G. 2008. Visualizing data using t-sne. Journal of Machine Learning
Research 9:2579–2605
Vanderplas J, Connolly A. 2009. Reducing the dimensionality of data: Locally linear embedding
of sloan galaxy spectra. The Astronomical Journal 138(5):1365
Verma N. 2011. Towards an algorithmic realization of Nash's embedding theorem

von Luxburg U. 2007. A tutorial on spectral clustering. Statistics and Computing 17(4):395–416
Weinberger KQ, Saul LK. 2006. An introduction to nonlinear dimensionality reduction by max-
imum variance unfolding, In Proceedings, The Twenty-First National Conference on Artificial
Intelligence and the Eighteenth Innovative Applications of Artificial Intelligence Conference,
July 16-20, 2006, Boston, Massachusetts, USA, pp. 1683–1686, AAAI Press

Yu K, Zhang T. 2010. Improved local coordinate coding using local tangents, In Proceedings of the
27th International Conference on International Conference on Machine Learning, ICML’10,
p. 1215–1222, Madison, WI, USA: Omnipress
Zha H, Zhang Z. 2007. Continuum isomap for manifold learnings. Computational Statistics &
Data Analysis 52(1):184–200

Zhang Y, Gilbert AC, Steinerberger S. 2022. May the force be with you, In 58th Annual Allerton
Conference on Communication, Control, and Computing, Allerton 2022, Monticello, IL, USA,
September 27-30, 2022, pp. 1–8, IEEE
Zhang Y, Steinerberger S. 2021. t-sne, forceful colorings and mean field limits. CoRR
abs/2102.13009
Zhang Z, Zha H. 2004. Principal manifolds and nonlinear dimensionality reduction via tangent
space alignment. SIAM J. Scientific Computing 26(1):313–338
