Hu et al. 2020, A Brief Introduction to Manifold Optimization
https://doi.org/10.1007/s40305-020-00295-9
Abstract
Manifold optimization is ubiquitous in computational and applied mathematics, statistics, engineering, machine learning, physics, chemistry, etc. One of the main challenges usually is the non-convexity of the manifold constraints. By utilizing the geometry of the manifold, a large class of constrained optimization problems can be viewed as unconstrained optimization problems on the manifold. From this perspective, intrinsic
structures, optimality conditions and numerical algorithms for manifold optimization
are investigated. Some recent progress on the theoretical results of manifold optimiza-
tion is also presented.
Xin Liu’s research was supported in part by the National Natural Science Foundation of China (No.
11971466), Key Research Program of Frontier Sciences, Chinese Academy of Sciences (No.
ZDBS-LY-7022), the National Center for Mathematics and Interdisciplinary Sciences, Chinese Academy
of Sciences and the Youth Innovation Promotion Association, CAS.
Zai-Wen Wen's research was supported in part by the National Natural Science Foundation of China
(Nos. 11421101 and 11831002), and the Beijing Academy of Artificial Intelligence.
Ya-Xiang Yuan’s research was supported in part by the National Natural Science Foundation of China
(Nos. 11331012 and 11461161005).
Corresponding author: Zai-Wen Wen
[email protected]
Jiang Hu
[email protected]
Xin Liu
[email protected]
Ya-Xiang Yuan
[email protected]
1 Beijing International Center for Mathematical Research, Peking University, Beijing 100871,
China
2 State Key Laboratory of Scientific and Engineering Computing, Academy of Mathematics and
Systems Science, Chinese Academy of Sciences, Beijing 100190, China
3 University of Chinese Academy of Sciences, Beijing 100190, China
1 Introduction
P-harmonic flow is used in color image recovery and medical image analysis. For
instance, in medical image analysis, the human brain is often mapped to a unit sphere
via a conformal mapping; see Fig. 1.

Fig. 1 Conformal mapping between the human brain and the unit sphere [1]

By establishing a conformal mapping between an irregular surface and the unit sphere, we can handle the complicated surface with the simple parameterizations of the unit sphere. Here, we focus on the conformal
mapping between genus-0 surfaces. From [2], a diffeomorphic map between two
genus-0 surfaces N1 and N2 is conformal if and only if it is a local minimizer of the
corresponding harmonic energy. Hence, one effective way to compute the conformal
mapping between two genus-0 surfaces is to minimize the harmonic energy of the map.
Before introducing the harmonic energy minimization model and the diffeomorphic
mapping, we review some related concepts on manifolds. Let $\phi_{N_1}(x^1, x^2): \mathbb{R}^2 \to N_1 \subset \mathbb{R}^3$ and $\phi_{N_2}(x^1, x^2): \mathbb{R}^2 \to N_2 \subset \mathbb{R}^3$ be the local coordinates on $N_1$ and $N_2$, respectively. The first fundamental form on $N_1$ is $g = \sum_{ij} g_{ij}\, \mathrm{d}x^i \mathrm{d}x^j$, where $g_{ij} = \frac{\partial \phi_{N_1}}{\partial x^i} \cdot \frac{\partial \phi_{N_1}}{\partial x^j}$, $i, j = 1, 2$. The first fundamental form on $N_2$ is $h = \sum_{ij} h_{ij}\, \mathrm{d}x^i \mathrm{d}x^j$, where $h_{ij} = \frac{\partial \phi_{N_2}}{\partial x^i} \cdot \frac{\partial \phi_{N_2}}{\partial x^j}$, $i, j = 1, 2$. Given a smooth map $f: N_1 \to N_2$, whose
local coordinate representation is $f(x^1, x^2) = (f^1(x^1, x^2), f^2(x^1, x^2))$, the density of the harmonic energy of $f$ is
\[
e(f) = \|\mathrm{d} f\|^2 = \sum_{i,j=1,2} g^{ij} \langle f_* \partial_{x^i}, f_* \partial_{x^j} \rangle_h,
\]
where $(g^{ij})$ is the inverse of $(g_{ij})$ and the inner product between $f_* \partial_{x^i}$ and $f_* \partial_{x^j}$ is defined as
\[
\langle f_* \partial_{x^i}, f_* \partial_{x^j} \rangle_h = \Big\langle \sum_{m=1}^{2} \frac{\partial f^m}{\partial x^i} \partial_{y^m}, \sum_{n=1}^{2} \frac{\partial f^n}{\partial x^j} \partial_{y^n} \Big\rangle_h = \sum_{m,n=1}^{2} h_{mn} \frac{\partial f^m}{\partial x^i} \frac{\partial f^n}{\partial x^j}.
\]
The harmonic energy minimization problem is
\[
\min_{f \in S(N_1, N_2)} \ E(f) = \frac{1}{2} \int_{N_1} e(f)\, \mathrm{d} N_1,
\]
where E( f ) is called the harmonic energy of f . Stationary points of E are the harmonic
maps from $N_1$ to $N_2$. In particular, if $N_2 = \mathbb{R}^2$, the conformal map $f = (f^1, f^2)$ consists of two harmonic functions defined on $N_1$. If we consider a $p$-harmonic map from an
$n$-dimensional manifold $M$ to the $n$-dimensional sphere $Sp(n) := \{x \in \mathbb{R}^{n+1} \mid \|x\|_2 = 1\} \subset \mathbb{R}^{n+1}$, the $p$-harmonic energy minimization problem can be written as
\[
\begin{aligned}
\min_{F(x) = (f^1(x), \cdots, f^{n+1}(x))} \ & E_p(F) = \frac{1}{p} \int_M \Big( \sum_{k=1}^{n+1} \|\operatorname{grad} f^k\|^2 \Big)^{p/2} \mathrm{d} M \\
\text{s.t.} \ & F(x) \in Sp(n), \quad \forall x \in M.
\end{aligned}
\]
The maxcut problem can be formulated as
\[
\max_{x \in \mathbb{R}^n} \ \frac{1}{2} \sum_{i<j} w_{ij} (1 - x_i x_j) \quad \text{s.t.} \quad x_i^2 = 1, \ i = 1, \cdots, n. \qquad (2.1)
\]
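To make (2.1) concrete, the tiny example below evaluates the objective by brute force over all feasible sign vectors; the 3-node graph and its edge weights are made-up illustration data, not values from the survey.

```python
# Brute-force illustration of the maxcut model (2.1) on a made-up 3-node graph.
# Every feasible point of (2.1) has x_i in {-1, +1}, i.e., encodes a vertex partition.
import itertools

w = {(1, 2): 1.0, (1, 3): 2.0, (2, 3): 3.0}   # illustrative edge weights

def cut_value(x):
    # Objective of (2.1): (1/2) * sum_{i<j} w_ij * (1 - x_i * x_j)
    return 0.5 * sum(wij * (1 - x[i - 1] * x[j - 1]) for (i, j), wij in w.items())

best_val, best_x = max(
    (cut_value(x), x) for x in itertools.product([-1, 1], repeat=3)
)
```

Here the maximizer separates node 3 from nodes 1 and 2, cutting the two heaviest edges; the objective value equals the total weight of the cut edges.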
The low-rank nearest correlation matrix problem can be written as
\[
\min_{X \succeq 0} \ \frac{1}{2} \|H \odot (X - C)\|_F^2 \quad \text{s.t.} \quad X_{ii} = 1, \ i = 1, \cdots, n, \ \operatorname{rank}(X) \leqslant p. \qquad (2.4)
\]
Algorithms for solving (2.4) can be found in [3,4]. Similar to the maxcut problem, we decompose the low-rank matrix $X$ as $X = V^\top V$, in which $V = [V_1, \cdots, V_n] \in \mathbb{R}^{p \times n}$. Therefore, problem (2.4) is converted to a quartic polynomial optimization problem over multiple spheres:
\[
\min_{V \in \mathbb{R}^{p \times n}} \ \frac{1}{2} \|H \odot (V^\top V - C)\|_F^2 \quad \text{s.t.} \quad \|V_i\|_2 = 1, \ i = 1, \cdots, n.
\]
The phase retrieval problem is to
\[
\text{find} \ x \in \mathbb{C}^n \quad \text{s.t.} \quad |Ax| = b, \qquad (2.5)
\]
where $A \in \mathbb{C}^{m \times n}$ and $b \in \mathbb{R}^m$. This problem plays an important role in X-ray crystallography imaging, diffraction imaging and microscopy. Problem (2.5) is equivalent to the following problem, which minimizes over the phase variable $y$ and the signal variable $x$ simultaneously:
\[
\min_{x \in \mathbb{C}^n, \, y \in \mathbb{C}^m} \ \|Ax - y\|_2^2 \quad \text{s.t.} \quad |y| = b.
\]
Writing $y = \mathrm{Diag}\{b\} u$ with a unit-modulus vector $u$, this becomes
\[
\min_{x \in \mathbb{C}^n, \, u \in \mathbb{C}^m} \ \frac{1}{2} \|Ax - \mathrm{Diag}\{b\} u\|_2^2 \quad \text{s.t.} \quad |u_i| = 1, \ i = 1, \cdots, m. \qquad (2.6)
\]
Eliminating $x$ further leads to a quadratic problem over the unit-modulus variables,
\[
\min_{u \in \mathbb{C}^m} \ u^* M u \quad \text{s.t.} \quad |u_i| = 1, \ i = 1, \cdots, m, \qquad (2.7)
\]
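The splitting (2.6) suggests a simple alternating scheme: with $u$ fixed, the $x$-update is an unconstrained least-squares problem, and with $x$ fixed, the $u$-update has the closed-form solution $u_i = (Ax)_i / |(Ax)_i|$. The sketch below implements this on assumed random data; the sizes, seed and iteration count are arbitrary illustration choices, and both steps are exact minimizers, so the objective is monotonically non-increasing.

```python
# A sketch of alternating minimization for the phase-retrieval splitting (2.6).
import numpy as np

rng = np.random.default_rng(0)
m, n = 20, 5
A = rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))
x_true = rng.standard_normal(n) + 1j * rng.standard_normal(n)
b = np.abs(A @ x_true)                       # noiseless magnitude measurements

def objective(x, u):
    return 0.5 * np.linalg.norm(A @ x - b * u) ** 2

u = np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, m))   # random initial phases
vals = []
for _ in range(50):
    # x-step: least squares  min_x ||Ax - Diag(b) u||^2
    x, *_ = np.linalg.lstsq(A, b * u, rcond=None)
    # u-step: |u_i| = 1, so the minimizer aligns u_i with the phase of (Ax)_i
    z = A @ x
    u = np.where(np.abs(z) > 0, z / np.abs(z), 1.0)
    vals.append(objective(x, u))
```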
In Bose–Einstein condensation (BEC), one minimizes an energy functional $E$ over a set $S$ of normalized wave functions,
\[
\min_{\phi \in S} \ E(\phi),
\]
whose stationary points satisfy the nonlinear eigenvalue equation
\[
\mu \phi(w) = -\frac{1}{2} \nabla^2 \phi(w) + V(w) \phi(w) + \beta |\phi(w)|^2 \phi(w) - L_z \phi(w), \quad w \in \mathbb{R}^d,
\]
and the normalization constraint
\[
\int_{\mathbb{R}^d} |\phi(w)|^2 \, \mathrm{d} w = 1.
\]
Utilizing some proper discretization, such as finite difference, sine pseudospectral and
Fourier pseudospectral methods, we obtain a discretized BEC problem
\[
\min_{x \in \mathbb{C}^M} \ f(x) := \frac{1}{2} x^* A x + \frac{\beta}{2} \sum_{j=1}^{M} |x_j|^4 \quad \text{s.t.} \quad \|x\|_2 = 1,
\]
where $M \in \mathbb{N}$ and $\beta$ are given constants and $A \in \mathbb{C}^{M \times M}$ is Hermitian. Consider the case that $x$ and $A$ are real. Since $x^\top x = 1$, multiplying the quadratic term of the objective function by $x^\top x$, we obtain the following equivalent problem
\[
\min_{x \in \mathbb{R}^M} \ f(x) = \frac{1}{2} (x^\top A x)(x^\top x) + \frac{\beta}{2} \sum_{i=1}^{M} x_i^4 \quad \text{s.t.} \quad \|x\|_2 = 1.
\]
The problem above can also be regarded as the best rank-1 tensor approximation of a fourth-order tensor $F$ [7], with
\[
F_{\pi(i,j,k,l)} =
\begin{cases}
a_{kl}/4, & i = j = k \neq l, \\
a_{kl}/12, & i = j, \ i \neq k, \ i \neq l, \ k \neq l, \\
(a_{ii} + a_{kk})/12, & i = j \neq k = l, \\
a_{ii}/2 + \beta/4, & i = j = k = l, \\
0, & \text{otherwise},
\end{cases}
\]
where $\pi(i,j,k,l)$ denotes any permutation of $(i,j,k,l)$.
For the complex case, we can obtain a best rank-1 complex tensor approximation problem in a similar fashion. Therefore, BEC is a polynomial optimization problem over a single sphere.
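A small numerical sketch of the discretized (real) BEC objective and its Riemannian gradient on the unit sphere follows; the matrix $A$, the constant $\beta$ and the step size are illustrative assumptions, and the projection $(I - xx^\top)$ and normalization retraction are the standard sphere tools.

```python
# Discretized real BEC energy on the unit sphere: evaluate f, project the
# Euclidean gradient to the tangent space, take a small step and retract.
import numpy as np

rng = np.random.default_rng(1)
M, beta = 8, 0.5
B = rng.standard_normal((M, M))
A = (B + B.T) / 2               # assumed real symmetric discretized Hamiltonian

def f(x):
    return 0.5 * x @ A @ x + 0.5 * beta * np.sum(x ** 4)

def egrad(x):
    return A @ x + 2.0 * beta * x ** 3     # Euclidean gradient

def rgrad(x):
    g = egrad(x)
    return g - (x @ g) * x                 # project onto {z : z^T x = 0}

x = rng.standard_normal(M)
x /= np.linalg.norm(x)
g = rgrad(x)
y = x - 1e-3 * g
y /= np.linalg.norm(y)                     # retract back to the sphere
```

A small step along the negative Riemannian gradient decreases the energy to first order, and the normalization keeps the iterate feasible.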
2.6 Cryo-EM
The estimated rotations $\tilde{R}_i$ and the common-line vectors $\tilde{c}_{ij}$ satisfy
\[
\tilde{R}_i \tilde{c}_{ij} = \tilde{R}_j \tilde{c}_{ji}.
\]
Since the third column $\tilde{R}_i^3$ can be represented by the first two columns $\tilde{R}_i^1$ and $\tilde{R}_i^2$ as $\tilde{R}_i^3 = \pm \tilde{R}_i^1 \times \tilde{R}_i^2$, the rotations $\{\tilde{R}_i\}$ can be compressed into 3-by-2 matrices. Therefore, the corresponding optimization problem is
\[
\min_{\{R_i\}_{i=1}^{N}} \ \sum_{i \neq j} \rho(R_i c_{ij}, R_j c_{ji}) \quad \text{s.t.} \quad R_i^\top R_i = I_2, \ R_i \in \mathbb{R}^{3 \times 2}, \qquad (2.8)
\]
where $\rho$ is a function measuring the distance between two vectors, $R_i$ consists of the first two columns of $\tilde{R}_i$, and $c_{ij}$ contains the first two entries of $\tilde{c}_{ij}$. In [8], the distance function is set as $\rho(u, v) = \|u - v\|_2^2$. An eigenvector relaxation and an SDP relaxation are also presented in [8].
Linear eigenvalue decomposition and singular value decomposition are special cases of optimization with orthogonality constraints. The linear eigenvalue problem can be written as
\[
\min_{X \in \mathbb{R}^{n \times p}} \ \operatorname{tr}(X^\top A X) \quad \text{s.t.} \quad X^\top X = I_p. \qquad (2.9)
\]
Its quartic penalty formulation is
\[
\min_{X \in \mathbb{R}^{n \times p}} \ f_\mu(X) := \frac{1}{2} \operatorname{tr}(X^\top A X) + \frac{\mu}{4} \|X^\top X - I\|_F^2.
\]
In the tensor-train (TT) format, the entries of the tensor $u$ are
\[
u_{i_1 i_2 \cdots i_d} = U_1(i_1) U_2(i_2) \cdots U_d(i_d),
\]
where $U_\mu(i_\mu)$ denotes the $i_\mu$-th slice of the core $U_\mu$.
Fig. 3 Graphical representation of a TT tensor of order $d$ with cores $U_\mu$, $\mu = 1, 2, \cdots, d$. The first row represents $u$, and the second row its entries $u_{i_1 i_2 \cdots i_d}$
In Kohn–Sham (KS) density functional theory, the discretized total energy is
\[
\begin{aligned}
E_{\mathrm{ks}}(X) := \ & \frac{1}{4} \operatorname{tr}(X^* L X) + \frac{1}{2} \operatorname{tr}(X^* V_{\mathrm{ion}} X) \\
& + \frac{1}{2} \sum_{l} \sum_{i} \zeta_l |x_i^* w_l|^2 + \frac{1}{4} \rho^\top L^\dagger \rho + \frac{1}{2} e^\top \epsilon_{xc}(\rho),
\end{aligned}
\]
and the KS total energy minimization problem is
\[
\min_{X \in \mathbb{C}^{n \times p}} \ E_{\mathrm{ks}}(X) \quad \text{s.t.} \quad X^* X = I_p.
\]
Compared to the KS density functional theory, the HF theory can provide a more
accurate model. Specifically, it introduces a Fock exchange operator, which becomes a fourth-order tensor $\mathcal{V}(\cdot): \mathbb{C}^{n \times n} \to \mathbb{C}^{n \times n}$ under discretization. The corresponding Fock
energy can be expressed as
\[
E_{\mathrm{f}} := \frac{1}{4} \langle \mathcal{V}(X X^*) X, X \rangle = \frac{1}{4} \langle \mathcal{V}(X X^*), X X^* \rangle.
\]
In each iteration of the self-consistent field (SCF) method, one solves the linear eigenvalue problem
\[
H_{\mathrm{ks}}(\rho_k) X_{k+1} = X_{k+1} \Lambda_{k+1}, \quad X_{k+1}^* X_{k+1} = I_p, \qquad (2.12)
\]
where $\Lambda_{k+1}$ is a diagonal matrix. In practice, a mixed charge density
\[
\rho_{\mathrm{mix}} = \sum_{j=0}^{m-1} \alpha_j \rho_{k-j}
\]
is used.
The mixing coefficients $\alpha$ can be determined by solving the least-squares problem
\[
\min_{\alpha^\top e = 1} \ \|R \alpha\|_2.
\]
Then, one replaces $H_{\mathrm{ks}}(\rho_k)$ in (2.12) with $H_{\mathrm{ks}}(\rho_{\mathrm{mix}})$ and executes the iteration (2.12). This technique is called charge
mixing. For more details, one can refer to [17–19].
Since SCF may not converge, many researchers have recently developed optimization algorithms with guaranteed convergence for the electronic structure calculation. In [20], the Riemannian gradient method is directly extended to solve the KS total energy minimization problem. The cost of the algorithm mainly comes from the calculation of the total energy, its gradient, and the projection onto the Stiefel manifold. The complexity of each step is much lower than that of the linear eigenvalue problem, and the method is easy to parallelize. Extensive numerical experiments based on the software packages Octopus and RealSPACES show that the algorithm is often more efficient than SCF. In fact, the iteration (2.12) of SCF can be understood as an approximate Newton algorithm in the sense that the complicated part of the Hessian of the total energy is dropped:
\[
\min_{X \in \mathbb{C}^{n \times p}} \ q(X) := \frac{1}{2} \operatorname{tr}(X^* H_{\mathrm{ks}}(\rho_k) X) \quad \text{s.t.} \quad X^* X = I_p.
\]
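A toy self-consistent field loop in the spirit of (2.12) can be sketched as follows. The nonlinear Hamiltonian $H(\rho) = A + \beta\, \mathrm{Diag}(\rho)$, the single occupied state ($p = 1$), and the simple linear mixing are all made-up assumptions for illustration, not the survey's actual KS operator.

```python
# Toy SCF iteration: solve a linear eigenvalue problem for a density-dependent
# Hamiltonian, update the density, and mix it for stability.
import numpy as np

rng = np.random.default_rng(2)
n, beta = 10, 0.1
B = rng.standard_normal((n, n))
A = (B + B.T) / 2 + 5.0 * np.diag(np.arange(n, dtype=float))  # separated spectrum

def H(rho):
    return A + beta * np.diag(rho)       # assumed density-dependent Hamiltonian

rho = np.zeros(n)
for _ in range(100):
    w, V = np.linalg.eigh(H(rho))
    x = V[:, 0]                          # lowest eigenvector: the occupied state
    rho = 0.5 * rho + 0.5 * x ** 2       # simple linear charge mixing

w, V = np.linalg.eigh(H(rho))
x = V[:, 0]
residual = np.linalg.norm(H(rho) @ x - w[0] * x)   # eigenpair residual
density_gap = np.linalg.norm(x ** 2 - rho)         # self-consistency error
```

With a weak nonlinearity ($\beta$ small) the fixed-point map is contractive and the density converges to self-consistency.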
This model is often referred to as the KS energy minimization model with temperature, or the ensemble KS energy minimization model (EDFT). Similar to the KS energy minimization model, by using an appropriate discretization, the wave function can be represented by $X = [x_1, \cdots, x_p] \in \mathbb{C}^{n \times p}$. The discretized charge density in EDFT can be written as $\rho = \sum_{i=1}^{p} f_i |x_i|^2$, where $f_i \in [0, 1]$ are the occupation numbers.
The EDFT energy function is
\[
M(X, f) = \operatorname{tr}(\operatorname{diag}(f) X^* A X) + \frac{1}{2} \rho^\top L^\dagger \rho + e^\top \epsilon_{xc}(\rho) + \alpha R(f).
\]
The discretized EDFT model is
\[
\begin{aligned}
\min_{X \in \mathbb{C}^{n \times p},\, f \in \mathbb{R}^p} \ & M(X, f) \\
\text{s.t.} \ & X^* X = I_p, \\
& e^\top f = p, \quad 0 \leqslant f \leqslant 1.
\end{aligned} \qquad (2.13)
\]
Although SCF can be generalized to this model, its convergence is still not guaranteed. An equivalent simple model with only a single ball constraint is proposed in [27]. It is solved by a proximal gradient method in which the terms other than the entropy function term are linearized. An explicit solution of the subproblem is then derived, and the convergence of the algorithm is established.
Many optimization problems arising from data analysis are NP-hard integer programs. Spherical constraints and orthogonality constraints are often used to obtain approximate solutions of high quality. Consider the optimization problem over the permutation matrices:
\[
\min_{X \in \Pi_n} \ f(X), \quad \Pi_n := \{X \in \mathbb{R}^{n \times n} : X^\top X = I_n, \ X \geqslant 0\},
\]
where $X \geqslant 0$ holds elementwise.
\[
\begin{aligned}
\min_{X} \ & \langle C, X \rangle \\
\text{s.t.} \ & X_{ii} = 1, \ i = 1, \cdots, n, \\
& 0 \leqslant X_{ij} \leqslant 1, \ \forall i, j, \\
& X \succeq 0.
\end{aligned}
\]
Therefore, the use of batch normalization ensures that the model does not explode with large learning rates and that the gradient is invariant to linear scaling during propagation. Since $\mathrm{BN}(c\, w^\top x) = \mathrm{BN}(w^\top x)$ holds for any constant $c > 0$, the optimization problem for deep neural networks using batch normalization can be written as
\[
\min_{X \in \mathcal{M}} \ L(X), \quad \mathcal{M} = S^{n_1 - 1} \times \cdots \times S^{n_m - 1} \times \mathbb{R}^l,
\]
where $L(X)$ is the loss function, $S^{n-1}$ is the unit sphere in $\mathbb{R}^n$ (which can also be viewed as a Grassmann manifold), $n_1, \cdots, n_m$ are the dimensions of the weight vectors, $m$ is the number of weight vectors, and $l$ is the number of remaining parameters to be determined, including biases and other weight parameters. For more information, we refer to [30].
In traditional PCA, the obtained principal eigenvectors are usually not sparse, which leads to high computational cost when computing the principal components. Sparse PCA [31] seeks principal eigenvectors with few nonzero elements. The mathematical formulation is
\[
\begin{aligned}
\min_{X \in \mathbb{R}^{n \times p}} \ & -\operatorname{tr}(X^\top A^\top A X) + \rho \|X\|_1 \\
\text{s.t.} \ & X^\top X = I_p,
\end{aligned} \qquad (2.16)
\]
where $\|X\|_1 = \sum_{ij} |X_{ij}|$ and $\rho > 0$ is a trade-off parameter. When $\rho = 0$, (2.16) reduces to the traditional PCA problem. For $\rho > 0$, the term $\|X\|_1$ promotes sparsity. Problem (2.16) is a non-smooth optimization problem on the Stiefel manifold.
The matrix completion problem is
\[
\begin{aligned}
\min_{X} \ & \operatorname{rank}(X) \\
\text{s.t.} \ & X_{ij} = A_{ij}, \ (i, j) \in \Omega,
\end{aligned} \qquad (2.17)
\]
where $X$ is the matrix that we want to recover (some of whose entries are known) and $\Omega$ is the index set of observed entries. Due to the difficulty of the rank function, a popular approach is to relax it into a convex model using the nuclear norm. The equivalence between this convex problem and the non-convex problem (2.17) is ensured under certain conditions. Another way is to use a low-rank decomposition of $X$ and then solve the
where $\|X\|_1 = \sum_{i,j} |X_{ij}|$. Problem (2.19) is a non-smooth optimization problem on the fixed-rank matrix manifold. For related algorithms for (2.18) and (2.19), readers can refer to [34,35].
The blind deconvolution problem is to recover $(a_0, x_0)$ from
\[
y = a_0 \circledast x_0,
\]
where $y \in \mathbb{R}^m$ and $\circledast$ represents some kind of convolution. Since there are infinitely many pairs $(a_0, x_0)$ satisfying this condition, the problem is ill-posed. To overcome this issue, some regularization terms and extra constraints are necessary. The sphere-constrained sparse blind deconvolution model reformulates the problem as
\[
\min_{a, x} \ \|y - a \circledast x\|_2^2 + \mu \|x\|_1 \quad \text{s.t.} \quad \|a\|_2 = 1.
\]
Since the principal eigenvectors obtained by traditional PCA may not be sparse, one can enforce sparsity by adding nonnegativity constraints. The problem is formulated as
where $A = [a_1, \cdots, a_k] \in \mathbb{R}^{n \times k}$ contains the given data points. Under the constraints, the variable $X$ has at most one nonzero element in each row, which helps to guarantee the sparsity of the principal eigenvectors. Problem (2.20) is an optimization problem with manifold and nonnegativity constraints. Some related information can be found in [37,38].
where $\mathrm{D} f(x)[\xi]$ is the derivative of $f(\gamma(t))$ at $t = 0$, with $\gamma(t)$ any curve on the manifold satisfying $\gamma(0) = x$ and $\dot{\gamma}(0) = \xi$. The Riemannian Hessian $\operatorname{Hess} f(x)$ is a mapping from the tangent space $T_x M$ to the tangent space $T_x M$:
\[
\operatorname{Hess} f(x)[u] = P_{T_x M}(\mathrm{D} \operatorname{grad} f(x)[u]), \qquad (3.3)
\]
where $\mathrm{D}$ is the Euclidean derivative and $P_{T_x M}(u) := \operatorname{arg\,min}_{z \in T_x M} \|u - z\|_2$ denotes
the projection operator onto $T_x M$. When $M$ is a quotient manifold whose total space is a submanifold of a Euclidean space, the tangent space in the expression (3.3) should be replaced by its horizontal space. According to (3.1) and (3.2), different Riemannian metrics lead to different expressions of the Riemannian gradient and Hessian. More detailed information on the related background can be found in [42].
We next briefly introduce some typical manifolds, where the Euclidean metric on
the tangent space is considered.
• Sphere [42] $Sp(n-1)$. Let $x(t)$ with $x(0) = x$ be a curve on the sphere, i.e., $x(t)^\top x(t) = 1$ for all $t$. Taking the derivative with respect to $t$, we have $\dot{x}(t)^\top x(t) + x(t)^\top \dot{x}(t) = 0$, so the tangent space is
\[
T_x Sp(n-1) = \{z \in \mathbb{R}^n : z^\top x = 0\}.
\]
For a function defined on $Sp(n-1)$ with respect to the Euclidean metric $g_x(u, v) = u^\top v$, $u, v \in T_x Sp(n-1)$, its Riemannian gradient and Hessian at $x$ can be represented by
\[
\operatorname{grad} f(x) = (I_n - x x^\top) \nabla f(x), \quad \operatorname{Hess} f(x)[u] = (I_n - x x^\top) \nabla^2 f(x) u - (x^\top \nabla f(x)) u, \quad u \in T_x Sp(n-1).
\]
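These projection-based formulas can be checked numerically. Below, $f(x) = \frac{1}{2} x^\top A x$ is an assumed test function and the data are random; the checks verify tangency of the gradient and Hessian image, and symmetry of the Hessian on the tangent space.

```python
# Sanity check of the sphere's Riemannian gradient and Hessian formulas for
# the assumed test function f(x) = x^T A x / 2 (so egrad = A x, ehess = A).
import numpy as np

rng = np.random.default_rng(3)
n = 6
B = rng.standard_normal((n, n))
A = (B + B.T) / 2

def proj(x, v):                          # projection onto T_x Sp(n-1)
    return v - (x @ v) * x

x = rng.standard_normal(n)
x /= np.linalg.norm(x)
egrad = A @ x                            # Euclidean gradient
rgrad = proj(x, egrad)                   # Riemannian gradient

u = proj(x, rng.standard_normal(n))      # two tangent vectors
v = proj(x, rng.standard_normal(n))
rhess_u = proj(x, A @ u) - (x @ egrad) * u
rhess_v = proj(x, A @ v) - (x @ egrad) * v
```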
Given a function defined on the oblique manifold $\mathrm{Ob}(n, p)$ (matrices with unit-norm columns) with respect to the Euclidean metric, its Riemannian gradient and Hessian at $X$ can be represented by
\[
\operatorname{grad} f(X) = \nabla f(X) - X \operatorname{Diag}(\operatorname{diag}(X^\top \nabla f(X))), \quad
\operatorname{Hess} f(X)[U] = P_{T_X}(\nabla^2 f(X)[U]) - U \operatorname{Diag}(\operatorname{diag}(X^\top \nabla f(X))),
\]
where $P_{T_X}(Z) = Z - X \operatorname{Diag}(\operatorname{diag}(X^\top Z))$.
Two points $X, Y \in \mathrm{St}(n, p)$ are equivalent if
\[
X \sim Y \ \Leftrightarrow \ \exists\, Q \in \mathbb{R}^{p \times p} \text{ with } Q^\top Q = Q Q^\top = I \text{ s.t. } Y = X Q.
\]
The equivalence class of $X$ is
\[
[X] := \{Y \in \mathbb{R}^{n \times p} : Y^\top Y = I, \ Y \sim X\}.
\]
Then, $\mathrm{Grass}(n, p)$ is a quotient manifold of $\mathrm{St}(n, p)$, i.e., $\mathrm{St}(n, p)/\!\sim$. Due to this equivalence, a tangent vector $\xi \in T_X \mathrm{Grass}(n, p)$ may have many different representations in its equivalence class. To find a unique representation, a horizontal space [42, Section 3.5.8] is introduced. For a given $X \in \mathbb{R}^{n \times p}$ with $X^\top X = I_p$, the horizontal space is
\[
\mathcal{H}_X = \{Z \in \mathbb{R}^{n \times p} : X^\top Z = 0\}.
\]
Here, the horizontal space plays the same role as the tangent space when computing the Riemannian gradient and Hessian. The projection onto the horizontal space is
\[
P_{\mathcal{H}_X \mathrm{Grass}(n, p)}(Z) = Z - X X^\top Z.
\]
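A quick numerical check of this projection on random data (the sizes are assumptions): the projected matrix lies in the horizontal space, and applying the projection twice changes nothing.

```python
# Check of the horizontal-space projection P(Z) = Z - X X^T Z on Grass(n, p).
import numpy as np

rng = np.random.default_rng(4)
n, p = 8, 3
X, _ = np.linalg.qr(rng.standard_normal((n, p)))   # representative with X^T X = I_p

def proj_h(X, Z):
    # Projection onto the horizontal space {Z : X^T Z = 0}
    return Z - X @ (X.T @ Z)

Z = rng.standard_normal((n, p))
PZ = proj_h(X, Z)
```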
where
\[
\hat{M} = M(\nabla^2 f(X)[H]; X), \quad
\hat{U}_p = U_p(\nabla^2 f(X)[H]; X) + P_U^{\perp} \nabla f(X) V_p(H; X) \Sigma^{-1},
\]
\[
\hat{V}_p = V_p(\nabla^2 f(X)[H]; X) + P_V^{\perp} \nabla f(X)^\top U_p(H; X) \Sigma^{-1}.
\]
• The set of symmetric positive definite matrices [44], i.e., $\mathrm{SPD}(n) = \{X \in \mathbb{R}^{n \times n} : X = X^\top, \ X \succ 0\}$, is a manifold. Its tangent space at $X$ is
\[
T_X \mathrm{SPD}(n) = \{Z \in \mathbb{R}^{n \times n} : Z = Z^\top\}.
\]
\[
T_Y \mathrm{HFrPSD}(n, r) = \{Z \in \mathbb{R}^{n \times r} : Z^\top Y = Y^\top Z\}.
\]
\[
P_{T_Y \mathrm{HFrPSD}(n,r)}(Z) = Z - Y \Omega,
\]
where the skew-symmetric matrix $\Omega$ is the unique solution of the Sylvester equation
\[
\Omega (Y^\top Y) + (Y^\top Y) \Omega = Y^\top Z - Z^\top Y.
\]
Given a function $f$ with respect to the Euclidean metric $g_Y(U, V) = \operatorname{tr}(U^\top V)$, $U, V \in T_Y \mathrm{HFrPSD}(n,r)$, its Riemannian gradient and Hessian can be represented by
\[
\operatorname{grad} f(Y) = \nabla f(Y), \quad
\operatorname{Hess} f(Y)[U] = P_{T_Y \mathrm{HFrPSD}(n,r)}(\nabla^2 f(Y)[U]), \quad U \in T_Y \mathrm{HFrPSD}(n,r).
\]
We next present the optimality conditions for manifold optimization problems of the following form:
\[
\begin{aligned}
\min_{x \in M} \ & f(x) \\
\text{s.t.} \ & c_i(x) = 0, \ i \in \mathcal{E} := \{1, \cdots, \ell\}, \\
& c_i(x) \geqslant 0, \ i \in \mathcal{I} := \{\ell + 1, \cdots, m\},
\end{aligned} \qquad (3.5)
\]
where $\mathcal{E}$ and $\mathcal{I}$ denote the index sets of equality and inequality constraints, respectively, and $c_i: M \to \mathbb{R}$, $i \in \mathcal{E} \cup \mathcal{I}$, are smooth functions on $M$. We mainly adopt the notions in [47]. Keeping the manifold constraint, the Lagrangian function of (3.5) is
\[
L(x, \lambda) = f(x) - \sum_{i \in \mathcal{E} \cup \mathcal{I}} \lambda_i c_i(x), \quad x \in M,
\]
where $\lambda_i$, $i \in \mathcal{E} \cup \mathcal{I}$, are the Lagrange multipliers. Note that the domain of $L$ is the manifold $M$. Let $\mathcal{A}(x) := \mathcal{E} \cup \{i \in \mathcal{I} : c_i(x) = 0\}$. Then the linear independence constraint qualification (LICQ) for problem (3.5) holds at $x$ if and only if the gradients $\operatorname{grad} c_i(x)$, $i \in \mathcal{A}(x)$, are linearly independent in $T_x M$.
Let $x^*$ and $\lambda_i^*$, $i \in \mathcal{E} \cup \mathcal{I}$, be a solution of the KKT conditions (3.6). Similar to the case without the manifold constraint, we define the critical cone $\mathcal{C}(x^*, \lambda^*)$ as
\[
w \in \mathcal{C}(x^*, \lambda^*) \ \Leftrightarrow \
\begin{cases}
w \in T_{x^*} M, \\
\langle \operatorname{grad} c_i(x^*), w \rangle = 0, & \forall i \in \mathcal{E}, \\
\langle \operatorname{grad} c_i(x^*), w \rangle = 0, & \forall i \in \mathcal{A}(x^*) \cap \mathcal{I} \text{ with } \lambda_i^* > 0, \\
\langle \operatorname{grad} c_i(x^*), w \rangle \geqslant 0, & \forall i \in \mathcal{A}(x^*) \cap \mathcal{I} \text{ with } \lambda_i^* = 0.
\end{cases}
\]
The second-order necessary condition at a KKT pair $(x^*, \lambda^*)$ is
\[
\langle \operatorname{Hess} L(x^*, \lambda^*)[w], w \rangle \geqslant 0, \quad \forall w \in \mathcal{C}(x^*, \lambda^*),
\]
while the second-order sufficient condition is
\[
\langle \operatorname{Hess} L(x^*, \lambda^*)[w], w \rangle > 0, \quad \forall w \in \mathcal{C}(x^*, \lambda^*), \ w \neq 0.
\]
Suppose now that we have only the manifold constraint, i.e., $\mathcal{E} \cup \mathcal{I}$ is empty. For a smooth function $f$ on the manifold $M$, the optimality conditions take a form similar to the Euclidean unconstrained case. Specifically, if $x^*$ is a first-order stationary point, then
\[
\operatorname{grad} f(x^*) = 0.
\]
If $x^*$ is a second-order stationary point, then
\[
\operatorname{grad} f(x^*) = 0, \quad \operatorname{Hess} f(x^*) \succeq 0.
\]
If $x^*$ satisfies
\[
\operatorname{grad} f(x^*) = 0, \quad \operatorname{Hess} f(x^*) \succ 0,
\]
then $x^*$ is a strict local minimum. For more details, we refer the reader to [47].
The Riemannian gradient method updates
\[
x_{k+1} = R_{x_k}(t_k \eta_k),
\]
where $\eta_k$ is a descent direction (e.g., $\eta_k = -\operatorname{grad} f(x_k)$) and $t_k$ is a well-chosen step size. Similar to line search methods in Euclidean space, the step size $t_k$ can be obtained by a curvilinear search on the manifold. Here, we take the Armijo search as an example. Given $\rho, \delta \in (0, 1)$ and an initial trial step size $\gamma_k > 0$, the monotone and non-monotone searches try to find the smallest nonnegative integer $h$ such that
\[
f(R_{x_k}(\delta^h \gamma_k \eta_k)) \leqslant f(x_k) + \rho \delta^h \gamma_k \langle \operatorname{grad} f(x_k), \eta_k \rangle_{x_k} \qquad (3.8)
\]
or
\[
f(R_{x_k}(\delta^h \gamma_k \eta_k)) \leqslant C_k + \rho \delta^h \gamma_k \langle \operatorname{grad} f(x_k), \eta_k \rangle_{x_k}, \qquad (3.9)
\]
respectively, where the reference value $C_k$ is updated by $Q_{k+1} = \theta Q_k + 1$ and $C_{k+1} = (\theta Q_k C_k + f(x_{k+1}))/Q_{k+1}$, with $C_0 = f(x_0)$, $Q_0 = 1$ and a parameter $\theta \in [0, 1)$ [67]. A popular choice of the initial step size $\gamma_k$ is a Riemannian Barzilai–Borwein (BB) step size, e.g.,
\[
\gamma_k = \frac{\langle s_{k-1}, s_{k-1} \rangle_{x_k}}{|\langle s_{k-1}, v_{k-1} \rangle_{x_k}|}, \qquad (3.10)
\]
where
\[
s_{k-1} = -t_{k-1} \cdot T_{x_{k-1} \to x_k}(\operatorname{grad} f(x_{k-1})), \quad v_{k-1} = \operatorname{grad} f(x_k) + t_{k-1}^{-1} \cdot s_{k-1},
\]
and $T_{x_{k-1} \to x_k}: T_{x_{k-1}} M \to T_{x_k} M$ denotes an appropriate vector transport mapping connecting $x_{k-1}$ and $x_k$; see [42,58]. When $M$ is a submanifold of a Euclidean space, the Euclidean differences $s_{k-1} = x_k - x_{k-1}$ and $v_{k-1} = \operatorname{grad} f(x_k) - \operatorname{grad} f(x_{k-1})$ are an alternative choice if the Euclidean inner product is used in (3.10). This choice is often attractive since no vector transport is needed [51,54]. We note that the differences between first- and second-order algorithms are mainly due to their specific ways of acquiring the direction $\xi_k$.
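Putting the pieces together, the sketch below runs a Riemannian gradient method on the Stiefel manifold with a QR retraction and monotone Armijo-type backtracking, for the assumed test objective $f(X) = \frac{1}{2}\operatorname{tr}(X^\top A X)$; all sizes, constants and the descent safeguard are illustrative choices, not the survey's exact algorithm.

```python
# Riemannian gradient descent on St(n, p) for f(X) = tr(X^T A X) / 2,
# with a QR retraction and monotone Armijo backtracking.
import numpy as np

rng = np.random.default_rng(5)
n, p = 8, 2
B = rng.standard_normal((n, n))
A = (B + B.T) / 2

def f(X):
    return 0.5 * np.trace(X.T @ A @ X)

def rgrad(X):
    G = A @ X                                   # Euclidean gradient
    return G - X @ (X.T @ G + G.T @ X) / 2      # project onto T_X St(n, p)

def retract(X, D):
    Q, R = np.linalg.qr(X + D)
    return Q * np.sign(np.diag(R))              # sign fix so retract(X, 0) == X

X, _ = np.linalg.qr(rng.standard_normal((n, p)))
hist = [f(X)]
for _ in range(200):
    G = rgrad(X)
    sq = np.sum(G * G)
    t = 1.0
    # backtrack until an Armijo-type decrease condition holds
    while f(retract(X, -t * G)) > hist[-1] - 1e-4 * t * sq and t > 1e-12:
        t *= 0.5
    Xn = retract(X, -t * G)
    if f(Xn) <= hist[-1]:                       # safeguard: accept only descent
        X = Xn
    hist.append(f(X))
```

By construction the sequence of objective values is non-increasing, and the iterates stay feasible up to rounding error.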
In practice, the computational cost and convergence behavior of different retraction operators differ a lot. Similarly, the vector transport plays an important role in CG methods and quasi-Newton methods (introduced later). There are many studies on retraction operators and vector transports. Here, we take the Stiefel manifold $\mathrm{St}(n, p)$ as an example and introduce several different retraction operators at the current point $X$ for a given step size $\tau$ and descent direction $-D$.
• Exponential map [59]
\[
R_X^{\mathrm{geo}}(-\tau D) = [X, Q] \exp\left( \tau \begin{bmatrix} -X^\top D & -R^\top \\ R & 0 \end{bmatrix} \right) \begin{bmatrix} I_p \\ 0 \end{bmatrix},
\]
where $Q R$ is the compact QR factorization of $-(I - X X^\top) D$.
• Cayley transform. For a skew-symmetric matrix $W_{\eta_X}$ with a low-rank factorization $W_{\eta_X} = U V^\top$, $U, V \in \mathbb{R}^{n \times 2p}$, the retraction can be computed as
\[
R_X^{\mathrm{wy}}(-\tau D) = X - \tau U \Big( I_{2p} + \frac{\tau}{2} V^\top U \Big)^{-1} V^\top X, \qquad (3.11)
\]
and the associated vector transport is
\[
T_{\eta_X}^{\mathrm{wy}}(\xi_X) = \Big( I - \frac{1}{2} W_{\eta_X} \Big)^{-1} \Big( I + \frac{1}{2} W_{\eta_X} \Big) \xi_X, \quad W_{\eta_X} = P_X \eta_X X^\top - X \eta_X^\top P_X,
\]
where $P_X = I - \frac{1}{2} X X^\top$.
• Polar decomposition
\[
R_X^{\mathrm{pd}}(-\tau D) = (X - \tau D) \left( I_p + \tau^2 D^\top D \right)^{-1/2}.
\]
The computational cost is lower than the Cayley transform, but the Cayley transform may give a better approximation to the exponential map [60]. The associated vector transport is then defined as [61]
\[
T_{\eta_X} \xi_X = Y \Omega + (I - Y Y^\top) \xi_X \left( Y^\top (X + \eta_X) \right)^{-1},
\]
where $Y = R_X^{\mathrm{pd}}(\eta_X)$ and $\Omega$ is a suitable skew-symmetric matrix (see [61]).
• QR decomposition
\[
R_X^{\mathrm{qr}}(-\tau D) = \mathrm{qr}(X - \tau D).
\]
It can be seen as an approximation of the polar decomposition. The main cost is the QR decomposition of an $n$-by-$p$ matrix. The associated vector transport is given in [42, Example 8.1.5].
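The QR and polar retractions can also be compared numerically. Both keep the iterate feasible and, being first-order retractions, agree with the first-order step $X - \tau D$ up to $O(\tau^2)$; the data and sizes below are assumptions.

```python
# Compare the QR retraction with the polar-decomposition retraction on St(n, p).
import numpy as np

rng = np.random.default_rng(6)
n, p, tau = 10, 3, 1e-3
X, _ = np.linalg.qr(rng.standard_normal((n, p)))
D = rng.standard_normal((n, p))
D = D - X @ (X.T @ D + D.T @ X) / 2             # make D tangent at X

def retract_qr(X, V):
    Q, R = np.linalg.qr(X + V)
    return Q * np.sign(np.diag(R))              # continuous choice of Q factor

def retract_polar(X, V):
    U, _, Vt = np.linalg.svd(X + V, full_matrices=False)
    return U @ Vt                               # polar (orthogonal) factor of X + V

Y_qr = retract_qr(X, -tau * D)
Y_pd = retract_polar(X, -tau * D)
```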
Recently, these retractions have also been used to design neural network structures and to solve deep learning tasks [62,63]. The vector transports above require an associated retraction. To remove this dependence, a new class of vector transports is introduced in [64]. Specifically, a jointly smooth operator $L(x, y): T_x M \to T_y M$ is defined, where $L(x, x)$ is required to be the identity for all $x$. For a $d$-dimensional submanifold $M$ of an $n$-dimensional Euclidean space, two popular vector transports are defined via the projection operator [42, Section 8.1.3].
Then, every accumulation point $x_*$ of the sequence $\{x_k\}$ is a stationary point of problem (1.1), i.e., it holds that $\operatorname{grad} f(x_*) = 0$.
Proof At first, by using $\langle \operatorname{grad} f(x_k), \eta_k \rangle_{x_k} = -\|\operatorname{grad} f(x_k)\|_{x_k}^2 < 0$ and applying [67, Lemma 1.1], we have $f(x_k) \leqslant C_k$ and $x_k \in \{x \in M : f(x) \leqslant f(x_0)\}$ for all $k \in \mathbb{N}$.
Next, due to the descent property $\langle \operatorname{grad} f(x_k), \eta_k \rangle_{x_k} < 0$, there always exists a positive step size $t_k \in (0, \gamma_k]$ satisfying the monotone and non-monotone Armijo conditions (3.8) and (3.9), respectively. Now, let $x_* \in M$ be an arbitrary accumulation point of $\{x_k\}$ and let $\{x_k\}_K$ be a corresponding subsequence that converges to $x_*$. By the definition of $C_{k+1}$ and (3.8), we have
\[
C_{k+1} = \frac{\theta Q_k C_k + f(x_{k+1})}{Q_{k+1}} < \frac{(\theta Q_k + 1) C_k}{Q_{k+1}} = C_k,
\]
where $\theta \in [0, 1)$ is the non-monotone averaging parameter and $Q_{k+1} = \theta Q_k + 1$ [67].
Hence, $\{C_k\}$ is monotonically decreasing and converges to some limit $\bar{C} \in \mathbb{R} \cup \{-\infty\}$. Using $f(x_k) \to f(x_*)$ for $K \ni k \to \infty$, we can infer $\bar{C} \in \mathbb{R}$ and thus we obtain
\[
\infty > C_0 - \bar{C} = \sum_{k=0}^{\infty} (C_k - C_{k+1}) \geqslant \sum_{k=0}^{\infty} \frac{\rho\, t_k \|\operatorname{grad} f(x_k)\|_{x_k}^2}{Q_{k+1}}.
\]
Due to $Q_{k+1} = 1 + \theta Q_k = 1 + \theta + \theta^2 Q_{k-1} = \cdots = \sum_{i=0}^{k+1} \theta^i < (1 - \theta)^{-1}$, this implies $\{t_k \|\operatorname{grad} f(x_k)\|_{x_k}^2\} \to 0$. Let us now assume $\operatorname{grad} f(x_*) \neq 0$. In this case, $\{t_k\}_K \to 0$, so the Armijo condition must fail at the larger trial step size $t_k/\delta$, i.e.,
\[
f(R_{x_k}((t_k/\delta) \eta_k)) > C_k + \rho (t_k/\delta) \langle \operatorname{grad} f(x_k), \eta_k \rangle_{x_k} \qquad (3.12)
\]
for all $k \in K$ sufficiently large. Since the sequence $\{\eta_k\}_K$ is bounded, the rest of the proof is identical to the proof of [42, Theorem 4.3.1]. In particular, applying the mean value theorem in (3.12) and using the continuity of the Riemannian metric, we can easily derive a contradiction. We refer to [42] for more details.
A gradient-type algorithm is usually fast in the early iterations, but it often slows down or even stagnates when the iterates are close to an optimal solution. When high accuracy is required, second-order-type algorithms may have an advantage.
By utilizing the exact Riemannian Hessian and different retraction operators, Riemannian Newton methods, trust-region methods and adaptive regularized Newton methods have been proposed in [42,51,68,69]. When second-order information is not available, quasi-Newton-type methods become necessary. As in the Riemannian CG method, we need the vector transport operator to compare tangent vectors from different tangent spaces. In addition, extra restrictions on the vector transport and the retraction are required for better convergence properties or even convergence itself [61,64,70–74]. A non-vector-transport-based quasi-Newton method is also explored in [75].
The Riemannian trust-region method solves the subproblem
\[
\min_{\xi \in T_{x_k} M} \ m_k(\xi) := \langle \operatorname{grad} f(x_k), \xi \rangle_{x_k} + \frac{1}{2} \langle \operatorname{Hess} f(x_k)[\xi], \xi \rangle_{x_k} \quad \text{s.t.} \quad \|\xi\|_{x_k} \leqslant \Delta_k, \qquad (3.13)
\]
where $\Delta_k$ is the trust-region radius. In [76], extensive methods for solving (3.13) are summarized. Among them, the Steihaug CG method, also named the truncated CG method, is the most popular due to its good properties and relatively cheap computational cost. By solving this trust-region subproblem, we obtain a direction $\xi_k \in T_{x_k} M$ satisfying the so-called Cauchy decrease. Then, a trial point is computed as $z_k = R_{x_k}(\xi_k)$, where the step size is chosen as 1. To determine the acceptance of $z_k$, we compute the ratio between the actual reduction and the predicted reduction,
\[
\rho_k = \frac{f(x_k) - f(z_k)}{m_k(0) - m_k(\xi_k)}.
\]
When $\rho_k$ is greater than some given parameter $0 < \eta_1 < 1$, $z_k$ is accepted; otherwise, $z_k$ is rejected. To avoid the algorithm stagnating at some feasible point and to promote efficiency, the trust-region radius is also updated based on $\rho_k$. The full algorithm is presented in Algorithm 2.
For global convergence, the following assumptions are necessary for second-order-type algorithms on manifolds.
Assumption 3.4 (a) The function $f$ is continuously differentiable and bounded from below on the level set $\{x \in M : f(x) \leqslant f(x_0)\}$.
(b) There exists a constant $\beta_{\mathrm{Hess}} > 0$ such that $\|\operatorname{Hess} f(x_k)\| \leqslant \beta_{\mathrm{Hess}}$ for all $k$.
Assumption 3.5 There exist two constants $\beta_{\mathrm{RL}} > 0$ and $\delta_{\mathrm{RL}} > 0$ such that for all $x \in M$ and $\xi \in T_x M$ with $\|\xi\| = 1$,
\[
\Big| \frac{\mathrm{d}}{\mathrm{d} t} f \circ R_x(t\xi) \Big|_{t=\tau} - \frac{\mathrm{d}}{\mathrm{d} t} f \circ R_x(t\xi) \Big|_{t=0} \Big| \leqslant \tau \beta_{\mathrm{RL}}, \quad \forall \tau \leqslant \delta_{\mathrm{RL}}.
\]
Then, the global convergence to a stationary point [42, Theorem 7.4.2] is presented as follows:
Theorem 3.6 Let $\{x_k\}$ be a sequence generated by Algorithm 2. Suppose that Assumptions 3.4 and 3.5 hold. Then
\[
\liminf_{k \to \infty} \|\operatorname{grad} f(x_k)\| = 0.
\]
For a submanifold, the Riemannian Hessian can be written as
\[
\operatorname{Hess} f(x_k)[U] = P_{T_{x_k} M}(\nabla^2 f(x_k)[U]) + \mathcal{W}_{x_k}(U, P^{\perp}_{T_{x_k} M}(\nabla f(x_k))),
\]
where $U \in T_{x_k} M$, $P^{\perp}_{T_{x_k} M} := I - P_{T_{x_k} M}$ is the projection onto the normal space, and the Weingarten map $\mathcal{W}_x(\cdot, v)$ with $v \in T^{\perp}_{x_k} M$ is a symmetric linear operator related to the second fundamental form of $M$. To solve (3.15), a modified CG method is proposed in [51] to solve the Riemannian Newton equation at $x_k$,
\[
\operatorname{Hess} \hat{m}_k(x_k)[\xi] = -\operatorname{grad} \hat{m}_k(x_k). \qquad (3.16)
\]
Since $\operatorname{Hess} \hat{m}_k(x_k)$ may not be positive definite, CG may be terminated when a direction of negative curvature, say $d_k$, is encountered. Different from the truncated CG method used in RTR, a linear combination of $s_k$ (the output of the truncated CG method) and the negative curvature direction $d_k$ is used to construct a descent direction:
\[
\xi_k = \begin{cases} s_k + \tau_k d_k, & \text{if } d_k \neq 0, \\ s_k, & \text{if } d_k = 0, \end{cases} \quad \text{with} \quad \tau_k := \frac{\langle d_k, \operatorname{grad} \hat{m}_k(x_k) \rangle_{x_k}}{\langle d_k, \operatorname{Hess} \hat{m}_k(x_k)[d_k] \rangle_{x_k}}. \qquad (3.17)
\]
\[
\hat{\rho}_k = \frac{f(z_k) - f(x_k)}{\hat{m}_k(z_k)}. \qquad (3.18)
\]
If $\hat{\rho}_k \geqslant \eta_1 > 0$, then the iteration is successful and we set $x_{k+1} = z_k$; otherwise, the iteration is not successful and we set $x_{k+1} = x_k$, i.e.,
\[
x_{k+1} = \begin{cases} z_k, & \text{if } \hat{\rho}_k \geqslant \eta_1, \\ x_k, & \text{otherwise}. \end{cases} \qquad (3.19)
\]
where $0 < \eta_1 \leqslant \eta_2 < 1$ and $0 < \gamma_0 < 1 < \gamma_1 \leqslant \gamma_2$. These parameters determine how aggressively the regularization parameter is adjusted when an iteration is successful or unsuccessful. Putting these features together, we obtain Algorithm 4, which is dubbed ARNT.
Algorithm 4: An Adaptive Regularized Newton Method
Step 1 Choose a feasible initial point x0 ∈ M and an initial regularization
parameter σ0 > 0. Choose 0 < η1 η2 < 1, 0 < γ0 < 1 < γ1 γ2 .
Set k := 0.
Step 2 while stopping conditions not met do
Step 3 Compute a new trial point z k by doing Armijo search along ξk obtained by Algorithm 3.
Step 4 Compute the ratio ρ̂k via (3.18).
Step 5 Update xk+1 from the trial point z k based on (3.19).
Step 6 Update σk according to (3.20).
Step 7 k ← k + 1.
We next present the convergence properties of Algorithm 4 with an inexact Euclidean Hessian, starting from a few assumptions.
Assumption 3.7 Let $\{x_k\}$ be generated by Algorithm 4 with inexact Euclidean Hessians $H_k$.
(A.1) The gradient $\nabla f$ is Lipschitz continuous on the convex hull of the manifold $M$, denoted by $\operatorname{conv}(M)$, i.e., there exists $L_f > 0$ such that
\[
\|\nabla f(x) - \nabla f(y)\| \leqslant L_f \|x - y\|, \quad \forall x, y \in \operatorname{conv}(M).
\]
(A.4) The inexact Hessians $H_k$ are uniformly bounded on the tangent spaces: there exist constants $0 < c_1 \leqslant c_2$ such that
\[
c_1 \|\xi\|_{x_k}^2 \leqslant \langle H_k[\xi], \xi \rangle_{x_k} \leqslant c_2 \|\xi\|_{x_k}^2, \quad \xi \in T_{x_k} M,
\]
for all $k \in \mathbb{N}$.
We note that assumptions (A.2) and (A.4) hold if $f$ is continuously differentiable and the level set $\{x \in M : f(x) \leqslant f(x_0)\}$ is compact.
The global convergence to a stationary point can be obtained.
Theorem 3.8 Suppose that Assumptions 3.4 and 3.7 hold. Then, either $\operatorname{grad} f(x_k) = 0$ for some $k \geqslant 0$, or $\liminf_{k \to \infty} \|\operatorname{grad} f(x_k)\|_{x_k} = 0$.
Quasi-Newton methods update an approximation $B_{k+1}$ of the Riemannian Hessian via the secant equation
\[
B_{k+1} s_k = y_k,
\]
where $s_k = T_{S_{\alpha_k \xi_k}} \alpha_k \xi_k$ and $y_k = \beta_k^{-1} \operatorname{grad} f(x_{k+1}) - T_{S_{\alpha_k \xi_k}} \operatorname{grad} f(x_k)$ with a parameter $\beta_k$. Here, $\alpha_k$ and $\xi_k$ are the step size and the direction used in the $k$th iteration. $T_S$ is an isometric vector transport operator associated with the differentiated retraction $R$, i.e.,
\[
\langle T_{S_{\xi_x}} u_x, T_{S_{\xi_x}} v_x \rangle_{R_x(\xi_x)} = \langle u_x, v_x \rangle_x.
\]
One choice of $\beta_k$ is
\[
T_{S_{\xi_k}} \xi_k = \beta_k T_{R_{\xi_k}} \xi_k, \quad \beta_k = \frac{\|\xi_k\|_{x_k}}{\|T_{R_{\xi_k}} \xi_k\|_{R_{x_k}(\xi_k)}},
\]
where $T_{R_{\xi_k}} \xi_k = \frac{\mathrm{d}}{\mathrm{d} t} R_{x_k}(t \xi_k) \big|_{t=1}$. Then, the scheme of the Riemannian BFGS is
\[
B_{k+1} = \hat{B}_k - \frac{\hat{B}_k s_k (\hat{B}_k s_k)^{\flat}}{(\hat{B}_k s_k)^{\flat} s_k} + \frac{y_k y_k^{\flat}}{y_k^{\flat} s_k}, \qquad (3.21)
\]
where $a^{\flat}: T_x M \to \mathbb{R}: v \mapsto \langle a, v \rangle_x$ and $\hat{B}_k = T_{S_{\alpha_k \xi_k}} \circ B_k \circ T_{S_{\alpha_k \xi_k}}^{-1}$ maps $T_{x_{k+1}} M$ to $T_{x_{k+1}} M$. With this choice of $\beta_k$ and the isometric property of $T_S$, we can guarantee the positive definiteness of $B_{k+1}$. After obtaining the new approximation $B_{k+1}$, the Riemannian BFGS method solves the linear system
\[
B_{k+1} \xi = -\operatorname{grad} f(x_{k+1})
\]
for the next search direction.
Recall that for a submanifold, the Riemannian Hessian has the form
\[
\operatorname{Hess} f(x_k)[U] = P_{T_{x_k} M}(\nabla^2 f(x_k)[U]) + \mathcal{W}_{x_k}(U, P^{\perp}_{T_{x_k} M}(\nabla f(x_k))), \quad U \in T_{x_k} M,
\]
where the second term $\mathcal{W}_{x_k}(U, P^{\perp}_{T_{x_k} M}(\nabla f(x_k)))$ is often much cheaper than the first term $P_{T_{x_k} M}(\nabla^2 f(x_k)[U])$. Similar to the quasi-Newton methods in unconstrained
nonlinear least squares problems [78], [79, Chapter 7], we can focus on the construction of an approximation of the Euclidean Hessian $\nabla^2 f(x_k)$ and use exact formulations of the remaining parts. Furthermore, if the Euclidean Hessian itself consists of cheap and expensive parts, i.e.,
\[
\nabla^2 f(x_k) = H_c(x_k) + H_e(x_k), \qquad (3.22)
\]
where the computational cost of $H_e(x_k)$ is much more expensive than that of $H_c(x_k)$, an approximation of $\nabla^2 f(x_k)$ can be constructed as
\[
H_k = H_c(x_k) + C_k. \qquad (3.23)
\]
To explain the differences between the two quasi-Newton algorithms more straightforwardly, we take the HF total energy minimization problem (2.10) as an example. From the calculation in [75], we have the Euclidean gradients
\[
\nabla E_{\mathrm{ks}}(X) = H_{\mathrm{ks}}(X) X, \quad \nabla E_{\mathrm{hf}}(X) = H_{\mathrm{hf}}(X) X,
\]
where $H_{\mathrm{ks}}(X) := \frac{1}{2} L + V_{\mathrm{ion}} + \sum_l \zeta_l w_l w_l^* + \mathrm{Diag}(L^\dagger \rho) + \mathrm{Diag}(\mu_{xc}(\rho)^* e)$ and $H_{\mathrm{hf}}(X) = H_{\mathrm{ks}}(X) + \mathcal{V}(X X^*)$. The Euclidean Hessians of $E_{\mathrm{ks}}$ and $E_{\mathrm{f}}$ along a matrix $U \in \mathbb{C}^{n \times p}$ are
\[
\nabla^2 E_{\mathrm{ks}}(X)[U] = H_{\mathrm{ks}}(X) U + \mathrm{Diag}\Big( \Big( L^\dagger + \frac{\partial^2 \epsilon_{xc}}{\partial \rho^2} e \Big) (\bar{X} \odot U + X \odot \bar{U}) e \Big) X,
\]
\[
\nabla^2 E_{\mathrm{f}}(X)[U] = \mathcal{V}(X X^*) U + \mathcal{V}(X U^* + U X^*) X.
\]
A structured approximation of the HF Hessian then keeps the cheap KS part exact and approximates only the expensive Fock part:
\[
H_k = \nabla^2 E_{\mathrm{ks}} + C_k.
\]
For problems arising from machine learning, the objective function $f$ is often a summation of a finite number of functions $f_i$, $i = 1, \cdots, m$, namely,
\[
f(x) = \sum_{i=1}^{m} f_i(x).
\]
For unconstrained problems, there are many efficient algorithms, such as Adam, Adagrad, RMSProp, Adadelta and SVRG; one can refer to [80]. For the case with manifold constraints, these algorithms can be well generalized by combining them with retraction operators and vector transport operators. However, in implementations, due to the computational costs of the different parts, they may take different forms. The Riemannian stochastic gradient method was first developed in [81]. Later, a class of first-order methods and their accelerated versions were investigated for geodesically convex optimization in [82,83]. With the help of parallel translation or vector transport, Riemannian SVRG methods are generalized in [84,85]. In consideration of the computational cost of the vector transport, a non-vector-transport-based Riemannian SVRG is proposed in [86]. Since an intrinsic coordinate system is absent, coordinate-wise updates on manifolds require further investigation. A compromise approach for Riemannian adaptive optimization methods on product manifolds is presented in [87].
Here, the SVRG algorithm [86] is taken as an example. In the s-th outer iteration, we first calculate the full gradient ∇f(X_{s,0}) at the snapshot point X_{s,0}; then, at the current point X_{s,k}, we randomly sample an index i_{s,k} from {1, ..., m} and use it to construct a stochastic gradient with reduced variance,

G(X_{s,k}, ξ_{s,k}) = ∇f(X_{s,0}) + ∇f_{i_{s,k}}(X_{s,k}) − ∇f_{i_{s,k}}(X_{s,0}),

and finally move along this direction with a given step size to the next iterate.
For Riemannian SVRG [86], after the stochastic gradient with reduced variance is obtained in the Euclidean sense, it is first projected onto the tangent space when M is a submanifold. We note that the tangent space should be replaced by the horizontal space when M is a quotient manifold. Then, a retraction step is executed to obtain the next feasible point. The detailed version is outlined in Algorithm 7.
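As a toy illustration of these ideas (a sketch, not the exact Algorithm 7), the following code runs a vector-transport-free Riemannian SVRG on the unit sphere for a finite-sum eigenvector problem, written in averaged form f(x) = (1/m) Σ_i f_i(x) with f_i(x) = −(a_iᵀx)² so that the sampled gradient is unbiased; the data, step size and iteration counts are all illustrative choices, not taken from [86].

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 10
A = rng.standard_normal((m, n))        # rows a_i; f_i(x) = -(a_i^T x)^2, f = (1/m) sum_i f_i

def grad_i(x, i):                      # Euclidean gradient of f_i
    return -2.0 * (A[i] @ x) * A[i]

def full_grad(x):                      # Euclidean gradient of f
    return -2.0 / m * A.T @ (A @ x)

def proj(x, g):                        # projection onto the tangent space of the sphere
    return g - (x @ g) * x

def retract(x, d):                     # retraction: move in the tangent direction, renormalize
    y = x + d
    return y / np.linalg.norm(y)

x = rng.standard_normal(n)
x /= np.linalg.norm(x)
step = 1e-2
for s in range(30):                    # outer iterations
    x0 = x.copy()                      # snapshot X_{s,0}
    mu = full_grad(x0)                 # full gradient at the snapshot
    for k in range(m):                 # inner iterations
        i = rng.integers(m)            # sample an index i_{s,k}
        g = mu + grad_i(x, i) - grad_i(x0, i)   # variance-reduced gradient
        x = retract(x, -step * proj(x, g))      # project, then retract (no vector transport)
```

Minimizing f over the sphere amounts to finding the dominant eigenvector of the sample covariance AᵀA/m, which the iterates approach.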
As shown in Sects. 2.11 to 2.15, many practical problems have a non-smooth objective function together with manifold constraints, i.e., they take the form

min_{x∈M} f(x) := g(x) + h(x),

where g is smooth and h is convex but possibly non-smooth. A Riemannian proximal gradient method [97] computes a search direction at the iterate x_k by solving the subproblem
min_d ⟨grad g(x_k), d⟩ + (1/(2t))‖d‖²_F + h(x_k + d)  s.t. d ∈ T_{x_k}M,    (3.24)
where t > 0 is a step size and M denotes the Stiefel manifold. Given a retraction R, problem (3.24) can be seen as a first-order approximation of f(R_{x_k}(d)) near the zero element 0_{x_k} of T_{x_k}M: by the Lipschitz continuity of h and the definition of R, the model error is of order O(‖d‖²). The next step is then to solve (3.24). Since (3.24) is convex with linear constraints, the KKT conditions are necessary and sufficient for global optimality. Specifically, we have
By a line search along d_k with R_{x_k}, a decrease of f is guaranteed and the global convergence is established.
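To make the scheme concrete, here is a toy sketch for sparse PCA on the unit sphere, min −xᵀAx + λ‖x‖₁. The subproblem (3.24) is only solved approximately here, by one ambient proximal step followed by a tangent-space projection, a crude stand-in for the dedicated subproblem solvers used in [97]; the data matrix and all parameters are illustrative.

```python
import numpy as np

def prox_l1(v, t):
    """Soft-thresholding: the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

rng = np.random.default_rng(0)
n, lam, t = 20, 0.1, 0.05                # dimension, sparsity weight, step size
M = rng.standard_normal((n, n))
A = (M + M.T) / 2                        # symmetric data matrix

def f(x):                                # non-smooth objective g(x) + h(x)
    return -x @ A @ x + lam * np.sum(np.abs(x))

x = rng.standard_normal(n)
x /= np.linalg.norm(x)
f0 = f(x)
for _ in range(300):
    eg = -2.0 * A @ x                    # Euclidean gradient of g(x) = -x^T A x
    rg = eg - (x @ eg) * x               # Riemannian gradient on the sphere
    y = prox_l1(x - t * rg, t * lam)     # ambient proximal gradient step
    d = y - x
    d -= (x @ d) * x                     # project the direction onto T_x M
    x = (x + d) / np.linalg.norm(x + d)  # retraction: normalize
```

The iterate stays feasible by construction, and the soft-thresholding step promotes small entries of x toward exact zeros.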
The complexity analysis of the Riemannian gradient method and the Riemannian trust-region method has been studied in [108]. Similar to Euclidean unconstrained optimization, the Riemannian gradient method (with a fixed step size or Armijo curvilinear search) reaches a point with ‖grad f(x)‖_x ⩽ ε in at most O(1/ε²) steps. Under mild assumptions, a modified Riemannian trust-region method reaches a point with ‖grad f(x)‖_x ⩽ ε and Hess f(x) ⪰ −√ε I in at most O(max{1/ε^{1.5}, 1/ε^{2.5}}) iterations. For objective functions with multi-block convex but non-smooth terms, an ADMM with complexity O(1/ε⁴) is proposed in [105]. For cubic regularization methods on Riemannian manifolds, recent studies [109,110] show convergence to a point with ‖grad f(x)‖_x ⩽ ε and Hess f(x) ⪰ −√ε I with complexity O(1/ε^{1.5}).
For a convex function in the Euclidean space, any local minimum is also a global minimum. An interesting extension is the geodesic convexity of functions: a function defined on a manifold is said to be geodesically convex if it is convex along any geodesic. Similarly, a local minimum of a geodesically convex function on a manifold is also a global minimum. A natural question is then how to recognize geodesically convex functions.
Definition 4.1 Given a Riemannian manifold (M, g), a set K ⊂ M is called g-fully
geodesic, if for any p, q ∈ K, any geodesic γ pq is located entirely in K.
For example, the set {P ∈ S^n_{++} | det(P) = c} with a positive constant c is not convex in R^{n×n}, but it is a fully geodesic set [111] of the Riemannian manifold (S^n_{++}, g), where the Riemannian metric g at P is g_P(U, V) := tr(P^{-1}U P^{-1}V). Now we present the definition of the g-geodesically convex function.
Definition 4.2 Given a Riemannian manifold (M, g) and a g-fully geodesic set K ⊂ M, a function f : K → R is g-geodesically convex if for any p, q ∈ K and any geodesic γ_pq : [0, 1] → K connecting p and q, it holds that

f(γ_pq(t)) ⩽ (1 − t) f(p) + t f(q),  ∀ t ∈ [0, 1].
A g-geodesically convex function need not be convex in the Euclidean sense. For example, f(x) := (log x)², x ∈ R_+, is not convex in the Euclidean space, but it is convex with respect to the manifold (R_+, g), where g_x(u, v) := x^{-2}uv is the scalar case of the metric above.
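This can be checked numerically. The sketch below (endpoints chosen arbitrarily for illustration) evaluates f along the geodesic γ(t) = x^{1−t}y^t of the metric g_x(u, v) = x^{-2}uv and verifies convexity in t by discrete second differences, while the same function violates the Euclidean midpoint inequality.

```python
import numpy as np

x, y = 3.0, 30.0                        # endpoints (arbitrary illustrative choice)
t = np.linspace(0.0, 1.0, 101)
gamma = x ** (1 - t) * y ** t           # geodesic of (R_+, g) with g_x(u, v) = u v / x^2
f = np.log(gamma) ** 2                  # f(gamma(t)) = ((1-t) log x + t log y)^2, quadratic in t
d2 = f[:-2] - 2 * f[1:-1] + f[2:]       # discrete second differences along the geodesic
geodesically_convex = bool(np.all(d2 >= -1e-12))
mid = 0.5 * (x + y)                     # Euclidean midpoint test for ordinary convexity
euclidean_midpoint_ok = np.log(mid) ** 2 <= 0.5 * (np.log(x) ** 2 + np.log(y) ** 2)
```

Along the geodesic the composition is the quadratic ((1−t) log x + t log y)², hence convex in t; in the Euclidean sense the midpoint inequality fails for these endpoints.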
Therefore, for a specific function, it is of significant importance to define a proper
Riemannian metric to recognize the geodesic convexity. A natural problem is, given
In [112,113], several classical theoretical problems from KSDFT are studied. Under certain conditions, the equivalence between the KS energy minimization problem and the KS equation is established. In addition, a lower bound on the nonzero entries of the charge density is also derived. By treating the KS equation as a fixed point equation
with respect to a potential function, the Jacobian matrix is explicitly derived using the
spectral operator theory and the theoretical properties of the SCF method are analyzed.
It is proved that the second-order derivatives of the exchange-correlation energy are
uniformly bounded if the Hamiltonian has a sufficiently large eigenvalue gap. More-
over, SCF converges from any initial point and enjoys a local linear convergence rate.
Related results can be found in [22–24,56,114,115].
Specifically, considering the real case of the KS equation (2.11), we define the potential function

V := V(ρ) = L†ρ + μ_xc(ρ)ᵀe    (4.1)

and

H(V) := (1/2)L + Σ_l ζ_l w_l w_lᵀ + V_ion + Diag(V).    (4.2)

Then, we have H_ks(ρ) = H(V). From (2.11), X consists of the eigenvectors corresponding to the p smallest eigenvalues of H(V), and hence depends on V. Then, a fixed point mapping for V can be written as
where α is an appropriate step size. Under some mild assumptions, SCF converges
with a local linear convergence rate.
0 < α < 2/(2 − b₁).

Then, {V_k} converges to a solution of the KS equation (2.11), and its convergence rate is no worse than |1 − α| + α(1 − b₁).
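The SCF iteration with simple mixing can be sketched on a toy nonlinear eigenvalue problem. The Hamiltonian H(V) = A + Diag(V) with V = βρ below is only a stand-in for (4.1)–(4.2); the size, coupling β, and mixing weight α are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 3                            # matrix size, number of occupied states
beta, alpha = 0.05, 0.7                 # coupling strength, mixing (step size)
M = rng.standard_normal((n, n))
A = (M + M.T) / 2                       # fixed part of the Hamiltonian

def V_map(V):
    """One SCF substep: V -> V(rho(X(V)))."""
    H = A + np.diag(V)                  # Hamiltonian for the current potential
    _, X = np.linalg.eigh(H)
    X = X[:, :p]                        # eigenvectors of the p smallest eigenvalues
    rho = np.sum(X * X, axis=1)         # charge density diag(X X^T)
    return beta * rho                   # toy analogue of L^+ rho + mu_xc(rho)^T e

V = np.zeros(n)
for _ in range(300):
    V = (1 - alpha) * V + alpha * V_map(V)   # simple mixing: V + alpha (V_map(V) - V)
residual = np.linalg.norm(V - V_map(V))      # fixed point residual
```

With a weak coupling β and a decent eigenvalue gap at the p-th eigenvalue, the map is contractive and the mixed iteration converges linearly, matching the rate statement above.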
In the Euclidean space, a common way to escape a local minimum is to add white noise to the gradient flow, which leads to a stochastic differential equation of the form

dX(t) = −∇f(X(t)) dt + σ dB(t),

where σ > 0 controls the noise level and B(t) is the standard n-by-p Brownian motion. A generalized noisy gradient flow on the Stiefel manifold is investigated in [116]; it takes the form

dX(t) = −grad f(X(t)) dt + σ dB_M(t),

where B_M(t) is the Brownian motion on the manifold M := St(n, p). The construction of a Brownian motion is then given in an extrinsic form. Theoretically, the flow can converge to the global minima by assuming second-order continuity.
For community detection problems, a commonly used model is the degree-corrected stochastic block model (DCSBM). It assumes that there are no overlaps between nodes in different communities. Specifically, the node set [n] = {1, ..., n} is partitioned into k communities {C₁∗, ..., C_k∗} satisfying

C_a∗ ∩ C_b∗ = ∅ for a ≠ b,  and  ∪_{a=1}^k C_a∗ = [n].
Theorem 4.4 Define G_a = Σ_{i∈C_a∗} θ_i, H_a = Σ_{b=1}^k B_{ab} G_b, and f_i = H_a θ_i for i ∈ C_a∗. Let U∗ and Φ∗ be global optimal solutions of (2.15) and (2.14), respectively, and define Δ = U∗(U∗)ᵀ − Φ∗(Φ∗)ᵀ. Suppose that

max_{1⩽a<b⩽k} B_{ab}/(H_a H_b) + δ < λ < min_{1⩽a⩽k} B_{aa}/H_a² − δ

for some δ > 0. Then, with high probability, we have

‖Δ‖_{1,θ} ⩽ (C₀/δ)(1 + max_{1⩽a⩽k} (B_{aa}/H_a²)‖f‖₁)(√(n‖f‖₁) + n),

where the constant C₀ > 0 is independent of the problem scale and the parameter selection.
Consider the SDP relaxation (2.2) and the non-convex relaxation problem with low-rank constraints (2.3). If p ⩾ √(2n), the composition of a solution V∗ of (2.3), i.e., V∗ᵀV∗, is always an optimal solution of SDP (2.2) [117–119]. If p ⩾ √(2n), for almost all matrices C, problem (2.3) has a unique local minimum, and this minimum is also a global minimum of the original problem (2.1) [120]. The relationship between solutions of the two problems (2.2) and (2.3) is presented in [121]. Define SDP(C) = max{⟨C, X⟩ : X ⪰ 0, X_ii = 1, i ∈ [n]}. A point V ∈ Ob(p, n) is called an ε-approximate concave point of (2.3) if

⟨U, Hess f(V)[U]⟩ ⩽ ε‖U‖²_V,  ∀U ∈ T_V Ob(p, n),

where f(V) = ⟨C, VᵀV⟩. The following theorem [121, Theorem 1] gives the approximation quality of an ε-approximate concave point of (2.3):

tr(C VᵀV) ⩾ SDP(C) − (1/(p−1))(SDP(C) + SDP(−C)) − (n/2)ε.    (4.6)
(1/n)‖Qz‖₂ ⩽ ε.
C = {X ∈ S^n : A(X) = b, X ⪰ 0}.
Since M is non-convex, there may exist many non-global local minima of (4.8). It is claimed in [123] that each local minimum of (4.8) maps to a global minimum of (4.7) if p(p+1)/2 > m. By utilizing the optimality theory of manifold optimization, any second-order stationary point can be mapped to a global minimum of (4.7) under mild assumptions [124]. Note that (4.9) is generally not a manifold. When the dimension of the space spanned by {A₁Y, ..., A_mY}, denoted by rank A, is fixed for all Y, M_p defines a Riemannian manifold. Hence, we need the following assumptions.
Assumption 4.7 For a given p such that M_p is not empty, assume that at least one of the following conditions is satisfied:
(SDP.1) {A₁Y, ..., A_mY} are linearly independent in R^{n×p} for all Y ∈ M_p;
(SDP.2) {A₁Y, ..., A_mY} span a subspace of constant dimension in R^{n×p} for all Y in an open neighborhood of M_p in R^{n×p}.
By comparing the optimality conditions of (4.8) and the KKT conditions of (4.7), the
following equivalence between (4.7) and (4.8) is established in [124, Theorem 1.4].
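A minimal sketch of the low-rank factorization approach for the diagonal-constrained SDP max ⟨C, X⟩ s.t. X ⪰ 0, X_ii = 1: parametrize X = VᵀV with unit-norm columns v_i and run Riemannian gradient ascent on the oblique manifold. The random cost matrix, step size and iteration count are illustrative, and plain gradient ascent is only one simple choice of Riemannian solver.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 40, 10                           # p on the order of sqrt(2n) suffices in theory
M = rng.standard_normal((n, n))
C = (M + M.T) / 2                       # symmetric cost matrix

def normalize_cols(V):
    return V / np.linalg.norm(V, axis=0, keepdims=True)

def objective(V):
    return float(np.sum((V.T @ V) * C))  # tr(C V^T V) for symmetric C

V = normalize_cols(rng.standard_normal((p, n)))  # point on Ob(p, n)
obj0 = objective(V)
step = 1e-2
for _ in range(1000):
    G = 2.0 * V @ C                      # Euclidean gradient of tr(C V^T V)
    rg = G - V * np.sum(V * G, axis=0)   # project each column onto its tangent space
    V = normalize_cols(V + step * rg)    # ascent step + retraction (renormalize columns)
```

Feasibility (unit diagonal of VᵀV) is maintained exactly by the column normalization, and the objective improves toward the SDP value as the iteration proceeds.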
max_{O₁,...,O_n ∈ O_d} Σ_{i=1}^n Σ_{j=1}^n tr(C_{ij} O_i O_jᵀ),    (4.10)
V_i = P(X_i R),

where

α(d) := E[(1/d) Σ_{j=1}^d σ_j(Z)].
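The rounding step V_i = P(X_i R) can be sketched directly. Here polar computes the polar factor (the nearest orthogonal matrix), and the block data X_i and the Gaussian rounding matrix R are random placeholders rather than the output of an actual relaxation.

```python
import numpy as np

def polar(M):
    """Polar factor P(M): the orthogonal matrix nearest to M."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(0)
d, n = 3, 5
X = rng.standard_normal((n, d, d))     # blocks X_i of a relaxation solution (placeholder)
R = rng.standard_normal((d, d))        # common Gaussian rounding matrix
O = np.stack([polar(X[i] @ R) for i in range(n)])  # O_i = P(X_i R), each in O_d
```

Each rounded block is orthogonal by construction, so the output is feasible for (4.10).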
5 Conclusions
Acknowledgements The authors are grateful to the associate editor and two anonymous referees for their
detailed and valuable comments and suggestions.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which
permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give
appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence,
and indicate if changes were made. The images or other third party material in this article are included
in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If
material is not included in the article’s Creative Commons licence and your intended use is not permitted
by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the
copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
References
[1] Lai, R., Wen, Z., Yin, W., Gu, X., Lui, L.M.: Folding-free global conformal mapping for genus-0
surfaces by harmonic energy minimization. J. Sci. Comput. 58, 705–725 (2014)
[2] Schoen, R.M., Yau, S.-T.: Lectures on Harmonic Maps, vol. 2. American Mathematical Society,
Providence (1997)
[3] Simon, D., Abell, J.: A majorization algorithm for constrained correlation matrix approximation.
Linear Algebra Appl. 432, 1152–1164 (2010)
[4] Gao, Y., Sun, D.: A majorized penalty approach for calibrating rank constrained correlation matrix
problems, tech. report, National University of Singapore (2010)
[5] Waldspurger, I., d’Aspremont, A., Mallat, S.: Phase recovery, maxcut and complex semidefinite
programming. Math. Program. 149, 47–81 (2015)
[6] Cai, J.-F., Liu, H., Wang, Y.: Fast rank-one alternating minimization algorithm for phase retrieval.
J. Sci. Comput. 79, 128–147 (2019)
[7] Hu, J., Jiang, B., Liu, X., Wen, Z.: A note on semidefinite programming relaxations for polynomial
optimization over a single sphere. Sci. China Math. 59, 1543–1560 (2016)
[8] Singer, A., Shkolnisky, Y.: Three-dimensional structure determination from common lines in cryo-
em by eigenvectors and semidefinite programming. SIAM J. Imaging Sci. 4, 543–572 (2011)
[9] Liu, X., Wen, Z., Zhang, Y.: An efficient Gauss–Newton algorithm for symmetric low-rank product
matrix approximations. SIAM J. Optim. 25, 1571–1608 (2015)
[10] Liu, X., Wen, Z., Zhang, Y.: Limited memory block Krylov subspace optimization for computing
dominant singular value decompositions. SIAM J. Sci. Comput. 35, A1641–A1668 (2013)
[11] Wen, Z., Yang, C., Liu, X., Zhang, Y.: Trace-penalty minimization for large-scale eigenspace com-
putation. J. Sci. Comput. 66, 1175–1203 (2016)
[12] Wen, Z., Zhang, Y.: Accelerating convergence by augmented Rayleigh–Ritz projections for large-
scale eigenpair computation. SIAM J. Matrix Anal. Appl. 38, 273–296 (2017)
[13] Zhang, J., Wen, Z., Zhang, Y.: Subspace methods with local refinements for eigenvalue computation
using low-rank tensor-train format. J. Sci. Comput. 70, 478–499 (2017)
[14] Oja, E., Karhunen, J.: On stochastic approximation of the eigenvectors and eigenvalues of the expec-
tation of a random matrix. J. Math. Anal. Appl. 106, 69–84 (1985)
[15] Shamir, O.: A stochastic PCA and SVD algorithm with an exponential convergence rate. Int. Conf.
Mach. Learn. 144–152 (2015)
[16] Li, C.J., Wang, M., Liu, H., Zhang, T.: Near-optimal stochastic approximation for online principal
component estimation. Math. Program. 167, 75–97 (2018)
[17] Pulay, P.: Convergence acceleration of iterative sequences. The case of SCF iteration. Chem. Phys.
Lett. 73, 393–398 (1980)
[18] Pulay, P.: Improved SCF convergence acceleration. J. Comput. Chem. 3, 556–560 (1982)
[19] Toth, A., Ellis, J.A., Evans, T., Hamilton, S., Kelley, C., Pawlowski, R., Slattery, S.: Local improve-
ment results for Anderson acceleration with inaccurate function evaluations. SIAM J. Sci. Comput.
39, S47–S65 (2017)
[20] Zhang, X., Zhu, J., Wen, Z., Zhou, A.: Gradient type optimization methods for electronic structure
calculations. SIAM J. Sci. Comput. 36, C265–C289 (2014)
[21] Wen, Z., Milzarek, A., Ulbrich, M., Zhang, H.: Adaptive regularized self-consistent field iteration
with exact Hessian for electronic structure calculation. SIAM J. Sci. Comput. 35, A1299–A1324
(2013)
[22] Dai, X., Liu, Z., Zhang, L., Zhou, A.: A conjugate gradient method for electronic structure calcula-
tions. SIAM J. Sci. Comput. 39, A2702–A2740 (2017)
[23] Zhao, Z., Bai, Z.-J., Jin, X.-Q.: A Riemannian Newton algorithm for nonlinear eigenvalue problems.
SIAM J. Matrix Anal. Appl. 36, 752–774 (2015)
[24] Zhang, L., Li, R.: Maximization of the sum of the trace ratio on the Stiefel manifold, II: computation.
Sci. China Math. 58, 1549–1566 (2015)
[25] Gao, B., Liu, X., Chen, X., Yuan, Y.: A new first-order algorithmic framework for optimization
problems with orthogonality constraints. SIAM J. Optim. 28, 302–332 (2018)
[26] Lai, R., Lu, J.: Localized density matrix minimization and linear-scaling algorithms. J. Comput.
Phys. 315, 194–210 (2016)
[27] Ulbrich, M., Wen, Z., Yang, C., Klockner, D., Lu, Z.: A proximal gradient method for ensemble
density functional theory. SIAM J. Sci. Comput. 37, A1975–A2002 (2015)
[28] Jiang, B., Liu, Y.-F., Wen, Z.: L_p-norm regularization algorithms for optimization over permutation
matrices. SIAM J. Optim. 26, 2284–2313 (2016)
[29] Zhang, J., Liu, H., Wen, Z., Zhang, S.: A sparse completely positive relaxation of the modularity
maximization for community detection. SIAM J. Sci. Comput. 40, A3091–A3120 (2018)
[30] Cho, M., Lee, J.: Riemannian approach to batch normalization. Adv. Neural Inf. Process. Syst. 5225–
5235 (2017). https://papers.nips.cc/paper/7107-riemannian-approach-to-batch-normalization.pdf
[31] Jolliffe, I.T., Trendafilov, N.T., Uddin, M.: A modified principal component technique based on the
lasso. J. Comput. Graph. Stat. 12, 531–547 (2003)
[32] Wen, Z., Yin, W., Zhang, Y.: Solving a low-rank factorization model for matrix completion by a
nonlinear successive over-relaxation algorithm. Math. Program. Comput. 4, 333–361 (2012)
[33] Vandereycken, B.: Low-rank matrix completion by Riemannian optimization. SIAM J. Optim. 23,
1214–1236 (2013)
[34] Wei, K., Cai, J.-F., Chan, T.F., Leung, S.: Guarantees of Riemannian optimization for low rank matrix
recovery. SIAM J. Matrix Anal. Appl. 37, 1198–1222 (2016)
[35] Cambier, L., Absil, P.-A.: Robust low-rank matrix completion by Riemannian optimization. SIAM
J. Sci. Comput. 38, S440–S460 (2016)
[36] Zhang, Y., Lau, Y., Kuo, H.-w., Cheung, S., Pasupathy, A., Wright, J.: On the global geometry of
sphere-constrained sparse blind deconvolution. Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern
Recognit. 4894–4902 (2017)
[37] Zass, R., Shashua, A.: Nonnegative sparse PCA. Adv. Neural Inf. Process. Syst. 1561–1568 (2007).
https://papers.nips.cc/paper/3104-nonnegative-sparse-pca
[38] Montanari, A., Richard, E.: Non-negative principal component analysis: message passing algorithms
and sharp asymptotics. IEEE Trans. Inf. Theory 62, 1458–1484 (2016)
[39] Carson, T., Mixon, D.G., Villar, S.: Manifold optimization for K-means clustering. Int. Conf. Sampl.
Theory. Appl. SampTA 73–77. IEEE (2017)
[40] Liu, H., Cai, J.-F., Wang, Y.: Subspace clustering by (k, k)-sparse matrix factorization. Inverse Probl.
Imaging 11, 539–551 (2017)
[41] Xie, T., Chen, F.: Non-convex clustering via proximal alternating linearized minimization method.
Int. J. Wavelets Multiresolut. Inf. Process. 16, 1840013 (2018)
[42] Absil, P.-A., Mahony, R., Sepulchre, R.: Optimization Algorithms on Matrix Manifolds. Princeton
University Press, Princeton, NJ (2008)
[43] Absil, P.-A., Gallivan, K.A.: Joint diagonalization on the oblique manifold for independent compo-
nent analysis. Proc. IEEE Int. Conf. Acoust. Speech Signal Process 5, 945–958 (2006)
[44] Bhatia, R.: Positive Definite Matrices, vol. 24. Princeton University Press, Princeton (2009)
[45] Journée, M., Bach, F., Absil, P.-A., Sepulchre, R.: Low-rank optimization on the cone of positive
semidefinite matrices. SIAM J. Optim. 20, 2327–2351 (2010)
[46] Massart, E., Absil, P.-A.: Quotient geometry with simple geodesics for the manifold of fixed-rank
positive-semidefinite matrices. SIAM J. Matrix Anal. Appl. 41, 171–198 (2020)
[47] Yang, W.H., Zhang, L.-H., Song, R.: Optimality conditions for the nonlinear programming problems
on Riemannian manifolds. Pac. J. Optim. 10, 415–434 (2014)
[48] Gabay, D.: Minimizing a differentiable function over a differential manifold. J. Optim. Theory Appl.
37, 177–219 (1982)
[49] Smith, S.T.: Optimization techniques on Riemannian manifolds. Fields Institute Communications 3
(1994)
[50] Kressner, D., Steinlechner, M., Vandereycken, B.: Low-rank tensor completion by Riemannian opti-
mization. BIT Numer. Math. 54, 447–468 (2014)
[51] Hu, J., Milzarek, A., Wen, Z., Yuan, Y.: Adaptive quadratically regularized Newton method for
Riemannian optimization. SIAM J. Matrix Anal. Appl. 39, 1181–1207 (2018)
[52] Absil, P.-A., Malick, J.: Projection-like retractions on matrix manifolds. SIAM J. Optim. 22, 135–158
(2012)
[53] Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic
optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011)
[54] Wen, Z., Yin, W.: A feasible method for optimization with orthogonality constraints. Math. Program.
142, 397–434 (2013)
[55] Jiang, B., Dai, Y.-H.: A framework of constraint preserving update schemes for optimization on
Stiefel manifold. Math. Program. 153, 535–575 (2015)
[56] Zhu, X.: A Riemannian conjugate gradient method for optimization on the Stiefel manifold. Comput.
Optim. Appl. 67, 73–110 (2017)
[57] Siegel, J.W.: Accelerated optimization with orthogonality constraints, arXiv:1903.05204 (2019)
[58] Iannazzo, B., Porcelli, M.: The Riemannian Barzilai–Borwein method with nonmonotone line search
and the matrix geometric mean computation. IMA J. Numer. Anal. 00, 1–23 (2017)
[59] Edelman, A., Arias, T.A., Smith, S.T.: The geometry of algorithms with orthogonality constraints.
SIAM J. Matrix Anal. Appl. 20, 303–353 (1999)
[60] Nishimori, Y., Akaho, S.: Learning algorithms utilizing quasi-geodesic flows on the Stiefel manifold.
Neurocomputing 67, 106–135 (2005)
[61] Huang, W.: Optimization algorithms on Riemannian manifolds with applications, Ph.D. thesis, The
Florida State University (2013)
[62] Lezcano-Casado, M., Martínez-Rubio, D.: Cheap orthogonal constraints in neural networks: a simple
parametrization of the orthogonal and unitary group, arXiv:1901.08428 (2019)
[63] Li, J., Fuxin, L., Todorovic, S.: Efficient Riemannian optimization on the Stiefel manifold via the
Cayley transform, Conference arXiv:2002.01113 (2020)
[64] Huang, W., Gallivan, K.A., Absil, P.-A.: A Broyden class of quasi-Newton methods for Riemannian
optimization. SIAM J. Optim. 25, 1660–1685 (2015)
[65] Huang, W., Absil, P.-A., Gallivan, K.A.: Intrinsic representation of tangent vectors and vector trans-
ports on matrix manifolds. Numer. Math. 136, 523–543 (2017)
[66] Hu, J., Milzarek, A., Wen, Z., Yuan, Y.: Adaptive regularized Newton method for Riemannian
optimization, arXiv:1708.02016 (2017)
[67] Zhang, H., Hager, W.W.: A nonmonotone line search technique and its application to unconstrained
optimization. SIAM J. Optim. 14, 1043–1056 (2004)
[68] Udriste, C.: Convex Functions and Optimization Methods on Riemannian Manifolds, vol. 297.
Springer, Berlin (1994)
[69] Absil, P.-A., Baker, C.G., Gallivan, K.A.: Trust-region methods on Riemannian manifolds. Found.
Comput. Math. 7, 303–330 (2007)
[70] Qi, C.: Numerical optimization methods on Riemannian manifolds, Ph.D. thesis, Florida State Uni-
versity (2011)
[71] Ring, W., Wirth, B.: Optimization methods on Riemannian manifolds and their application to shape
space. SIAM J. Optim. 22, 596–627 (2012)
[72] Seibert, M., Kleinsteuber, M., Hüper, K.: Properties of the BFGS method on Riemannian manifolds.
Mathematical System Theory C Festschrift in Honor of Uwe Helmke on the Occasion of his Sixtieth
Birthday, pp. 395–412 (2013)
[73] Huang, W., Absil, P.-A., Gallivan, K.A.: A Riemannian symmetric rank-one trust-region method.
Math. Program. 150, 179–216 (2015)
[74] Huang, W., Absil, P.-A., Gallivan, K.: A Riemannian BFGS method without differentiated retraction
for nonconvex optimization problems. SIAM J. Optim. 28, 470–495 (2018)
[75] Hu, J., Jiang, B., Lin, L., Wen, Z., Yuan, Y.-X.: Structured quasi-Newton methods for optimization
with orthogonality constraints. SIAM J. Sci. Comput. 41, A2239–A2269 (2019)
[76] Nocedal, J., Wright, S.J.: Numerical Optimization. Springer Series in Operations Research and
Financial Engineering, 2nd edn. Springer, New York (2006)
[77] Wu, X., Wen, Z., Bao, W.: A regularized Newton method for computing ground states of Bose–
Einstein condensates. J. Sci. Comput. 73, 303–329 (2017)
[78] Kass, R.E.: Nonlinear regression analysis and its applications. J. Am. Stat. Assoc. 85, 594–596
(1990)
[79] Sun, W., Yuan, Y.: Optimization Theory and Methods: Nonlinear Programming, vol. 1. Springer,
Berlin (2006)
[80] LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521, 436 (2015)
[81] Bonnabel, S.: Stochastic gradient descent on Riemannian manifolds. IEEE Trans. Autom. Control.
58, 2217–2229 (2013)
[82] Zhang, H., Sra, S.: First-order methods for geodesically convex optimization, In: Conference on
Learning Theory, pp. 1617–1638 (2016)
[83] Liu, Y., Shang, F., Cheng, J., Cheng, H., Jiao, L.: Accelerated first-order methods for geodesically
convex optimization on Riemannian manifolds. Adv. Neural Inf. Process. Syst. 4868–4877 (2017)
[84] Zhang, H., Reddi, S.J., Sra, S.: Riemannian SVRG: fast stochastic optimization on Riemannian
manifolds. Adv. Neural Inf. Process. Syst. 4592–4600 (2016)
[85] Sato, H., Kasai, H., Mishra, B.: Riemannian stochastic variance reduced gradient algorithm with
retraction and vector transport. SIAM J. Optim. 29, 1444–1472 (2019)
[86] Jiang, B., Ma, S., So, A.M.-C., Zhang, S.: Vector transport-free svrg with general retraction for
Riemannian optimization: Complexity analysis and practical implementation, arXiv:1705.09059
(2017)
[87] Bécigneul, G., Ganea, O.-E.: Riemannian adaptive optimization methods, arXiv:1810.00760 (2018)
[88] Dirr, G., Helmke, U., Lageman, C.: Nonsmooth Riemannian optimization with applications to sphere
packing and grasping. In: Lagrangian and Hamiltonian Methods for Nonlinear Control 2006, pp.
29–45. Springer, Berlin (2007)
[89] Borckmans, P.B., Selvan, S.E., Boumal, N., Absil, P.-A.: A Riemannian subgradient algorithm for
economic dispatch with valve-point effect. J Comput. Appl. Math. 255, 848–866 (2014)
[90] Hosseini, S.: Convergence of nonsmooth descent methods via Kurdyka–Lojasiewicz inequality on
Riemannian manifolds, Hausdorff Center for Mathematics and Institute for Numerical Simulation,
University of Bonn (2015). https://ins.uni-bonn.de/media/public/publication-media/8_INS1523.
pdf
[91] Grohs, P., Hosseini, S.: Nonsmooth trust region algorithms for locally Lipschitz functions on Rie-
mannian manifolds. IMA J. Numer. Anal. 36, 1167–1192 (2015)
[92] Hosseini, S., Uschmajew, A.: A Riemannian gradient sampling algorithm for nonsmooth optimiza-
tion on manifolds. SIAM J. Optim. 27, 173–189 (2017)
[93] Bacák, M., Bergmann, R., Steidl, G., Weinmann, A.: A second order nonsmooth variational model
for restoring manifold-valued images. SIAM J. Sci. Comput. 38, A567–A597 (2016)
[94] de Carvalho Bento, G., da Cruz Neto, J.X., Oliveira, P.R.: A new approach to the proximal point
method: convergence on general Riemannian manifolds. J Optim. Theory Appl. 168, 743–755 (2016)
[95] Bento, G., Neto, J., Oliveira, P.: Convergence of inexact descent methods for nonconvex optimization
on Riemannian manifolds, arXiv:1103.4828 (2011)
[96] Bento, G.C., Ferreira, O.P., Melo, J.G.: Iteration-complexity of gradient, subgradient and proximal
point methods on Riemannian manifolds. J Optim. Theory Appl. 173, 548–562 (2017)
[97] Chen, S., Ma, S., So, A.M.-C., Zhang, T.: Proximal gradient method for nonsmooth optimization
over the Stiefel manifold. SIAM J. Optim. 30, 210–239 (2019)
[98] Xiao, X., Li, Y., Wen, Z., Zhang, L.: A regularized semi-smooth Newton method with projection
steps for composite convex programs. J. Sci. Comput. 76, 1–26 (2018)
[99] Huang, W., Wei, K.: Riemannian proximal gradient methods, arXiv:1909.06065 (2019)
[100] Chen, S., Ma, S., Xue, L., Zou, H.: An alternating manifold proximal gradient method for sparse
PCA and sparse cca, arXiv:1903.11576 (2019)
[101] Huang, W., Wei, K.: Extending FISTA to Riemannian optimization for sparse PCA,
arXiv:1909.05485 (2019)
[102] Lai, R., Osher, S.: A splitting method for orthogonality constrained problems. J. Sci. Comput. 58,
431–449 (2014)
[103] Kovnatsky, A., Glashoff, K., Bronstein, M.M.: Madmm: a generic algorithm for non-smooth opti-
mization on manifolds. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) Computer Vision ECCV,
pp. 680–696. Springer, Berlin (2016)
[104] Wang, Y., Yin, W., Zeng, J.: Global convergence of admm in nonconvex nonsmooth optimization.
J. Sci. Comput. 78, 29–63 (2019)
[105] Zhang, J., Ma, S., Zhang, S.: Primal-dual optimization algorithms over Riemannian manifolds: an
iteration complexity analysis, arXiv:1710.02236 (2017)
[106] Birgin, E.G., Haeser, G., Ramos, A.: Augmented lagrangians with constrained subproblems and
convergence to second-order stationary points. Comput. Optim. Appl. 69, 51–75 (2018)
[107] Liu, C., Boumal, N.: Simple algorithms for optimization on Riemannian manifolds with constraints,
arXiv:1901.10000 (2019)
[108] Boumal, N., Absil, P.-A., Cartis, C.: Global rates of convergence for nonconvex optimization on
manifolds. IMA J. Numer. Anal. 39, 1–33 (2018)
[109] Zhang, J., Zhang, S.: A cubic regularized Newton’s method over Riemannian manifolds,
arXiv:1805.05565 (2018)
[110] Agarwal, N., Boumal, N., Bullins, B., Cartis, C.: Adaptive regularization with cubics on manifolds
with a first-order analysis, arXiv:1806.00065 (2018)
[111] Vishnoi, N.K.: Geodesic convex optimization: differentiation on manifolds, geodesics, and convexity,
arXiv:1806.06373 (2018)
[112] Liu, X., Wang, X., Wen, Z., Yuan, Y.: On the convergence of the self-consistent field iteration in
Kohn–Sham density functional theory. SIAM J. Matrix Anal. Appl. 35, 546–558 (2014)
[113] Liu, X., Wen, Z., Wang, X., Ulbrich, M., Yuan, Y.: On the analysis of the discretized Kohn-Sham
density functional theory. SIAM J. Numer. Anal. 53, 1758–1785 (2015)
[114] Cai, Y., Zhang, L.-H., Bai, Z., Li, R.-C.: On an eigenvector-dependent nonlinear eigenvalue problem.
SIAM J. Matrix Anal. Appl. 39, 1360–1382 (2018)
[115] Bai, Z., Lu, D., Vandereycken, B.: Robust Rayleigh quotient minimization and nonlinear eigenvalue
problems. SIAM J. Sci. Comput. 40, A3495–A3522 (2018)
[116] Yuan, H., Gu, X., Lai, R., Wen, Z.: Global optimization with orthogonality constraints via stochastic
diffusion on manifold. J. Sci. Comput. 80, 1139–1170 (2019)
[117] Barvinok, A.I.: Problems of distance geometry and convex properties of quadratic maps. Discrete
Comput. Geom. 13, 189–202 (1995)
[118] Pataki, G.: On the rank of extreme matrices in semidefinite programs and the multiplicity of optimal
eigenvalues. Math. Oper. Res. 23, 339–358 (1998)
[119] Burer, S., Monteiro, R.D.: A nonlinear programming algorithm for solving semidefinite programs
via low-rank factorization. Math. Program. 95, 329–357 (2003)
[120] Boumal, N., Voroninski, V., Bandeira, A.: The non-convex Burer–Monteiro approach works on
smooth semidefinite programs. In: Advances in Neural Information Processing Systems, pp. 2757–
2765 (2016). https://papers.nips.cc/paper/6517-the-non-convex-burer-monteiro-approach-works-
on-smooth-semidefinite-programs.pdf
[121] Mei, S., Misiakiewicz, T., Montanari, A., Oliveira, R.I.: Solving SDPs for synchronization and
maxcut problems via the Grothendieck inequality, arXiv:1703.08729 (2017)
[122] Bandeira, A.S., Boumal, N., Voroninski, V.: On the low-rank approach for semidefinite programs
arising in synchronization and community detection. Conf. Learn. Theor. 361–382 (2016)
[123] Burer, S., Monteiro, R.D.: Local minima and convergence in low-rank semidefinite programming.
Math. Program. 103, 427–444 (2005)
[124] Boumal, N., Voroninski, V., Bandeira, A.S.: Deterministic guarantees for Burer–Monteiro factoriza-
tions of smooth semidefinite programs, arXiv:1804.02008 (2018)
[125] Bandeira, A.S., Kennedy, C., Singer, A.: Approximating the little Grothendieck problem over the
orthogonal and unitary groups. Math. Program. 160, 433–475 (2016)