
Will Bilevel Optimizers Benefit from Loops

Kaiyi Ji∗, Mingrui Liu†, Yingbin Liang‡ and Lei Ying§


June 2, 2022
arXiv:2205.14224v3 [cs.LG] 1 Jun 2022

Abstract
Bilevel optimization has arisen as a powerful tool for solving a variety of machine learning problems.
Two current popular bilevel optimizers AID-BiO and ITD-BiO naturally involve solving one or two sub-
problems, and consequently, whether we solve these problems with loops (that take many iterations) or
without loops (that take only a few iterations) can significantly affect the overall computational efficiency.
Existing studies in the literature cover only some of those implementation choices, and the complexity
bounds available are not refined enough to enable rigorous comparison among different implementations.
In this paper, we first establish a unified convergence analysis for both AID-BiO and ITD-BiO that is
applicable to all implementation choices of loops. We then specialize our results to characterize the
computational complexity for all implementations, which enables an explicit comparison among them.
Our result indicates that for AID-BiO, the loop for estimating the optimal point of the inner function is
beneficial for overall efficiency, although it causes higher complexity for each update step, and the loop
for approximating the outer-level Hessian-inverse-vector product reduces the gradient complexity. For
ITD-BiO, the two loops always coexist, and our convergence upper and lower bounds show that such
loops are necessary to guarantee a vanishing convergence error, whereas the no-loop scheme suffers from
an unavoidable non-vanishing convergence error. Our numerical experiments further corroborate our
theoretical results.

1 Introduction
Bilevel optimization has attracted significant attention recently due to its popularity in a variety of machine
learning applications including meta-learning (Franceschi et al., 2018; Bertinetto et al., 2018; Rajeswaran
et al., 2019; Ji et al., 2020a), hyperparameter optimization (Franceschi et al., 2018; Shaban et al., 2019;
Feurer & Hutter, 2019), reinforcement learning (Konda & Tsitsiklis, 2000; Hong et al., 2020), and signal
processing (Kunapuli et al., 2008; Flamary et al., 2014). In this paper, we consider the bilevel optimization
problem that takes the following formulation.

min_{x∈R^p} Φ(x) := f(x, y^*(x))   s.t.   y^*(x) = arg min_{y∈R^q} g(x, y),   (1)

where the outer- and inner-level functions f and g are both jointly continuously differentiable. We focus on
the setting where the lower-level function g is strongly convex with respect to (w.r.t.) y with the condition
number κ = L/µ (where L and µ are the gradient Lipschitz and strong-convexity coefficients, defined respectively
∗ Department of EECS, University of Michigan, Ann Arbor; e-mail: [email protected]
† Department of CS, George Mason University; e-mail: [email protected]
‡ Department of ECE, The Ohio State University; e-mail: [email protected]
§ Department of EECS, University of Michigan, Ann Arbor; e-mail: [email protected]

Table 1: Comparison of computational complexities of four AID-BiO implementations for finding an ε-accurate
stationary point. For a fair comparison, gradient descent (GD) is used to solve the linear system for all
algorithms. MV(ε): the total number of Jacobian- and Hessian-vector product computations. Gc(ε): the
total number of gradient computations. Õ hides ln κ factors.

Algorithms                  | Q          | N                                    | MV(ε)           | Gc(ε)
BA (Ghadimi & Wang, 2018)   | Θ(κ ln κ)  | Θ((k+1)^{1/4}) (k: iteration number) | Õ(κ^5 ε^{-1})   | Õ(κ^5 ε^{-1.25})
AID-BiO (Ji et al., 2021)   | Θ(κ ln κ)  | Θ(κ ln κ)                            | Õ(κ^4 ε^{-1})   | Õ(κ^4 ε^{-1})
N-Q-loop AID (this paper)   | Θ(κ ln κ)  | Θ(κ ln κ)                            | Õ(κ^4 ε^{-1})   | Õ(κ^4 ε^{-1})
Q-loop AID (this paper)     | Θ(κ ln κ)  | 1                                    | Õ(κ^6 ε^{-1})   | Õ(κ^5 ε^{-1})
N-loop AID (this paper)     | O(1)       | Θ(κ ln κ)                            | Õ(κ^4 ε^{-1})   | Õ(κ^5 ε^{-1})
No-loop AID (this paper)    | O(1)       | 1                                    | Õ(κ^6 ε^{-1})   | Õ(κ^6 ε^{-1})

in Assumptions 1 and 3 in Section 3), and the outer-level objective function Φ(x) is possibly nonconvex
w.r.t. x. Such types of geometries arise in many applications including meta-learning (which uses the last layer
of neural networks as adaptation parameters), hyperparameter optimization (e.g., data hyper-cleaning and
regularized logistic regression) and learning in communication networks (e.g., network utility maximization).
A variety of algorithms have been proposed to solve the bilevel optimization problem in eq. (1). For
example, Hansen et al. (1992); Shi et al. (2005); Moore (2010) proposed constraint-based approaches by
replacing the inner-level problem with its optimality conditions as constraints. In comparison, gradient-based
bilevel algorithms have received intensive attention recently due to their effectiveness and simplicity; these
include two popular approaches via approximate implicit differentiation (AID) (Domke, 2012; Pedregosa,
2016; Grazzi et al., 2020; Ji et al., 2021) and iterative differentiation (ITD) (Maclaurin et al., 2015; Franceschi
et al., 2017; Shaban et al., 2019). Readers can refer to Appendix A for an expanded list of related work.
Consider the AID-based bilevel approach (which we call AID-BiO). Its base iteration loop updates the
variable x until convergence. Within such a base loop, it needs to solve two sub-problems: finding a nearly
optimal solution of the inner-level function via N iterations, and approximating the outer-level Hessian-
inverse-vector product via Q iterations. If Q and N are chosen to be large, then the corresponding iterations
form additional loops of iterations within the base loop, which we respectively call the Q-loop and the N-loop.
Thus, AID-BiO has four popular implementations depending on the choices of N and Q: N-loop
(with large N = Θ(κ ln κ) and small Q = O(1)), N-Q-loop (with large N = Θ(κ ln κ) and large Q = Θ(κ ln κ)),
Q-loop (with N = 1 and Q = Θ(κ ln κ)), and No-loop (with N = 1 and Q = O(1)). Note that No-loop
refers to no additional loops within the base loop, and can be understood as a conventional single-(base)-loop
algorithm. These implementations can significantly affect the efficiency of AID-BiO. Generally, a large Q
(i.e., a Q-loop) provides a good approximation of the Hessian-inverse-vector product for the hypergradient
computation, and a large N (i.e., an N-loop) finds an accurate optimal point of the inner function. Hence, an
algorithm with an N-loop and a Q-loop requires fewer base-loop steps to converge, but each such base-loop step
requires more computation due to these loops. On the other hand, small Q and/or N avoids loop computations
in each base-loop step, but can cause the algorithm to take many more base-loop steps to converge.
An intriguing question here is which implementation is overall most efficient and whether AID-BiO benefits
from having N -loop and/or Q-loop. Existing theoretical studies on AID-BiO are far from answering this

Table 2: Comparison of computational complexities of two ITD-BiO implementations for finding an ε-accurate
stationary point. For a fair comparison, gradient descent (GD) is used to solve the inner-level problem. The
analysis in Ji et al. (2021) for ITD-BiO assumes that the inner-loop minimizer y^*(x_k) is bounded at the k-th
iteration, which is not required in our analysis. µ: the strong-convexity constant of the inner-level function g(x, ·).
For the last two columns, 'N/A' means that the complexities to achieve an ε-accuracy are not measurable
due to the non-vanishing convergence error.

Algorithms                  | N          | Convergence rate   | MV(ε)           | Gc(ε)
ITD-BiO (Ji et al., 2021)   | Θ(κ ln κ)  | O(κ^3/K + ε)       | Õ(κ^4 ε^{-1})   | Õ(κ^4 ε^{-1})
N-N-loop ITD (this paper)   | Θ(κ ln κ)  | O(κ^3/K + ε)       | Õ(κ^4 ε^{-1})   | Õ(κ^4 ε^{-1})
No-loop ITD (this paper)    | Θ(1)       | O(κ^3/K + κ^3)     | N/A             | N/A
Lower bound (this paper)    | Θ(1)       | Ω(κ^2)             | N/A             | N/A

question. The studies of Ghadimi & Wang (2018) and Ji et al. (2021) on deterministic AID-BiO focused only on the
N-Q-loop scheme. A few studies analyzed stochastic AID-BiO, such as Li et al. (2021) on No-loop, and
Hong et al. (2020); Khanduri et al. (2021) on Q-loop. Those studies were not refined enough to capture the
computational differences among different implementations, and collectively they did not cover all four
implementations either.

• The first contribution of this paper lies in the development of a unified convergence theory for
AID-BiO, which is applicable to all choices of N and Q. We further specialize our general theorems
to provide the computational complexity for all of the above four implementations (as summarized
in Table 1). Comparison among them suggests that AID-BiO does benefit from both N -loop and
Q-loop. This is in contrast to minimax optimization (a special case of bilevel optimization), where
it is shown in Lin et al. (2020); Zhang et al. (2020) that (No-loop) gradient descent ascent (GDA)
with N = 1 often outperforms (N -loop) GDA with N = κ ln κ (here N denotes the number of
ascent iterations for each descent iteration). The reason is that the gradient w.r.t. x in bilevel
optimization involves additional second-order derivatives (which do not exist in minimax optimization)
and which are more sensitive to the accuracy of the optimal point of the inner function. A large N
therefore finds a more accurate inner solution, and is hence more beneficial for bilevel optimization
than for minimax optimization.

Differently from AID-BiO, the ITD-based bilevel approach (which we call ITD-BiO) constructs the
outer-level hypergradient estimate via backpropagation along the N-loop iteration path, and Q = N always
holds. Thus, ITD-BiO has only two implementation choices: N-N-loop (with large N = Θ(κ ln κ)) and No-loop
(with small N = O(1)). Here, N-N-loop and No-loop also refer to additional loops for solving sub-problems
within ITD-BiO's base loop of updating the variable x. The only convergence rate analysis of ITD-BiO
was provided in Ji et al. (2021), but only for N-N-loop, which does not reveal how N-N-loop compares with
No-loop. It remains an open question whether ITD-BiO benefits from N-loops.
• The second contribution of this paper lies in the development of a unified convergence theory for
ITD-BiO, which is applicable to all values of N . We then specialize our general theorem to provide
the computational complexity for both of the above implementations (as summarized in Table 2). We
further develop a convergence lower bound, which suggests that N -N -loop is necessary to guarantee

Algorithm 1 AID-based bilevel optimization (AID-BiO) with double warm starts
1: Input: Stepsizes α, β, η > 0, initializations x_0, y_0, v_0.
2: for k = 0, 1, 2, ..., K do
3:   Set y_k^0 = y_{k-1}^N if k > 0 and y_0 otherwise (warm start initialization)
4:   for t = 1, ..., N do
5:     Update y_k^t = y_k^{t-1} − α ∇_y g(x_k, y_k^{t-1})
6:   end for
7:   Hypergradient estimation:
     Set v_k^0 = v_{k-1}^Q if k > 0 and v_0 otherwise (warm start initialization).
     Solve v_k^Q from ∇_y^2 g(x_k, y_k^N) v = ∇_y f(x_k, y_k^N) iteratively with Q steps, stepsize η and initialization v_k^0
     Compute ∇̂Φ(x_k) = ∇_x f(x_k, y_k^N) − ∇_x∇_y g(x_k, y_k^N) v_k^Q
8:   Update x_{k+1} = x_k − β ∇̂Φ(x_k)
9: end for

a vanishing convergence error, whereas the no-loop scheme suffers from an unavoidable non-vanishing
convergence error.
The technical contribution of this paper is two-fold. For AID methods, most existing studies including Ji et al.
(2021) solve the linear system with large Q = Θ(κ log κ) so that the upper-level Hessian-inverse-vector product
approximation error can vanish. In contrast, we allow arbitrary (possibly small) Q, and hence this upper-level
error can be large and nondecreasing, posing a key challenge to guarantee convergence. We come up with a
novel idea to prove the convergence by showing that this error, not by itself but jointly with the inner-loop
error, admits an (approximately) iteratively decreasing property, which bounds the hypergradient error and
yields convergence. The analysis contains new developments to handle the coupling between this error and
the inner-loop error, which is critical in our proof. For ITD methods, unlike existing studies including Ji
et al. (2021), we remove the boundedness assumption on y ∗ (x) via a novel error analysis over the entire
execution rather than a single iteration. Our analysis tools are general and can be extended to stochastic and
acceleration bilevel optimizers.

2 Algorithms
2.1 AID-based Bilevel Optimization Algorithm
As shown in Algorithm 1, we present the general AID-based bilevel optimizer (which we refer to as AID-BiO for
short). At each iteration k of the base loop, AID-BiO first executes N steps of gradient descent (GD) on the
inner function g(x, y) to find an approximation point y_k^N, where N can be chosen either at a constant level or
as large as N = Θ(κ ln κ) (which forms an N-loop of iterations). Moreover, to accelerate practical training
and achieve a stronger performance guarantee, AID-BiO often adopts a warm-start strategy by setting the
initialization y_k^0 of each N-loop to be the output y_{k-1}^N of the preceding N-loop rather than a random start.
To update the outer variable, AID-BiO adopts gradient descent, approximating the true gradient
∇Φ(x_k) of the outer function w.r.t. x (called the hypergradient), which takes the following form:

(True hypergradient:)   ∇Φ(x_k) = ∇_x f(x_k, y^*(x_k)) − ∇_x∇_y g(x_k, y^*(x_k)) v_k^*,   (2)

where v_k^* is the solution of the linear system ∇_y^2 g(x_k, y^*(x_k)) v = ∇_y f(x_k, y^*(x_k)). To approximate the above
true hypergradient, AID-BiO first solves v_k^Q as an approximate solution to the linear system ∇_y^2 g(x_k, y_k^N) v =
∇_y f(x_k, y_k^N), using Q steps of GD with stepsize η starting from v_k^0. Here, Q can also be chosen
either at a constant level or as large as Q = κ ln(κ/µ) (which forms a Q-loop of iterations). Note that a warm
start is also adopted here by setting v_k^0 = v_{k-1}^Q, which is critical to achieve the convergence guarantee for
small Q. If Q is large enough, e.g., of order κ ln(κ/ε), a zero initialization v_k^0 = 0 suffices to solve the
linear system well. Then, AID-BiO constructs a hypergradient estimator ∇̂Φ(x_k) given by

(AID-based hypergradient estimate:)   ∇̂Φ(x_k) = ∇_x f(x_k, y_k^N) − ∇_x∇_y g(x_k, y_k^N) v_k^Q.   (3)

Note that the execution of AID-BiO involves only Hessian-vector products (in solving the linear system) and
the Jacobian-vector product ∇_x∇_y g(x_k, y_k^N) v_k^Q, which are more computationally tractable than forming full
second-order derivatives.
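For completeness, eq. (2) follows from applying the implicit function theorem to the inner optimality condition; this is a standard derivation (not specific to this paper), sketched below.

```latex
% Chain rule on \Phi(x) = f(x, y^*(x)):
\nabla \Phi(x_k) = \nabla_x f(x_k, y^*(x_k))
  + \Big(\tfrac{\partial y^*(x_k)}{\partial x_k}\Big)^{\top} \nabla_y f(x_k, y^*(x_k)).
% Differentiating the inner optimality condition \nabla_y g(x, y^*(x)) = 0 w.r.t. x yields
\frac{\partial y^*(x)}{\partial x}
  = -\big[\nabla_y^2 g(x, y^*(x))\big]^{-1} \nabla_y \nabla_x g(x, y^*(x)),
% so that, with v_k^* := [\nabla_y^2 g(x_k, y^*(x_k))]^{-1} \nabla_y f(x_k, y^*(x_k)),
\nabla \Phi(x_k) = \nabla_x f(x_k, y^*(x_k)) - \nabla_x \nabla_y g(x_k, y^*(x_k))\, v_k^*.
```

This is exactly eq. (2); the estimator in eq. (3) replaces y^*(x_k) by y_k^N and v_k^* by v_k^Q.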
It is clear that different choices of N and Q lead to four implementations within the base loop of AID-BiO:
N -loop (with large N = κ ln κ and small Q = O(1)), N -Q-loop (with large N = κ ln κ and Q = κ ln κ),
Q-loop (with small N = 1 and large Q = κ ln κ) and No-loop (with small N = 1 and Q = O(1)). In Section 4,
we will establish a unified convergence theory for AID-BiO applicable to all its implementations in order to
formally compare their computational efficiency.
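To make the four implementations concrete, the following is a minimal numerical sketch of Algorithm 1 (not the paper's code) on a scalar toy problem g(x, y) = (µ/2)(y − cx)^2 and f(x, y) = (x^2 + y^2)/2, for which all derivatives in the algorithm have closed forms; the constants µ, c and all stepsizes are illustrative choices.

```python
# Minimal sketch of AID-BiO (Algorithm 1) on a scalar toy problem.
# g(x, y) = (mu/2)(y - c*x)^2 and f(x, y) = (x^2 + y^2)/2, so every
# derivative is a closed-form scalar; mu, c and the stepsizes below are
# illustrative choices, not values from the paper.
mu, c = 1.0, 2.0

def grad_y_g(x, y):  return mu * (y - c * x)   # inner gradient
def hess_yy_g(x, y): return mu                 # second y-derivative of g (scalar)
def jac_xy_g(x, y):  return -mu * c            # mixed x,y-derivative of g
def grad_x_f(x, y):  return x
def grad_y_f(x, y):  return y

def aid_bio(N, Q, K=200, alpha=0.5, eta=0.5, beta=0.05, x=1.0):
    y, v = 0.0, 0.0                  # double warm starts across base steps
    for _ in range(K):
        for _ in range(N):           # N-loop: GD on the inner problem
            y -= alpha * grad_y_g(x, y)
        for _ in range(Q):           # Q-loop: GD on the linear system
            v -= eta * (hess_yy_g(x, y) * v - grad_y_f(x, y))
        hypergrad = grad_x_f(x, y) - jac_xy_g(x, y) * v   # eq. (3)
        x -= beta * hypergrad        # base-loop update of the outer variable
    return x

# Phi(x) = (1 + c^2) x^2 / 2 is minimized at x = 0; on this toy instance all
# four choices of (N, Q) drive x toward 0, at different per-step costs.
for name, N, Q in [("N-Q-loop", 10, 10), ("N-loop", 10, 1),
                   ("Q-loop", 1, 10), ("No-loop", 1, 1)]:
    print(f"{name:8s} |x_K| = {abs(aid_bio(N, Q)):.2e}")
```

With the warm starts, even the No-loop variant (N = Q = 1) converges on this toy instance; the theory in Section 4 quantifies how the loop sizes trade per-step cost against the number of base-loop steps.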

2.2 ITD-Based Bilevel Optimization Algorithm


As shown in Algorithm 2, the ITD-based bilevel optimizer (which we refer to as ITD-BiO) updates the inner
variable y similarly to AID-BiO, and obtains the N-step output y_k^N of GD with a warm-start initialization.
ITD-BiO differs from AID-BiO mainly in its estimation of the hypergradient. Without leveraging the
implicit gradient formulation, ITD-BiO computes the direct derivative ∂f(x_k, y_k^N)/∂x_k via automatic
differentiation for hypergradient approximation. Since y_k^N depends on x_k through the N-loop iterative GD
updates, the execution of ITD-BiO backpropagates over the entire N-loop trajectory. To elaborate,
it can be shown via the chain rule that the hypergradient estimate takes the following form:

∂f(x_k, y_k^N)/∂x_k = ∇_x f(x_k, y_k^N) − α Σ_{t=0}^{N-1} ∇_x∇_y g(x_k, y_k^t) Π_{j=t+1}^{N-1} (I − α∇_y^2 g(x_k, y_k^j)) ∇_y f(x_k, y_k^N).

As shown in this equation, the differentiation does not compute the second-order derivatives directly but
instead computes the more tractable and economical Hessian-vector products ∇_y^2 g(x_k, y_k^{j-1}) v_j, j = 1, ..., N
(and similarly Jacobian-vector products), where each v_j is obtained recursively via v_{j-1} = (I − α∇_y^2 g(x_k, y_k^j)) v_j
with v_N = ∇_y f(x_k, y_k^N).
Clearly, the implementation of ITD-BiO implies that N = Q always holds. Hence, ITD-BiO admits only
two possible architectures within its base loop: N-N-loop (with large N = Θ(κ ln(κ/ε))) and No-loop (with small
N = 1). In Section 5, we will establish a unified convergence theory for ITD-BiO applicable to both of its
implementations in order to formally compare their computational efficiency.
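The forward-then-backward computation above can be sketched in a few lines on a scalar toy problem (an illustrative instance, not the paper's code); a finite-difference check confirms that the backward recursion reproduces d f(x, y^N(x))/dx.

```python
# Sketch of the ITD-BiO hypergradient (backpropagation through the N inner
# GD steps) on a scalar toy problem. g = (mu/2)(y - c*x)^2, f = (x^2 + y^2)/2;
# mu, c, alpha are illustrative choices, not values from the paper.
mu, c, alpha = 1.0, 2.0, 0.5

def grad_y_g(x, y):  return mu * (y - c * x)
def hess_yy_g(x, y): return mu
def jac_xy_g(x, y):  return -mu * c
def f(x, y):         return 0.5 * (x * x + y * y)
def grad_x_f(x, y):  return x
def grad_y_f(x, y):  return y

def itd_hypergrad(x, y0, N):
    ys = [y0]
    for _ in range(N):                        # forward pass: N GD steps on g
        ys.append(ys[-1] - alpha * grad_y_g(x, ys[-1]))
    yN = ys[-1]
    v = grad_y_f(x, yN)                       # v_N = grad_y f(x, y^N)
    total = 0.0
    for t in range(N - 1, -1, -1):            # backward recursion
        total += jac_xy_g(x, ys[t]) * v       # accumulate Jacobian-vector term
        v *= 1.0 - alpha * hess_yy_g(x, ys[t])  # v_t = (I - alpha*H_t) v_{t+1}
    return grad_x_f(x, yN) - alpha * total

def fd_hypergrad(x, y0, N, h=1e-6):           # finite-difference reference
    def phi(x):
        y = y0
        for _ in range(N):
            y -= alpha * grad_y_g(x, y)
        return f(x, y)
    return (phi(x + h) - phi(x - h)) / (2 * h)

print(itd_hypergrad(1.0, 0.0, 5), fd_hypergrad(1.0, 0.0, 5))
```

Note that only Hessian-vector and Jacobian-vector products are formed, matching the discussion above; no second-order matrix is ever materialized.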

3 Definitions and Assumptions


This paper focuses on the following types of objective functions.

Assumption 1. The inner-level function g(x, y) is µ-strongly-convex w.r.t. y.

Since the objective function Φ(x) in eq. (1) is possibly nonconvex, algorithms are expected to find an
ε-accurate stationary point, defined as follows.

Definition 1. We say x̄ is an ε-accurate stationary point for the bilevel optimization problem given in eq. (1)
if ‖∇Φ(x̄)‖^2 ≤ ε, where x̄ is the output of an algorithm.

Algorithm 2 ITD-based bilevel optimization algorithm (ITD-BiO) with warm start
1: Input: Stepsizes α, β > 0, initializations x_0 and y_0.
2: for k = 0, 1, 2, ..., K do
3:   Set y_k^0 = y_{k-1}^N if k > 0 and y_0 otherwise (warm start initialization)
4:   for t = 1, ..., N do
5:     Update y_k^t = y_k^{t-1} − α ∇_y g(x_k, y_k^{t-1})
6:   end for
7:   Compute ∇̂Φ(x_k) = ∂f(x_k, y_k^N)/∂x_k via backpropagation w.r.t. x_k
8:   Update x_{k+1} = x_k − β ∇̂Φ(x_k)
9: end for

In order to compare the performance of different bilevel algorithms, we adopt the following metrics of
computational complexity.

Definition 2. Let Gc(ε) be the number of gradient evaluations, and MV(ε) be the total number of Jacobian-
and Hessian-vector product evaluations, required to achieve an ε-accurate stationary point of the bilevel
optimization problem in eq. (1).

Let z = (x, y). We take the following standard assumptions, as also widely adopted by Ghadimi & Wang
(2018); Ji et al. (2020a).

Assumption 2. The gradients ∇f(z) and ∇g(z) are L-Lipschitz, i.e., for any z, z′,

‖∇f(z) − ∇f(z′)‖ ≤ L‖z − z′‖,   ‖∇g(z) − ∇g(z′)‖ ≤ L‖z − z′‖.

As shown in eq. (2), the gradient of the objective function Φ(x) involves the second-order derivatives
∇_x∇_y g(z) and ∇_y^2 g(z). The following assumption imposes Lipschitz conditions on these higher-order
derivatives, as also made in Ghadimi & Wang (2018).

Assumption 3. The derivatives ∇_x∇_y g(z) and ∇_y^2 g(z) are ρ-Lipschitz, i.e., for any z, z′,

‖∇_x∇_y g(z) − ∇_x∇_y g(z′)‖ ≤ ρ‖z − z′‖,   ‖∇_y^2 g(z) − ∇_y^2 g(z′)‖ ≤ ρ‖z − z′‖.

To guarantee the boundedness of the hypergradient estimation error, existing works (Ghadimi & Wang, 2018;
Ji et al., 2020a; Grazzi et al., 2020) assume that the gradient ∇f(z) is bounded for all z = (x, y). Instead, we
make a weaker boundedness assumption on the gradient ∇_y f(x, y^*(x)).

Assumption 4. There exists a constant M such that, for any x, ‖∇_y f(x, y^*(x))‖ ≤ M.

For the case where the total objective function Φ(·) has some benign structure, e.g., convexity or strong
convexity, Assumption 4 can be removed via an induction analysis showing that all iterates are bounded, as in Ji &
Liang (2021). Assumption 4 can also be removed by projecting x onto a bounded constraint set X.

4 Convergence Analysis of AID-BiO


As we describe in Section 2.1, AID-BiO can have four possible implementations depending on whether N
and Q are chosen to be large enough to form an N -loop and/or Q-loop. In this section, we will provide the
convergence analysis and characterize the overall computational complexity for all of the four implementations,
which will provide the general guidance on which algorithmic architecture is computationally most efficient.

4.1 Convergence Rate and Computational Complexity
In this subsection, we develop two unified theorems for AID-BiO, both of which are applicable to all
regimes of N and Q. We then specialize these theorems to provide the complexity bounds (as corollaries) for
the four implementations of AID-BiO. It turns out that the first theorem provides tighter complexity bounds
for the implementations with small Q = Θ(1), and the second theorem provides tighter complexity bounds for
the implementations with large Q = Θ(κ ln(κ/ε)). Our presentation of the corollaries below will thus focus only
on the tighter bounds. The following theorem provides our first unified convergence analysis for AID-BiO.

Theorem 1. Suppose Assumptions 1, 2, 3 and 4 hold. Choose the parameters α, η and λ such that

(1 + λ)(1 − αµ)^N (1 + 4r(1 + 2/(ηµ)) L^2 C_Q^2) ≤ 1 − ηµ,

where r = 1/(ρM/µ + L)^2 and C_Q = Q(1 − ηµ)^{Q-1} ρMη/µ + (1 − (1 − ηµ)^Q)(1 + ηQµ) ρM/µ^2 +
(1 − (1 − ηµ)^Q) L/µ. Let L_Φ = L + (2L^2 + ρM^2 + L^3)/µ + 2ρLM/µ^2 + ρL^2 M/µ^3 be the smoothness
parameter of Φ(·), and define

w̃ := (3λrL^2/((1 − ηµ)ηµ)) (1 + ρ^2 M^2/(L^2 µ^2)) (16(1 − ηµ)^{2Q}/L + 4(1 − ηµ)ηµ (1 + 1/(ηµ))(1/L + 1/µ^2)) + 3λL^2/µ^2.

Choose the outer stepsize β = min{1/(12L_Φ), µ^2 √(ηµ)/(18L √(2w̃))}. Then,

(1/K) Σ_{k=0}^{K-1} ‖∇Φ(x_k)‖^2 ≤ 8(Φ(x_0) − Φ(x^*))/(βK)
  + (21L^2/(ηµK)) ((1 + ρ^2 M^2/(L^2 µ^2)) ‖y_0^* − y_0‖^2 + (3M/µ + (2L/µ) ‖y_0^* − y_0‖)^2).   (4)

Theorem 1 also elaborates the precise requirements on the stepsizes α, η and β and on the auxiliary parameter
λ, which take complicated forms. In the following, by further specifying these parameters, we characterize
the complexities of AID-BiO in more explicit forms. We focus on the implementations with Q = Θ(1) (for
which Theorem 1 specializes to tighter bounds than Theorem 2 below), namely the N-loop scheme
(with N = Θ(κ ln κ)) and the No-loop scheme (with N = 1).

Corollary 1 (N-loop). Consider N-loop AID-BiO with N = Θ(κ ln κ) and Q = Θ(1), where κ = L/µ denotes
the condition number of the inner problem. Under the same setting as Theorem 1, choose η = 1/L, α = 1/L,
and λ = 1. Then, we have (1/K) Σ_{k=0}^{K-1} ‖∇Φ(x_k)‖^2 = O(κ^4/K + κ^3/K), and the complexity to achieve an ε-accurate
stationary point is Gc(ε) = Õ(κ^5 ε^{-1}), MV(ε) = Õ(κ^4 ε^{-1}).

Corollary 2 (No-loop). Consider No-loop AID-BiO with N = 1 and Q = Θ(1). Under the same setting of
Theorem 1, choose the parameters α = 1/L, λ = αµ/2 and η = min{αµ/(128Q^2 L^2), α/4, 1/(µQ)}. Then,
(1/K) Σ_{k=0}^{K-1} ‖∇Φ(x_k)‖^2 = O(κ^6/K + κ^5/K), and the complexity is Gc(ε) = Õ(κ^6 ε^{-1}), MV(ε) = Õ(κ^6 ε^{-1}).

The analysis of Theorem 1 can be further improved in the large-Q regime, which guarantees a sufficiently
small outer-level approximation error and helps to relax the requirement on the stepsize η. Such an adaptation
yields the following alternative unified convergence characterization for AID-BiO, which is applicable to all
Q and N but specializes to tighter complexity bounds than Theorem 1 in the large-Q regime. For simplicity,
we set the initialization v_k^0 = 0 in Algorithm 1.

Theorem 2. Suppose Assumptions 1, 2, 3 and 4 hold. Define τ = (1 − αµ)^N (1 + λ) + 6(1 + λ^{-1})(L^2 + ρ^2 M^2 µ^{-2} +
2L^2 C_Q^2) L^2 β^2 µ^{-2} and w = 6(1 − αµ)^N (L^2 + ρ^2 M^2 µ^{-2} + 2L^2 C_Q^2)(1 + λ^{-1}) L^2 µ^{-2}, where C_Q is the positive constant
defined in Theorem 1. Choose the parameters α, β such that τ < 1 and βL_Φ + wβ^2 (1/2 + βL_Φ)/(1 − τ) ≤ 1/4 hold.
Then, the output of AID-BiO satisfies

(1/K) Σ_{k=0}^{K-1} ‖∇Φ(x_k)‖^2 ≤ 4(Φ(x_0) − Φ(x^*))/(βK) + (3/K) δ_0/(1 − τ) + (27L^2 M^2/µ^2)(1 − ηµ)^{2Q},

where δ_0 = 3(L^2 + ρ^2 M^2/µ^2 + 2L^2 C_Q^2)(1 − αµ)^N ‖y_0^* − y_0‖^2 measures the initial distance.

We next specialize Theorem 2 to obtain the complexity for the two implementations of AID-BiO with
Q = Θ(κ ln(κ/ε)): N-Q-loop (with N = Θ(κ ln κ)) and Q-loop (with N = 1), as shown in the following two
corollaries. For each case, we set the parameters λ, η and α in Theorem 2 appropriately.

Corollary 3 (N-Q-loop). Consider N-Q-loop AID-BiO with N = Θ(κ ln κ) and Q = Θ(κ ln(κ/ε)). Under the
same setting as Theorem 2, choose η = α = 1/L, λ = 1 and β = Θ(κ^{-3}). Then, (1/K) Σ_{k=0}^{K-1} ‖∇Φ(x_k)‖^2 = O(κ^3/K + ε),
and the complexity is Gc(ε) = Õ(κ^4 ε^{-1}), MV(ε) = Õ(κ^4 ε^{-1}).

Corollary 4 (Q-loop). Consider Q-loop AID-BiO with N = 1 and Q = Θ(κ ln(κ/ε)). Under the same setting as
Theorem 2, choose α = η = 1/L, λ = αµ/2 and β = Θ(κ^{-4}). Then, (1/K) Σ_{k=0}^{K-1} ‖∇Φ(x_k)‖^2 = O(κ^5/K + κ^4/K + ε), and
the complexity is Gc(ε) = Õ(κ^5 ε^{-1}), MV(ε) = Õ(κ^6 ε^{-1}).

Discussion on hyperparameter selection for different implementations. For all loop sizes, we set
the hyperparameters to achieve the best complexity as long as convergence is guaranteed. Let us elaborate on
N-loop (Corollary 1) and No-loop (Corollary 2). At the proof level, λ needs to satisfy (1 − αµ)^N (1 + λ) < 1 (see
Lemma 2) to guarantee convergence; otherwise the inner-loop error explodes. Given this requirement,
for N-loop with N = Θ(κ log κ), λ = Θ(1) achieves the best complexity. However, for No-loop with N = 1, the
requirement becomes (1 − αµ)(1 + λ) < 1, and λ = Θ(µ) achieves the best complexity. The stepsize η appears in
the term (1 − αµ)^N µη ‖y_{k-1}^N − y_{k-1}^*‖^2 (see Lemma 1) of the error ‖v_k^Q − v_k^*‖^2. Given the requirement (1 − αµ)^N µη < 1, for
N-loop with N = Θ(κ log κ), η = Θ(1) achieves the best complexity, whereas for No-loop with N = 1, the best
choice is η = Θ(µ). At the conceptual level, the estimates of the hypergradient and of the linear-system
solution both contain the inner-loop error ‖y_k^N − y_k^*‖^2. For N = 1, this per-iteration error is large, and hence smaller
stepsizes λ, η, β are needed to keep the accumulated error from exploding. A similar argument holds for
N-Q-loop and Q-loop.

4.2 Comparison among Four Implementations

Impact of N-loop (N = 1 vs. N = Θ(κ ln κ)). We fix Q and compare how the choice of N affects the
computational complexity. First, let Q = Θ(1), and compare the two implementations
N-loop with N = Θ(κ ln κ) (Corollary 1) and No-loop with N = 1 (Corollary 2). Clearly, the N-loop scheme
significantly improves the convergence rate of the No-loop scheme from O(κ^6/K) to O(κ^4/K), and improves the
matrix-vector and gradient complexities from Õ(κ^6 ε^{-1}) and Õ(κ^6 ε^{-1}) to Õ(κ^4 ε^{-1}) and Õ(κ^5 ε^{-1}), respectively.
Intuitively, the hypergradient estimation involves a coupled error η‖y_k^N − y^*(x_k)‖ induced by
solving the linear system ∇_y^2 g(x_k, y_k^N) v = ∇_y f(x_k, y_k^N) with stepsize η. Therefore, a smaller inner-level
approximation error ‖y_k^N − y^*(x_k)‖ allows a more aggressive stepsize η, and hence yields a faster convergence
rate as well as a lower total complexity, as also demonstrated in our experiments. It is worth noting that such
a comparison is generally different from that in minimax optimization (Lin et al., 2020; Zhang et al., 2020),
where alternating (i.e., No-loop) gradient descent ascent (GDA) with N = 1 outperforms (N-loop) GDA
with N = κ ln κ, where N denotes the number of ascent iterations for each descent iteration. The reason is
that, in contrast to minimax optimization, the gradient w.r.t. x in bilevel optimization involves additional
second-order derivatives, which are more sensitive to the inner-level approximation error; a
larger N is therefore more beneficial for bilevel optimization than for minimax optimization. Similarly, if we instead fix
Q = Θ(κ ln κ), the N-Q-loop scheme with N = Θ(κ ln κ) (Corollary 3) significantly outperforms the Q-loop
scheme with N = 1 (Corollary 4) in terms of the convergence rate and complexity.
Impact of Q-loop (Q = Θ(1) vs. Q = Θ(κ ln(κ/ε))). We fix N and characterize the impact of the choice of Q on
the complexity. For N = 1, comparing No-loop with Q = Θ(1) in Corollary 2 and Q-loop with Q = Θ(κ ln(κ/ε))
in Corollary 4 shows that both choices of Q yield the same matrix-vector complexity Õ(κ^6 ε^{-1}), but Q-loop
with a larger Q improves the gradient complexity of No-loop with Q = Θ(1) from Õ(κ^6 ε^{-1}) to Õ(κ^5 ε^{-1}).
A similar phenomenon can be observed for N = Θ(κ ln κ), based on the comparison between N-Q-loop in
Corollary 3 and N-loop in Corollary 1.
In deep learning. Note also that in settings where the matrix-vector complexity dominates the gradient
complexity, e.g., in deep learning, the two choices of Q do not affect the total computational complexity.
However, a smaller Q reduces the per-iteration load on computational resources and memory, and
is hence preferred in practical applications with large models.
Comparison among four implementations. Comparing the complexity results in Corollaries 1, 2,
3 and 4, we see that N-Q-loop and N-loop (both with large N = Θ(κ ln κ)) achieve the best
matrix-vector complexity Õ(κ^4 ε^{-1}), whereas Q-loop and No-loop (both with N = 1) require a higher
matrix-vector complexity of Õ(κ^6 ε^{-1}). Also note that N-Q-loop has the lowest gradient complexity. This
suggests that introducing an inner loop with large N helps to reduce the total computational
complexity.

5 Convergence Analysis of ITD-BiO


In this section, we first provide a unified theory for ITD-BiO, applicable to all choices of N, and then
specialize it to characterize the computational complexity of the two implementations
of ITD-BiO: No-loop and N-N-loop. We also provide a convergence lower bound justifying the necessity
of choosing a large N to achieve a vanishing convergence error. The following theorem characterizes the
convergence rate of ITD-BiO for all choices of N.
Theorem 3. Suppose Assumptions 1, 2, 3 and 4 hold. Define w = (1 + 2/(αµ)) (4M^2 w_N^2 L^2/µ^2) (1 − αµ)^N λ_N + L^2/µ^2
and τ = N^2 (1 − αµ)^N + w_N^2 + λ_N (1 − αµ)^N, where λ_N and w_N are given by

λ_N = (4M^2 w_N^2 + 4(1 − αµ/4) L^2 (1 + αLN)^2) / (1 − αµ/4 − (1 − αµ)^N (1 + αµ/4)),

w_N = (α^2 ρ + αρL (1 − (1 − αµ)^{N/2})/(1 − √(1 − αµ))) (1 − αµ)^{(N-1)/2} (1 − (1 − αµ)^{N/2})/(1 − √(1 − αµ)).

Choose the parameters such that β ≤ √((1 − αµ/4)/(2w)), α ≤ 1/(2L), and βL_Φ + (1/2 + βL_Φ) wβ^2 < 1/4,
where L_Φ = L + (2L^2 + ρM^2 + L^3)/µ + 2ρLM/µ^2 + ρL^2 M/µ^3 denotes the smoothness parameter of
Φ(·). Then, we have

(1/K) Σ_{k=0}^{K-1} ‖∇Φ(x_k)‖^2 ≤ O(∆_Φ/(βK) + τ∆_y/(µ^2 K) + (1 − αµ)^{2N} M^2/(µ^3 K) + (1 − αµ)^{2N} L^2 M^2/(αµ^3)),   (5)

where ∆_Φ = Φ(x_0) − min_x Φ(x) and ∆_y = ‖y_0 − y^*(x_0)‖^2.

In Theorem 3, the upper bound on the convergence rate of ITD-BiO contains a convergent term O(1/K)
(which goes to zero sublinearly in K) and an error term O(M^2 L^2 (1 − αµ)^{2N}/(αµ^3)) (which is independent of K,
and possibly non-vanishing if N is chosen to be small). To show that such a possibly non-vanishing error
term fundamentally exists when N is small, we next provide the following lower bound on the
convergence of ITD-BiO.

Theorem 4 (Lower Bound). Consider the ITD-BiO algorithm in Algorithm 2 with α ≤ 1/L, β ≤ 1/L_Φ and
N ≤ O(1), where L_Φ is the smoothness parameter of Φ(x). There exist objective functions f(x, y) and g(x, y)
satisfying Assumptions 1, 2, 3 and 4 such that, for all iterates x_K (where K ≥ 1) generated by ITD-BiO in
Algorithm 2, ‖∇Φ(x_K)‖^2 ≥ Θ((L^2 M^2/µ^2)(1 − αµ)^{2N}).

Clearly, the error term in the upper bound given in Theorem 3 matches the lower bound given in Theorem 4 in terms of $\frac{M^2L^2}{\mu^2}(1-\alpha\mu)^{2N}$, and there is still a gap on the order of $\alpha\mu$, which requires future efforts to address. Theorem 3 and Theorem 4 together indicate that in order to achieve an $\epsilon$-accurate stationary point, $N$ has to be chosen as large as $N = \Theta(\kappa\log\frac{\kappa}{\epsilon})$. This corresponds to the $N$-$N$-loop implementation of ITD-BiO, where a large $N$ achieves a highly accurate hypergradient estimation in each step. Another, No-loop, implementation chooses a small constant-level $N = \Theta(1)$ to achieve an efficient execution per step, since a large $N$ can cause large memory usage and computation cost. Following from Theorem 3 and Theorem 4, such a No-loop implementation necessarily suffers from a non-vanishing error.
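The non-vanishing bias for small $N$ can be seen concretely on a toy quadratic bilevel instance (our own illustrative construction, not the worst-case example used in the proof of Theorem 4): with inner objective $g(x,y) = \frac{\mu}{2}(y-ax)^2$ and outer objective $f(x,y) = \frac{1}{2}(y-b)^2$, ITD through $N$ inner GD steps produces a hypergradient whose bias decays like $(1-\alpha\mu)^N$, so a constant $N$ leaves a constant error floor. A minimal sketch:

```python
# Toy quadratic bilevel instance (illustrative, not the paper's construction):
# inner g(x, y) = (mu/2) * (y - a*x)**2, outer f(x, y) = 0.5 * (y - b)**2,
# so y*(x) = a*x, Phi(x) = 0.5 * (a*x - b)**2 and the exact hypergradient is
# a * (a*x - b). ITD instead differentiates through N inner GD steps.
mu, a, b = 0.5, 2.0, 1.0
alpha = 1.0 / (2 * mu)      # inner stepsize, so alpha * mu = 1/2
x, y0 = 1.5, 0.0            # outer variable and inner initialization

def itd_hypergradient(N):
    """Hypergradient estimate after differentiating through N inner GD steps."""
    y, dy_dx = y0, 0.0                            # y^t and dy^t/dx
    for _ in range(N):
        y = y - alpha * mu * (y - a * x)          # inner GD step
        dy_dx = dy_dx - alpha * mu * (dy_dx - a)  # chain rule through the step
    return dy_dx * (y - b)                        # outer chain rule

exact = a * (a * x - b)
for N in [1, 5, 20, 100]:
    # The bias decays geometrically at rate (1 - alpha*mu)**N, matching the
    # (1 - alpha*mu)**(2N) floor (in squared norm) of Theorems 3 and 4.
    print(N, abs(itd_hypergradient(N) - exact), (1 - alpha * mu) ** N)
```

Running this shows the bias shrinking geometrically in $N$: stopping at any fixed $N$ freezes the bias at a constant level, which is exactly the No-loop error floor.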
In the following corollaries, we further specialize Theorem 3 to obtain the complexity analysis for ITD-BiO
under the two aforementioned implementations of ITD-BiO.

Corollary 5 ($N$-$N$-loop). Consider $N$-$N$-loop ITD-BiO with $N = \Theta(\kappa\ln\frac{\kappa}{\epsilon})$. Under the same setting as Theorem 3, choose $\beta = \min\big\{\sqrt{\frac{\alpha\mu}{40w}}, \sqrt{\frac{1-\frac{1}{4}\alpha\mu}{2w}}, \frac{1}{8L_\Phi}\big\}$ and $\alpha = \frac{1}{2L}$. Then, $\frac{1}{K}\sum_{k=0}^{K-1}\|\nabla\Phi(x_k)\|^2 = \mathcal{O}\big(\frac{\kappa^3}{K} + \epsilon\big)$, and the complexity is $\mathrm{Gc}(\epsilon) = \widetilde{\mathcal{O}}(\kappa^4\epsilon^{-1})$, $\mathrm{MV}(\epsilon) = \widetilde{\mathcal{O}}(\kappa^4\epsilon^{-1})$.

Corollary 5 shows that for a large $N = \Theta(\kappa\ln\frac{\kappa}{\epsilon})$, we can guarantee that ITD-BiO converges to an $\epsilon$-accurate stationary point, and the gradient and matrix-vector product complexities are given by $\widetilde{\mathcal{O}}(\kappa^4\epsilon^{-1})$. We note that Ji et al. (2021) also analyzed ITD-BiO with $N = \Theta(\kappa\ln\frac{\kappa}{\epsilon})$, and provided the same complexities as our results in Corollary 5. In comparison, our analysis has several differences. First, Ji et al. (2021) assumed that the minimizer $y^*(x_k)$ at the $k$-th iteration is bounded, whereas our analysis does not impose this assumption. Second, the bound in Ji et al. (2021) involves an additional error term $\max_{k=1,\dots,K}\|y^*(x_k)\|^2\frac{L^2M^2(1-\alpha\mu)^N}{\mu^4}$, which can be very large (or even unbounded) under the standard Assumptions 1, 2, 3 and 4. We next characterize the convergence for the small $N = \Theta(1)$.

Corollary 6 (No-loop). Consider No-loop ITD-BiO with $N = \Theta(1)$. Under the same setting as Theorem 3, choose stepsizes $\alpha = \frac{1}{2NL}$ and $\beta = \min\big\{\sqrt{\frac{\alpha\mu}{40w}}, \sqrt{\frac{1-\frac{1}{4}\alpha\mu}{2w}}, \frac{1}{8L_\Phi}\big\}$. Then, we have $\frac{1}{K}\sum_{k=0}^{K-1}\|\nabla\Phi(x_k)\|^2 = \mathcal{O}\big(\frac{\kappa^3}{K} + \frac{M^2L^2}{\alpha\mu^3}\big)$.

Corollary 6 indicates that for the constant-level $N = \Theta(1)$, the convergence bound contains a non-vanishing error $\mathcal{O}(\frac{M^2L^2}{\alpha\mu^3})$. As shown by the convergence lower bound in Theorem 4, under the standard Assumptions 1, 2, 3 and 4, such an error is unavoidable. Comparison between the above two corollaries suggests that for ITD-BiO, the $N$-$N$-loop is necessary to guarantee a vanishing convergence error, whereas No-loop necessarily suffers from a non-vanishing convergence error.
Discussion on the setting with small response Jacobian. Our results in Theorem 3 and Theorem 4 apply to the general functions whose first- and second-order derivatives are Lipschitz continuous, i.e., under Assumptions 2 and 3. Here, we further discuss the extension of our results to another setting where the response Jacobian is extremely small. This setting occurs in some deep learning applications (Finn et al., 2017; Ji et al., 2020a), where the response Jacobian $\frac{\partial y^*(x)}{\partial x}$ (which is estimated by $\frac{\partial y^N(x_k)}{\partial x_k}$ with a large $N$) can be orders of magnitude smaller than network gradients. Based on eq. (60) and eq. (62) in the appendix, it can be shown that the convergence error is proportional to the quantity $\frac{1}{K}\sum_{k=0}^{K-1}\big\|\frac{\partial y^*(x_k)}{\partial x_k}\big\|^2$, and hence the constant-level $N = \Theta(1)$ can still achieve a small error in this setting.

6 Empirical Verification
Experiments on AID-BiO. We first conduct experiments to verify our theoretical results in Corollaries 1, 2,
3 and 4 on AID-BiO with different implementations. We consider the following hyperparameter optimization problem:
$$\min_\lambda\ \mathcal{L}_{\mathcal{D}_{\text{val}}}(\lambda) = \frac{1}{|\mathcal{D}_{\text{val}}|}\sum_{\xi\in\mathcal{D}_{\text{val}}}\mathcal{L}(w^*;\xi), \quad \text{s.t.}\quad w^* = \arg\min_w \frac{1}{|\mathcal{D}_{\text{tr}}|}\sum_{\xi\in\mathcal{D}_{\text{tr}}}\mathcal{L}(w;\xi) + \frac{\lambda}{2}\|w\|_2^2,$$

where Dtr and Dval stand for training and validation datasets, L(w; ξ) denotes the loss function induced by
the model parameter w and sample ξ, and λ > 0 denotes the regularization parameter. The goal is to find a
good hyperparameter λ to minimize the validation loss evaluated at the optimal model parameters for the
regularized empirical risk minimization problem.
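As a concrete (hypothetical) instance of this problem, the sketch below runs AID-BiO on ridge regression with a single regularization hyperparameter: $N$ inner GD steps estimate $w^*$, $Q$ GD steps on the linear system estimate $v = (\nabla_w^2 g)^{-1}\nabla_w f$, and the hypergradient $-w^\top v$ updates $\lambda$ (since $\nabla_\lambda\nabla_w g = w$ for the $\ell_2$ regularizer). All data and constants here are our own toy choices; the actual experiments use MNIST with a neural loss.

```python
import numpy as np

# AID-BiO sketch on a ridge-regression instance of the problem above
# (hypothetical toy data, not the paper's MNIST setup).
rng = np.random.default_rng(0)
d, n_tr, n_val = 5, 60, 40
Xtr, Xval = rng.normal(size=(n_tr, d)), rng.normal(size=(n_val, d))
w_true = rng.normal(size=d)
ytr = Xtr @ w_true + 0.5 * rng.normal(size=n_tr)
yval = Xval @ w_true + 0.5 * rng.normal(size=n_val)

def aid_bio(lam, K=50, N=20, Q=20, alpha=0.05, eta=0.05, beta=0.01):
    w, v = np.zeros(d), np.zeros(d)        # warm-started across outer steps
    for _ in range(K):
        for _ in range(N):                 # N-loop: inner GD on g(lam, .)
            w -= alpha * (Xtr.T @ (Xtr @ w - ytr) / n_tr + lam * w)
        H = Xtr.T @ Xtr / n_tr + lam * np.eye(d)   # inner Hessian (small d;
        gf = Xval.T @ (Xval @ w - yval) / n_val    # use Hessian-vector
        for _ in range(Q):                         # products for large d)
            v -= eta * (H @ v - gf)        # Q-loop: GD on the linear system
        lam = max(lam - beta * (-w @ v), 1e-6)     # outer hypergradient step
    return lam, w

lam_opt, w_fit = aid_bio(lam=1.0)
```

With $N = Q = 1$ the same code realizes the No-loop variant; the warm starts of $w$ and $v$ across outer iterations are what the unified convergence analysis relies on.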

Figure 1: Training and test losses v.s. running time (seconds) on MNIST with different Q and N .

From Figure 1, we can make the following observations. First, the learning curves with N = 20 are
significantly better than those with N = 1, indicating that running multiple steps of gradient descent in the
inner loop (i.e., N > 1) is crucial for fast convergence. This observation is consistent with our complexity result
that N -loop is better than No-loop, and N -Q-loop is better than Q-loop, as shown in Table 1. The reason is
that a more accurate hypergradient estimation can accelerate the convergence rate and lead to a reduction in the Jacobian- and Hessian-vector product complexity. Second, N -Q-loop (N = 20, Q = 20) and
N -loop (N = 20, Q = 1) achieve a comparable convergence performance, and a similar observation can be
made for Q-loop (N = 1, Q = 20) and No-loop (N = 1, Q = 1). This is also consistent with the complexity
result provided in Table 1, where different choices of Q do not affect the dominant matrix-vector complexity.
Experiments on ITD-BiO. We consider a hyper-representation problem in Sow et al. (2021), where
the inner problem is to find optimal regression parameters w and the outer procedure is to find the best
representation parameters $\lambda$. Specifically, the bilevel problem takes the following form:
$$\min_\lambda\ \Phi(\lambda) = \frac{1}{2p}\|h(X_V;\lambda)w^* - Y_V\|^2, \quad \text{s.t.}\quad w^* = \arg\min_w \frac{1}{2q}\|h(X_T;\lambda)w - Y_T\|^2 + \frac{\gamma}{2}\|w\|^2,$$

where XT ∈ Rq×m and XV ∈ Rp×m are synthesized training and validation data, YT ∈ Rq , YV ∈ Rp are their
response vectors, and h(·) is a linear transformation. The generation of XT , XV , YT , YV and the experimental
setup follow from Sow et al. (2021). We choose N = 20 for N -N -loop ITD and N = 1 for No-loop ITD. The
results are reported with the best-tuned hyperparameters.
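The following sketch mirrors this experiment in a deliberately simplified form: we take $h(X;\lambda) = \lambda X$ with a scalar $\lambda$ (our own simplification, so that the ITD backpropagation through the inner GD loop can be written by hand as a forward-mode derivative $dw/d\lambda$, without an autodiff library), on toy synthetic data.

```python
import numpy as np

# ITD-BiO sketch for the hyper-representation problem above, simplified to
# h(X; lam) = lam * X with scalar lam (toy data; the paper follows the
# setup of Sow et al. (2021) with a full linear transformation).
rng = np.random.default_rng(1)
m, q, p = 5, 200, 100
XT, XV = rng.normal(size=(q, m)), rng.normal(size=(p, m))
w_star = rng.normal(size=m)
YT, YV = 2.0 * XT @ w_star, 2.0 * XV @ w_star      # ground truth at lam = 2
gamma, alpha, beta = 0.01, 0.3, 0.01

def itd_bio(lam, K=500, N=20):
    for _ in range(K):
        # The inner objective (1/2q)||lam*XT w - YT||^2 + (gamma/2)||w||^2
        # is quadratic; track w and dw/dlam through every inner GD step.
        A = lam**2 * XT.T @ XT / q + gamma * np.eye(m)
        dA = 2 * lam * XT.T @ XT / q
        b, db = lam * XT.T @ YT / q, XT.T @ YT / q
        w, dw = np.zeros(m), np.zeros(m)           # cold start, for simplicity
        for _ in range(N):
            w, dw = (w - alpha * (A @ w - b),
                     dw - alpha * (dA @ w + A @ dw - db))
        r = lam * XV @ w - YV                      # validation residual
        loss = r @ r / (2 * p)
        lam -= beta * (r @ (XV @ w + lam * XV @ dw) / p)   # ITD hypergradient
    return lam, loss

_, loss_big_N = itd_bio(lam=1.0, N=20)             # N-N-loop
_, loss_small_N = itd_bio(lam=1.0, N=1)            # No-loop
```

In this toy run the $N = 20$ variant reaches a much smaller validation loss than the $N = 1$ variant, which stalls at a non-vanishing floor, qualitatively matching Table 3.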
Table 3 indicates that N -N -loop with N = 20 achieves a small loss value of 0.004 after 500 total iterations, whereas No-loop with N = 1 converges to a much larger loss value of 0.04. This is consistent with our theoretical results in Table 2, where N = 1 can cause a non-vanishing error.

Algorithm k = 10 k = 50 k = 100 k = 500 k = 1000
N -N -loop ITD 9.32 0.11 0.01 0.004 0.004
No-loop ITD 435 6.9 0.04 0.04 0.04

Table 3: Validation loss v.s. the number of iterations for ITD-based algorithms.

7 Conclusion
In this paper, we study two popular bilevel optimizers AID-BiO and ITD-BiO, whose implementations
potentially involve additional loops of iterations within their base-loop update. By developing unified
convergence analysis for all choices of the loop parameters, we are able to provide formal comparison among
different implementations. Our result suggests that N -loops are beneficial for better computational efficiency
for AID-BiO and for better convergence accuracy for ITD-BiO. This is in contrast to conventional minimax
optimization, where the No-loop (i.e., single-base-loop) scheme achieves better computational efficiency. Our analysis techniques can be useful for studying other bilevel optimizers such as stochastic optimizers and variance-reduced optimizers.

References
Luca Bertinetto, Joao F Henriques, Philip Torr, and Andrea Vedaldi. Meta-learning with differentiable
closed-form solvers. In International Conference on Learning Representations (ICLR), 2018.

Tianyi Chen, Yuejiao Sun, and Wotao Yin. A single-timescale stochastic bilevel optimization method. arXiv
preprint arXiv:2102.04671, 2021.

Justin Domke. Generic methods for optimization-based modeling. In Artificial Intelligence and Statistics
(AISTATS), pp. 318–326, 2012.

Matthias Feurer and Frank Hutter. Hyperparameter optimization. In Automated Machine Learning, pp. 3–33.
Springer, Cham, 2019.

Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep
networks. In Proc. International Conference on Machine Learning (ICML), pp. 1126–1135, 2017.

Rémi Flamary, Alain Rakotomamonjy, and Gilles Gasso. Learning constrained task similarities in graph-regularized multi-task learning. Regularization, Optimization, Kernels, and Support Vector Machines, pp. 103, 2014.

Luca Franceschi, Michele Donini, Paolo Frasconi, and Massimiliano Pontil. Forward and reverse gradient-based
hyperparameter optimization. In International Conference on Machine Learning (ICML), pp. 1165–1173,
2017.

Luca Franceschi, Paolo Frasconi, Saverio Salzo, Riccardo Grazzi, and Massimiliano Pontil. Bilevel programming
for hyperparameter optimization and meta-learning. In International Conference on Machine Learning
(ICML), pp. 1568–1577, 2018.

Saeed Ghadimi and Mengdi Wang. Approximation methods for bilevel programming. arXiv preprint
arXiv:1802.02246, 2018.

Riccardo Grazzi, Luca Franceschi, Massimiliano Pontil, and Saverio Salzo. On the iteration complexity of
hypergradient computation. In Proc. International Conference on Machine Learning (ICML), 2020.

Zhishuai Guo and Tianbao Yang. Randomized stochastic variance-reduced methods for stochastic bilevel
optimization. arXiv preprint arXiv:2105.02266, 2021.

Zhishuai Guo, Yi Xu, Wotao Yin, Rong Jin, and Tianbao Yang. On stochastic moving-average estimators for
non-convex optimization. arXiv preprint arXiv:2104.14840, 2021.

Pierre Hansen, Brigitte Jaumard, and Gilles Savard. New branch-and-bound rules for linear bilevel program-
ming. SIAM Journal on Scientific and Statistical Computing, 13(5):1194–1217, 1992.

Mingyi Hong, Hoi-To Wai, Zhaoran Wang, and Zhuoran Yang. A two-timescale framework for bilevel
optimization: Complexity analysis and application to actor-critic. arXiv preprint arXiv:2007.05170, 2020.

Minhui Huang, Kaiyi Ji, Shiqian Ma, and Lifeng Lai. Efficiently escaping saddle points in bilevel optimization.
arXiv preprint arXiv:2202.03684, 2022.

Kaiyi Ji and Yingbin Liang. Lower bounds and accelerated algorithms for bilevel optimization. arXiv preprint
arXiv:2102.03926, 2021.

Kaiyi Ji, Jason D Lee, Yingbin Liang, and H Vincent Poor. Convergence of meta-learning with task-specific
adaptation over partial parameters. arXiv preprint arXiv:2006.09486, 2020a.

Kaiyi Ji, Junjie Yang, and Yingbin Liang. Multi-step model-agnostic meta-learning: Convergence and
improved algorithms. arXiv preprint arXiv:2002.07836, 2020b.

Kaiyi Ji, Junjie Yang, and Yingbin Liang. Bilevel optimization: Convergence analysis and enhanced design.
In International Conference on Machine Learning, pp. 4882–4892. PMLR, 2021.

Prashant Khanduri, Siliang Zeng, Mingyi Hong, Hoi-To Wai, Zhaoran Wang, and Zhuoran Yang. A
near-optimal algorithm for stochastic bilevel optimization via double-momentum. arXiv preprint
arXiv:2102.07367, 2021.

Vijay R Konda and John N Tsitsiklis. Actor-critic algorithms. In Advances in neural information processing
systems (NeurIPS), pp. 1008–1014, 2000.

Gautam Kunapuli, Kristin P Bennett, Jing Hu, and Jong-Shi Pang. Classification model selection via bilevel
programming. Optimization Methods & Software, 23(4):475–489, 2008.

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document
recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Junyi Li, Bin Gu, and Heng Huang. Improved bilevel model: Fast and optimal algorithm with theoretical
guarantee. arXiv preprint arXiv:2009.00690, 2020.

Junyi Li, Bin Gu, and Heng Huang. A fully single loop algorithm for bilevel optimization without hessian
inverse. arXiv preprint arXiv:2112.04660, 2021.

Tianyi Lin, Chi Jin, and Michael Jordan. On gradient descent ascent for nonconvex-concave minimax
problems. In International Conference on Machine Learning (ICML), pp. 6083–6093. PMLR, 2020.

Risheng Liu, Pan Mu, Xiaoming Yuan, Shangzhi Zeng, and Jin Zhang. A generic first-order algorithmic
framework for bi-level programming beyond lower-level singleton. In International Conference on Machine
Learning (ICML), 2020.

Risheng Liu, Xuan Liu, Xiaoming Yuan, Shangzhi Zeng, and Jin Zhang. A value-function-based interior-point
method for non-convex bi-level optimization. In International Conference on Machine Learning (ICML),
2021a.

Risheng Liu, Yaohua Liu, Shangzhi Zeng, and Jin Zhang. Towards gradient-based bilevel optimization with
non-convex followers and beyond. Advances in Neural Information Processing Systems (NeurIPS), 34,
2021b.

Dougal Maclaurin, David Duvenaud, and Ryan Adams. Gradient-based hyperparameter optimization through
reversible learning. In International Conference on Machine Learning (ICML), pp. 2113–2122, 2015.

Gregory M Moore. Bilevel programming algorithms for machine learning model selection. Rensselaer
Polytechnic Institute, 2010.

Fabian Pedregosa. Hyperparameter optimization with approximate gradient. In International Conference on Machine Learning (ICML), pp. 737–746, 2016.

Aravind Rajeswaran, Chelsea Finn, Sham M Kakade, and Sergey Levine. Meta-learning with implicit
gradients. In Advances in Neural Information Processing Systems (NeurIPS), pp. 113–124, 2019.

Amirreza Shaban, Ching-An Cheng, Nathan Hatch, and Byron Boots. Truncated back-propagation for
bilevel optimization. In International Conference on Artificial Intelligence and Statistics (AISTATS), pp.
1723–1732, 2019.

Chenggen Shi, Jie Lu, and Guangquan Zhang. An extended kuhn–tucker approach for linear bilevel
programming. Applied Mathematics and Computation, 162(1):51–63, 2005.

Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In Advances in
Neural Information Processing Systems (NIPS), 2017.

Daouda Sow, Kaiyi Ji, and Yingbin Liang. Es-based jacobian enables faster bilevel optimization. arXiv
preprint arXiv:2110.07004, 2021.

Daouda Sow, Kaiyi Ji, Ziwei Guan, and Yingbin Liang. A constrained optimization approach to bilevel
optimization with multiple inner minima. arXiv preprint arXiv:2203.01123, 2022.

Junjie Yang, Kaiyi Ji, and Yingbin Liang. Provably faster algorithms for bilevel optimization. Advances in
Neural Information Processing Systems (NeurIPS), 34, 2021.

Jiawei Zhang, Peijun Xiao, Ruoyu Sun, and Zhiquan Luo. A single-loop smoothed gradient descent-ascent
algorithm for nonconvex-concave min-max problems. Advances in Neural Information Processing Systems
(NeurIPS), 33:7377–7389, 2020.

Supplementary Materials

A Expanded Related Work


Gradient-based bilevel optimization. A number of gradient-based bilevel algorithms have been proposed
via AID- and ITD-based hypergradient approximations. For example, AID-based hypergradient computa-
tion (Domke, 2012; Pedregosa, 2016; Ghadimi & Wang, 2018; Grazzi et al., 2020; Ji et al., 2021; Huang
et al., 2022) estimates the Hessian-inverse-vector product by solving a linear system with an efficient iterative
algorithm. ITD-based hypergradient computation (Maclaurin et al., 2015; Franceschi et al., 2017, 2018; Finn
et al., 2017; Shaban et al., 2019; Ji et al., 2020a) involves a backpropagation over the inner-loop gradient-based
optimization path. Convergence rate of AID- and ITD-based bilevel methods has been studied recently. For
example, Ghadimi & Wang (2018); Ji et al. (2021) and Ji et al. (2021, 2020a) analyzed the convergence rate
and complexity of AID- and ITD-based bilevel algorithms, respectively. Ji & Liang (2021) characterized the
lower complexity bounds for a class of gradient-based bilevel algorithms. As we mentioned before, previous
studies on the convergence rate of deterministic AID-BiO (Ghadimi & Wang, 2018; Ji et al., 2021) focused
only on N -Q-loop, and the only convergence rate analysis on ITD-BiO (Ji et al., 2021) was for N -N -loop.
Our study here develops unified convergence analysis for all N and Q regimes.
Some works (Liu et al., 2020, 2021a; Li et al., 2020; Sow et al., 2022) studied the convex inner-level
objective function with multiple minimizers. Liu et al. (2021b) proposed an initialization auxiliary method
for the setting where the inner-level problem is generally nonconvex.
Stochastic bilevel optimization. A variety of stochastic bilevel optimization algorithms have been
proposed recently. For example, Ghadimi & Wang (2018); Hong et al. (2020); Ji et al. (2021) proposed
stochastic gradient descent (SGD) type of bilevel algorithms, and analyzed their convergence rate and
complexity. Some works (Guo & Yang, 2021; Guo et al., 2021; Yang et al., 2021; Khanduri et al., 2021; Chen
et al., 2021) then further improved the complexity of SGD type methods using techniques such as variance
reduction, momentum acceleration and adaptive learning rate. Sow et al. (2021) proposed a Hessian-free
stochastic Evolution Strategies (ES)-based bilevel algorithm with performance guarantee. Although our
study mainly focuses on deterministic bilevel optimization, our techniques can be extended to provide refined
analysis for stochastic bilevel optimization to capture the order scaling with κ, which is not captured in most
of the above studies on stochastic bilevel optimization.
Bilevel optimization for machine learning. Bilevel optimization has shown promise in many machine
learning applications such as hyperparameter optimization (Pedregosa, 2016; Franceschi et al., 2018; Ji et al.,
2021) and few-shot meta-learning (Finn et al., 2017; Snell et al., 2017; Rajeswaran et al., 2019; Franceschi
et al., 2018; Bertinetto et al., 2018; Ji et al., 2020a,b). For example, Snell et al. (2017); Bertinetto et al.
(2018) introduced an outer-level procedure to learn a common embedding model for all tasks. Ji et al. (2020a)
analyzed the convergence rate for meta-learning with task-specific adaptation on partial parameters.

B Further Specifications on Hyperparameter Optimization Experiments
We follow the setting of Yang et al. (2021) to set up the experiment. We first randomly sample 20000 training samples and 10000 test samples from the MNIST dataset (LeCun et al., 1998) with 10 classes, and then add label noise to 10% of the data. The label noise is uniform across all labels from label 0 to label 9. We test
algorithms with different values of Q and N to verify our theoretical results. Each algorithm's learning rates for the inner and outer loops are tuned over {0.1, 0.01, 0.001}, and we report the result with the
best-tuned learning rates. We run 5 random seeds and report the average result. All experiments are run
over a single NVIDIA Tesla P100 GPU. The implementations of our experiments are based on the code of Ji
et al. (2021), which is under MIT License.

C Proof Sketch of Theorem 1


The proof of Theorem 1 contains three major steps, which include 1) decomposing the hypergradient
approximation error into the N -loop error in estimating the inner-level solution and the Q-loop error in
solving the linear system approximately, 2) upper-bounding such two types of errors based on the hypergradient
approximation errors at previous iterations, and 3) combining all results in the previous steps and proving
the convergence guarantee. More detailed steps can be found as below.
Step 1: decomposing hypergradient approximation error.
We first show that the hypergradient approximation error at the $k$-th iteration is bounded by
$$\|\widehat\nabla\Phi(x_k) - \nabla\Phi(x_k)\|^2 \le \Big(3L^2 + \frac{3\rho^2M^2}{\mu^2}\Big)\underbrace{\|y_k^* - y_k^N\|^2}_{N\text{-loop estimation error}} + 3L^2\underbrace{\|v_k^* - v_k^Q\|^2}_{Q\text{-loop estimation error}}, \qquad (6)$$
where the right-hand side contains the two types of errors induced by solving the inner-level problem and the outer-level linear system. Note that for general choices of $N$ and $Q$, these two errors cannot be guaranteed to be sufficiently small, but fortunately we show via the following results that they contain iteratively decreasing components which facilitate the final convergence.
Step 2: upper-bounding linear system approximation error.
We then show that the $Q$-loop error $\|v_k^* - v_k^Q\|^2$ in solving the linear system is bounded by
$$\|v_k^Q - v_k^*\|^2 \le \mathcal{O}\Big(\big((1+\eta\mu)(1-\eta\mu)^{2Q} + w\beta^2\big)\|v_{k-1}^Q - v_{k-1}^*\|^2 + \big(\eta^2(1-\alpha\mu)^N + w\beta^2\big)\|y_{k-1}^N - y_{k-1}^*\|^2 + w\beta^2\|\nabla\Phi(x_{k-1})\|^2\Big). \qquad (7)$$
Note that if the stepsize $\beta$ is chosen to be sufficiently small, the right-hand side of eq. (7) contains an iteratively decreasing term $\big((1+\eta\mu)(1-\eta\mu)^{2Q} + w\beta^2\big)\|v_{k-1}^Q - v_{k-1}^*\|^2$, an error term $\big(\eta^2(1-\alpha\mu)^N + w\beta^2\big)\|y_{k-1}^N - y_{k-1}^*\|^2$ induced by the $N$-loop updates, and a gradient-norm term $w\beta^2\|\nabla\Phi(x_{k-1})\|^2$ that captures the increment between two adjacent iterations. Similarly, we upper-bound the $N$-loop updating error $\|y_k^* - y_k^N\|^2$ by
between two adjacent iterations. Similarly, we upper-bound the N -loop updating error kyk∗ − ykN k2 by

kykN − yk∗ k2 ≤ O (1 + λ)(1 − αµ)N + (1 + λ−1 )β 2 kyk−1 ∗
 N
− yk−1 k2

Q
+ (1 + λ−1 )β 2 kvk−1 ∗
− vk−1 k2 + (1 + λ−1 )β 2 k∇Φ(xk−1 )k2 , (8)

where τ = 1 + λ1 is inversely proportional to λ. Note that we introduce an auxiliary variable λ in the first
error term at the right hand side of eq. (8) to allow for a general choice of N . To see this, to guarantee
that (1 + λ)(1 − αµ)N + (1 + λ−1 )β 2 < 1, a larger N allows for a smaller λ. As a result, the outer-level stepsize
β can be chosen more aggressively, which hence yields a faster convergence rate but at a cost of N steps
of N -loop updates. On the other hand, if N is chosen to be small, e.g., N = 1, λ needs to be as small as
λ = Θ(αµ). As a result, β needs to be smaller, and hence yields a slower convergence rate but with a more
efficient N -loop update.

16
Step 3: combining Steps 1 and 2.
Combining eq. (6), eq. (7) and eq. (8), we upper-bound the hypergradient estimation error as
$$\|\widehat\nabla\Phi(x_k) - \nabla\Phi(x_k)\|^2 \le \mathcal{O}\Big((1-\tau)^k + \omega\beta^2\sum_{j=0}^{k-1}(1-\tau)^j\|\nabla\Phi(x_{k-1-j})\|^2\Big),$$
which, combined with the $L_\Phi$-smoothness property of $\Phi(\cdot)$ and a proper choice of $\beta$, yields the final convergence result.

D Proof of Theorem 1
We first provide some auxiliary lemmas to characterize the hypergradient approximation errors.

Lemma 1. Suppose Assumptions 1, 2, 3 and 4 are satisfied. Let $v_k^* = (\nabla_y^2 g(x_k,y_k^*))^{-1}\nabla_y f(x_k,y_k^*)$ with $y_k^* = \arg\min_y g(x_k,y)$. Then, we have
$$\|v_k^Q - v_k^*\|^2 \le (1+\eta\mu)(1-\eta\mu)^{2Q}\|v_{k-1}^Q - v_{k-1}^*\|^2 + 2\Big(1+\frac{1}{\eta\mu}\Big)C_Q^2\|y_k^* - y_k^N\|^2 + 2(1-\eta\mu)^{2Q}\Big(1+\frac{1}{\eta\mu}\Big)\Big(\frac{L}{\mu}+\frac{M\rho}{\mu^2}\Big)^2\Big(\frac{L}{\mu}+1\Big)^2\|x_k - x_{k-1}\|^2,$$
where $C_Q = \frac{Q(1-\eta\mu)^{Q-1}\rho M\eta}{\mu} + \frac{1-(1-\eta\mu)^Q(1+\eta Q\mu)}{\mu^2}\rho M + \big(1-(1-\eta\mu)^Q\big)\frac{L}{\mu}$.

Proof. Let $v_k^q$ be the $q$-th ($q = 0,\dots,Q-1$) GD iterate for solving the linear system $\nabla_y^2 g(x_k,y_k^N)v = \nabla_y f(x_k,y_k^N)$, which can be written in the following iterative way:
$$v_k^{q+1} = (I - \eta\nabla_y^2 g(x_k,y_k^N))v_k^q + \eta\nabla_y f(x_k,y_k^N). \qquad (9)$$
Then, telescoping eq. (9) over $q$ from $0$ to $Q$ yields
$$v_k^Q = (I - \eta\nabla_y^2 g(x_k,y_k^N))^Q v_k^0 + \eta\sum_{q=0}^{Q-1}(I - \eta\nabla_y^2 g(x_k,y_k^N))^q\,\nabla_y f(x_k,y_k^N). \qquad (10)$$
Similarly, based on the definition of $v_k^*$, it can be derived that the following equation holds:
$$v_k^* = (I - \eta\nabla_y^2 g(x_k,y_k^*))^Q v_k^* + \eta\sum_{q=0}^{Q-1}(I - \eta\nabla_y^2 g(x_k,y_k^*))^q\,\nabla_y f(x_k,y_k^*). \qquad (11)$$

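The recursion in eq. (9) and its telescoped form eq. (10) can also be checked numerically: on a toy strongly convex quadratic system (the matrix $H$ below is a stand-in for $\nabla_y^2 g$ with eigenvalues in $[\mu, L]$; all constants are our own choices), the iterates contract toward $H^{-1}b$ at rate $1-\eta\mu$ per step, and the closed form reproduces the iterate exactly.

```python
import numpy as np

# Check of eq. (9)/(10): iterating v <- (I - eta*H) v + eta*b solves H v = b,
# contracting at rate (1 - eta*mu) per step (H is a toy stand-in for the
# inner Hessian grad_y^2 g with eigenvalues in [mu, L]).
rng = np.random.default_rng(2)
d, mu, L = 4, 0.5, 2.0
U, _ = np.linalg.qr(rng.normal(size=(d, d)))
H = U @ np.diag(np.linspace(mu, L, d)) @ U.T
b = rng.normal(size=d)
eta, Q = 1.0 / L, 30

v, v_star, errors = np.zeros(d), np.linalg.solve(H, b), []
for _ in range(Q):
    v = (np.eye(d) - eta * H) @ v + eta * b        # one step of eq. (9)
    errors.append(float(np.linalg.norm(v - v_star)))

# Telescoped closed form of eq. (10) (with v^0 = 0) matches the iterate v.
M_ = np.eye(d) - eta * H
v_closed = eta * sum(np.linalg.matrix_power(M_, k) for k in range(Q)) @ b
```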
Combining eq. (10) and eq. (11), we next characterize the difference between the estimate $v_k^Q$ and the underlying truth $v_k^*$. Specifically, we have

$$\begin{aligned}\|v_k^Q - v_k^*\| \overset{(i)}{\le}\ & \big\|(I-\eta\nabla_y^2 g(x_k,y_k^N))^Q - (I-\eta\nabla_y^2 g(x_k,y_k^*))^Q\big\|\|v_k^*\| + (1-\eta\mu)^Q\|v_k^0 - v_k^*\| \\
& + \eta\Big\|\sum_{q=0}^{Q-1}(I-\eta\nabla_y^2 g(x_k,y_k^N))^q - \sum_{q=0}^{Q-1}(I-\eta\nabla_y^2 g(x_k,y_k^*))^q\Big\|\|\nabla_y f(x_k,y_k^*)\| + \eta L\sum_{q=0}^{Q-1}\big\|(I-\eta\nabla_y^2 g(x_k,y_k^N))^q\big\|\|y_k^* - y_k^N\| \\
\overset{(ii)}{\le}\ & \frac{M}{\mu}\big\|(I-\eta\nabla_y^2 g(x_k,y_k^N))^Q - (I-\eta\nabla_y^2 g(x_k,y_k^*))^Q\big\| + (1-\eta\mu)^Q\|v_{k-1}^Q - v_k^*\| \\
& + \eta M\Big\|\sum_{q=0}^{Q-1}(I-\eta\nabla_y^2 g(x_k,y_k^N))^q - \sum_{q=0}^{Q-1}(I-\eta\nabla_y^2 g(x_k,y_k^*))^q\Big\| + \big(1-(1-\eta\mu)^Q\big)\frac{L}{\mu}\|y_k^* - y_k^N\|, \end{aligned} \qquad (12)$$
where $(i)$ follows from the strong convexity of $g(x,\cdot)$ and $(ii)$ follows from Assumption 4, the warm-start initialization $v_k^0 = v_{k-1}^Q$, and $\|v_k^*\| \le \|(\nabla_y^2 g(x_k,y_k^*))^{-1}\|\|\nabla_y f(x_k,y_k^*)\| \le \frac{M}{\mu}$. We next provide an upper bound on the quantity $\Delta_q := \|(I-\eta\nabla_y^2 g(x_k,y_k^N))^q - (I-\eta\nabla_y^2 g(x_k,y_k^*))^q\|$ in eq. (12). Specifically, we have

$$\Delta_q \overset{(i)}{\le} (1-\eta\mu)\Delta_{q-1} + (1-\eta\mu)^{q-1}\eta\|\nabla_y^2 g(x_k,y_k^*) - \nabla_y^2 g(x_k,y_k^N)\| \le (1-\eta\mu)\Delta_{q-1} + (1-\eta\mu)^{q-1}\eta\rho\|y_k^N - y_k^*\|, \qquad (13)$$
where $(i)$ follows from the strong convexity of $g(x,\cdot)$ and Assumption 3. Telescoping eq. (13) yields
$$\Delta_q \le (1-\eta\mu)^q\Delta_0 + q(1-\eta\mu)^{q-1}\eta\rho\|y_k^N - y_k^*\| = q(1-\eta\mu)^{q-1}\eta\rho\|y_k^N - y_k^*\|,$$
which, in conjunction with eq. (12), yields
$$\|v_k^Q - v_k^*\| \le Q(1-\eta\mu)^{Q-1}\eta\rho\frac{M}{\mu}\|y_k^N - y_k^*\| + (1-\eta\mu)^Q\|v_{k-1}^Q - v_k^*\| + \eta M\sum_{q=0}^{Q-1}q(1-\eta\mu)^{q-1}\eta\rho\|y_k^N - y_k^*\| + \big(1-(1-\eta\mu)^Q\big)\frac{L}{\mu}\|y_k^* - y_k^N\|. \qquad (14)$$
Based on the fact that $\sum_{q=0}^{Q-1}qx^{q-1} = \frac{1-x^Q-Qx^{Q-1}+Qx^Q}{(1-x)^2} > 0$, we obtain from eq. (14) that

$$\begin{aligned}\|v_k^Q - v_k^*\| \le\ & \frac{Q(1-\eta\mu)^{Q-1}\rho M\eta}{\mu}\|y_k^N - y_k^*\| + (1-\eta\mu)^Q\|v_{k-1}^Q - v_{k-1}^*\| + (1-\eta\mu)^Q\|v_{k-1}^* - v_k^*\| \\
& + \frac{1-(1-\eta\mu)^Q(1+\eta Q\mu)}{\mu^2}\rho M\|y_k^N - y_k^*\| + \big(1-(1-\eta\mu)^Q\big)\frac{L}{\mu}\|y_k^* - y_k^N\|,\end{aligned}$$
which, in conjunction with $\|v_k^* - v_{k-1}^*\| \le \big(\frac{L}{\mu}+\frac{M\rho}{\mu^2}\big)\big(\frac{L}{\mu}+1\big)\|x_k - x_{k-1}\|$ and the Young's inequality $\|a+b\|^2 \le (1+\eta\mu)\|a\|^2 + \big(1+\frac{1}{\eta\mu}\big)\|b\|^2$, completes the proof of Lemma 1.
Lemma 2. Suppose Assumptions 1 and 2 are satisfied. Then,
$$\|y_k^* - y_k^N\|^2 \le (1-\alpha\mu)^N(1+\lambda)\|y_{k-1}^N - y_{k-1}^*\|^2 + (1-\alpha\mu)^N\Big(1+\frac{1}{\lambda}\Big)\frac{L^2}{\mu^2}\|x_k - x_{k-1}\|^2, \qquad (15)$$
where $\lambda$ is a positive constant.
Proof. Note that $y_k^* = \arg\min_y g(x_k,y)$. Using the strong convexity (i.e., Assumption 1) and smoothness (i.e., Assumption 2) of $g(x_k,\cdot)$, we have
$$\|y_k^N - y_k^*\|^2 \le (1-\alpha\mu)^N\|y_k^0 - y_k^*\|^2, \qquad (16)$$
which, in conjunction with the warm-start initialization $y_k^0 = y_{k-1}^N$ and the Young's inequality, yields
$$\|y_k^N - y_k^*\|^2 \le (1+\lambda)(1-\alpha\mu)^N\|y_{k-1}^N - y_{k-1}^*\|^2 + \Big(1+\frac{1}{\lambda}\Big)(1-\alpha\mu)^N\|y_{k-1}^* - y_k^*\|^2 \overset{(i)}{\le} (1+\lambda)(1-\alpha\mu)^N\|y_{k-1}^N - y_{k-1}^*\|^2 + \Big(1+\frac{1}{\lambda}\Big)(1-\alpha\mu)^N\frac{L^2}{\mu^2}\|x_{k-1} - x_k\|^2, \qquad (17)$$
where $(i)$ follows from Lemma 2.2 in Ghadimi & Wang (2018).
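Eq. (16) is the standard linear contraction of GD on a strongly convex and smooth function; it can be verified numerically on a non-quadratic example (an $\ell_2$-regularized logistic-style loss of our own choosing):

```python
import numpy as np

# Check of eq. (16): on a mu-strongly-convex, L-smooth inner objective
# (an l2-regularized logistic-style loss, our own example), N GD steps
# with alpha = 1/L contract the squared distance to the minimizer by at
# least (1 - alpha*mu) per step.
rng = np.random.default_rng(3)
n, d, mu = 20, 3, 0.5
A = rng.normal(size=(n, d))

def grad_g(y):
    # gradient of sum_i log(1 + exp(a_i . y)) + (mu/2)||y||^2
    return A.T @ (1.0 / (1.0 + np.exp(-A @ y))) + mu * y

L = mu + 0.25 * np.linalg.norm(A, 2) ** 2     # smoothness constant
alpha, N = 1.0 / L, 50

y_star = np.zeros(d)                          # solve for y* to high accuracy
for _ in range(20000):
    y_star -= alpha * grad_g(y_star)

y = np.ones(d)                                # y_k^0 (e.g. a warm start)
dist0 = float(np.sum((y - y_star) ** 2))
for _ in range(N):
    y -= alpha * grad_g(y)
# eq. (16): ||y^N - y*||^2 <= (1 - alpha*mu)^N * ||y^0 - y*||^2
```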

Lemma 3. Suppose Assumptions 1, 2, 3 and 4 are satisfied. Choose the parameters such that $(1+\lambda)(1-\alpha\mu)^N\big(1+4r\big(1+\frac{1}{\eta\mu}\big)L^2\big) \le 1-\eta\mu$, where the notation $r = \frac{C_Q^2}{(\rho M/\mu + L)^2}$ with $C_Q$ given in Lemma 1. Then, we have the following inequality:
$$\|\widehat\nabla\Phi(x_k) - \nabla\Phi(x_k)\|^2 \le 3L^2(1-\eta\mu+6wL^2\beta^2)^k\delta_0 + 6wL^2\beta^2\sum_{j=0}^{k-1}(1-\eta\mu+6wL^2\beta^2)^j\|\nabla\Phi(x_{k-1-j})\|^2, \qquad (18)$$
where $\delta_0 := \big(1+\frac{\rho^2M^2}{L^2\mu^2}\big)\|y_0^N - y_0^*\|^2 + \|v_0^Q - v_0^*\|^2$ and the notation $w$ is given by
$$w = \Big(1+\frac{1}{\lambda}\Big)(1-\alpha\mu)^N\Big(1+\frac{\rho^2M^2}{L^2\mu^2}\Big)\frac{L^2}{\mu^2} + 4\Big(1+\frac{1}{\eta\mu}\Big)\frac{L^4}{\mu^2}\Big(1+\frac{\rho^2M^2}{L^2\mu^2}\Big)\Big(\frac{4(1-\eta\mu)^{2Q}}{\mu^2} + r(1-\alpha\mu)^N\Big(1+\frac{1}{\lambda}\Big)\Big). \qquad (19)$$
Proof. Combining Lemma 1 and Lemma 2, we have
$$\begin{aligned}\|v_k^Q - v_k^*\|^2 \le\ & (1+\eta\mu)(1-\eta\mu)^{2Q}\|v_{k-1}^Q - v_{k-1}^*\|^2 + 2(1-\alpha\mu)^N(1+\lambda)\Big(1+\frac{1}{\eta\mu}\Big)C_Q^2\|y_{k-1}^N - y_{k-1}^*\|^2 \\
& + 2(1-\alpha\mu)^N\Big(1+\frac{1}{\lambda}\Big)\Big(1+\frac{1}{\eta\mu}\Big)C_Q^2\frac{L^2}{\mu^2}\|x_{k-1} - x_k\|^2 + 2(1-\eta\mu)^{2Q}\Big(1+\frac{1}{\eta\mu}\Big)\Big(\frac{L}{\mu}+\frac{M\rho}{\mu^2}\Big)^2\Big(\frac{L}{\mu}+1\Big)^2\|x_k - x_{k-1}\|^2,\end{aligned}$$
which, in conjunction with $\big(\frac{L}{\mu}+1\big)^2 \le \frac{4C_Q^2}{\mu^2}$ and the notation $r = \frac{C_Q^2}{(\rho M/\mu+L)^2}$, yields
$$\begin{aligned}\|v_k^Q - v_k^*\|^2 \le\ & (1+\eta\mu)(1-\eta\mu)^{2Q}\|v_{k-1}^Q - v_{k-1}^*\|^2 \\
& + 2\Big(1+\frac{1}{\eta\mu}\Big)\frac{L^2}{\mu^2}\Big(L+\frac{\rho M}{\mu}\Big)^2\Big(\frac{4(1-\eta\mu)^{2Q}}{\mu^2} + r(1-\alpha\mu)^N\Big(1+\frac{1}{\lambda}\Big)\Big)\|x_k - x_{k-1}\|^2 \\
& + 2(1+\lambda)(1-\alpha\mu)^N\Big(1+\frac{1}{\eta\mu}\Big)\Big(\frac{\rho M}{\mu}+L\Big)^2 r\,\|y_{k-1}^N - y_{k-1}^*\|^2. \end{aligned} \qquad (20)$$
Then, combining Lemma 2 and eq. (20), we have
$$\begin{aligned}\Big(1+\frac{\rho^2M^2}{L^2\mu^2}\Big)&\|y_k^N - y_k^*\|^2 + \|v_k^Q - v_k^*\|^2 \\
\le\ & (1+\lambda)(1-\alpha\mu)^N\Big(1+\frac{\rho^2M^2}{L^2\mu^2}\Big)\|y_{k-1}^N - y_{k-1}^*\|^2 + \Big(1+\frac{1}{\lambda}\Big)(1-\alpha\mu)^N\Big(1+\frac{\rho^2M^2}{L^2\mu^2}\Big)\frac{L^2}{\mu^2}\|x_{k-1} - x_k\|^2 \\
& + (1+\eta\mu)(1-\eta\mu)^{2Q}\|v_{k-1}^Q - v_{k-1}^*\|^2 + 4\Big(1+\frac{1}{\eta\mu}\Big)(1+\lambda)\Big(L^2+\frac{\rho^2M^2}{\mu^2}\Big)(1-\alpha\mu)^N r\,\|y_{k-1}^N - y_{k-1}^*\|^2 \\
& + 4\Big(1+\frac{1}{\eta\mu}\Big)\frac{L^4}{\mu^2}\Big(1+\frac{\rho^2M^2}{L^2\mu^2}\Big)\Big(\frac{4(1-\eta\mu)^{2Q}}{\mu^2} + r(1-\alpha\mu)^N\Big(1+\frac{1}{\lambda}\Big)\Big)\|x_{k-1} - x_k\|^2,\end{aligned}$$
which, in conjunction with the definition of $w$ in eq. (19), yields
$$\Big(1+\frac{\rho^2M^2}{L^2\mu^2}\Big)\|y_k^N - y_k^*\|^2 + \|v_k^Q - v_k^*\|^2 \le (1+\lambda)(1-\alpha\mu)^N\Big(1+\frac{\rho^2M^2}{L^2\mu^2}\Big)\Big(1+4r\Big(1+\frac{1}{\eta\mu}\Big)L^2\Big)\|y_{k-1}^N - y_{k-1}^*\|^2 + (1+\eta\mu)(1-\eta\mu)^{2Q}\|v_{k-1}^Q - v_{k-1}^*\|^2 + w\|x_{k-1} - x_k\|^2. \qquad (21)$$
For notational convenience, we define $\delta_k := \big(1+\frac{\rho^2M^2}{L^2\mu^2}\big)\|y_k^N - y_k^*\|^2 + \|v_k^Q - v_k^*\|^2$ as the per-iteration error induced by $y_k^N$ and $v_k^Q$. Then, recalling that $(1+\lambda)(1-\alpha\mu)^N\big(1+4r\big(1+\frac{1}{\eta\mu}\big)L^2\big) \le 1-\eta\mu$, we obtain from eq. (21) that

$$\delta_k \le (1-\eta\mu)\delta_{k-1} + 2w\beta^2\|\nabla\Phi(x_{k-1}) - \widehat\nabla\Phi(x_{k-1})\|^2 + 2w\beta^2\|\nabla\Phi(x_{k-1})\|^2. \qquad (22)$$
Based on the form of $\widehat\nabla\Phi(x_k)$ and $\nabla\Phi(x_k)$ in eq. (3) and eq. (2), we have
$$\|\widehat\nabla\Phi(x_k) - \nabla\Phi(x_k)\|^2 \le 3\|\nabla_x f(x_k,y_k^N) - \nabla_x f(x_k,y_k^*)\|^2 + 3\|\nabla_x\nabla_y g(x_k,y_k^N)\|^2\|v_k^* - v_k^Q\|^2 + 3\|\nabla_x\nabla_y g(x_k,y_k^*) - \nabla_x\nabla_y g(x_k,y_k^N)\|^2\|v_k^*\|^2,$$
which, in conjunction with Assumptions 1, 2, 3 and 4, yields
$$\|\widehat\nabla\Phi(x_k) - \nabla\Phi(x_k)\|^2 \le \Big(3L^2 + \frac{3\rho^2M^2}{\mu^2}\Big)\|y_k^* - y_k^N\|^2 + 3L^2\|v_k^* - v_k^Q\|^2. \qquad (23)$$
Substituting eq. (23) into eq. (22) yields
$$\delta_k \le (1-\eta\mu+6wL^2\beta^2)\delta_{k-1} + 2w\beta^2\|\nabla\Phi(x_{k-1})\|^2,$$
which, by telescoping and using eq. (23), finishes the proof.

Proof of Theorem 1
First, based on Lemma 2 in Ji et al. (2021), $\nabla\Phi(\cdot)$ is $L_\Phi$-Lipschitz, where $L_\Phi = L + \frac{2L^2+\rho M^2}{\mu} + \frac{2\rho LM+L^3}{\mu^2} + \frac{\rho L^2M}{\mu^3} = \Theta(\kappa^3)$. Then, we have


Φ(xk+1 ) ≤Φ(xk ) + h∇Φ(xk ), xk+1 − xk i + kxk+1 − xk k2
2
β  β 
≤Φ(xk ) − − β 2 LΦ k∇Φ(xk )k2 + + β 2 LΦ k∇Φ(xk ) − ∇Φ(x
b k )k
2
2 2
(i) β  β 
≤Φ(xk ) − − β 2 LΦ k∇Φ(xk )k2 + + β 2 LΦ 3L2 δ0 (1 − ηµ + 6wL2 β 2 )k
2 2
β  k−1
X
+ 6wL2 β 2 + β 2 LΦ (1 − ηµ + 6wL2 β 2 )j k∇Φ(xk−1−j )k2 , (24)
2 j=0

where (i) follows from Lemma 3, δ0 is defined in Lemma 3 and w is given by eq. (19). Then, telescoping
eq. (24) over k from 0 to K − 1, denoting x∗ = arg minx Φ(x) and using, we have
β  K−1
X
−β 2 LΦ k∇Φ(xk )k2
2
k=0

3L2 δ0 ( β2 + β 2 LΦ )
≤Φ(x0 ) − Φ(x∗ ) +
ηµ − 6wL2 β 2
β  K−1
X k−1X
+ 6wL2 β 2 + β 2 LΦ (1 − ηµ + 6wL2 β 2 )j k∇Φ(xk−1−j )k2
2
k=0 j=0

(i) 3L2 δ0 ( β2 + β 2 LΦ )  PK−1 k∇Φ(xj )k2


2 2 β

j=0

≤Φ(x0 ) − Φ(x ) + + 6wL β 2
+ β LΦ (25)
ηµ − 6wL2 β 2 2 ηµ − 6wL2 β 2
where (i) follows because bj . Rearranging eq. (25) yields
PK−1 Pk−1 PK−1 PK−1
k=0 j=0 aj bk−1−j ≤ k=0 ak j=0

K−1
1 6wL2 β 2 ( 21 + βLΦ )  1 X
− βLΦ − k∇Φ(xk )k2
2 ηµ − 6wL2 β 2 K
k=0

Φ(x0 ) − Φ(x ) 3L2 δ0 ( 12 + βLΦ ) 1



≤ + . (26)
βK ηµ − 6wL2 β 2 K
Note that (1 + λ)(1 − αµ)N (1 + 4r(1 + 1 2
ηµ )L ) ≤ 1 − ηµ and r > 1, we have
 1  1 − ηµ 3η 2 (1 + λ1 ) 1 − ηµ η 3 µ
3η 2 (1 − αµ)N 1 + ≤ 1 ≤ , (27)
λ 1 + λ 1 + 4r(1 + ηµ )L2 λ rL2

which, combined with the definitions of w and w


e given by eq. (19) and theorem 1, yields w ≤ w.
e Then, since
6wL2 β 2 e 2 β2
we set 6wL
e β ≤ 3 in Theorem 1, we have ηµ−6wL2 β 2 < ηµ−6wL
2 2 ηµ 6wL
e 2 β 2 < 2 , which, combined with eq. (26),
1

yields
1 3  1 K−1
X Φ(x0 ) − Φ(x∗ ) 9L2 δ0 ( 12 + βLΦ )
− βLΦ k∇Φ(xk )k2 ≤ + ,
4 2 K βK 2ηµK
k=0

which, in conjunction with β ≤ 12LΦ ,


1
yields
K−1
1 X 8(Φ(x0 ) − Φ(x∗ )) 21L2 δ0
k∇Φ(xk )k2 ≤ + . (28)
K βK ηµK
k=0

Based on the updates of $y$ and $v$, we have
$$\|y_0^N - y_0^*\|^2 \le \|y_0^0 - y_0^*\|^2 = \|y_0^*\|^2,$$
$$\|v_0^Q - v_0^*\| \le \|v_0^*\| + \|v_0^Q - (\nabla_y^2 g(x_0,y_0^N))^{-1}\nabla_y f(x_0,y_0^N)\| + \|(\nabla_y^2 g(x_0,y_0^N))^{-1}\nabla_y f(x_0,y_0^N)\| \overset{(i)}{\le} \frac{M}{\mu} + \frac{2}{\mu}\big(L\|y_0^*\| + M\big), \qquad (29)$$
where $(i)$ follows because of the initializations $v_0^0 = 0$ and $y_0^0 = 0$. Substituting eq. (29) into $\delta_0 := \big(1+\frac{\rho^2M^2}{L^2\mu^2}\big)\|y_0^N - y_0^*\|^2 + \|v_0^Q - v_0^*\|^2$ and eq. (28), we complete the proof.

E Proof of Corollary 1
In this case, first note that all choices of $\eta$, $\alpha$, $\lambda$ and $N$ satisfy the conditions in Theorem 1. First recall that $r = \frac{C_Q^2}{(\rho M/\mu + L)^2}$, where
$$C_Q = \frac{Q(1-\eta\mu)^{Q-1}\rho M\eta}{\mu} + \frac{1-(1-\eta\mu)^Q(1+\eta Q\mu)}{\mu^2}\rho M + \big(1-(1-\eta\mu)^Q\big)\frac{L}{\mu},$$
which, combined with $Q = \Theta(1)$ and $\eta = \Theta(\frac{1}{L})$, yields $C_Q^2 = \Theta(\kappa^2)$ and hence $r = \Theta(1)$. Note that
$$\widetilde w := \frac{(1-\eta\mu)\eta\mu}{3\lambda rL^2}\Big(1+\frac{\rho^2M^2}{L^2\mu^2}\Big)\frac{L^2}{\mu^2} + 4\Big(1+\frac{1}{\eta\mu}\Big)\frac{L^4}{\mu^2}\Big(1+\frac{\rho^2M^2}{L^2\mu^2}\Big)\Big(\frac{16(1-\eta\mu)^{2Q}}{\mu^2} + \frac{4(1-\eta\mu)\eta\mu}{3\lambda L^2}\Big),$$
which, combined with $\eta = \frac{1}{L}$ and $\lambda = 1$, yields $\widetilde w = \Theta(\kappa^3 + \kappa^7) = \Theta(\kappa^7)$. Based on the choice of $\beta$, we have
$$\beta = \min\Big\{\frac{1}{12L_\Phi},\ \sqrt{\frac{\eta\mu}{18L^2\widetilde w}}\Big\} = \Theta(\kappa^{-4}).$$
Then, we have the following convergence result:
$$\frac{1}{K}\sum_{k=0}^{K-1}\|\nabla\Phi(x_k)\|^2 = \mathcal{O}\Big(\frac{\kappa^4}{K} + \frac{\kappa^3}{K}\Big).$$
Then, to achieve an $\epsilon$-accurate stationary point, we have $K = \mathcal{O}(\kappa^4\epsilon^{-1})$, and hence we have the following complexity results.

• Gradient complexity: $\mathrm{Gc}(\epsilon) = K(N+2) = \widetilde{\mathcal{O}}(\kappa^5\epsilon^{-1})$.

• Matrix-vector product complexity: $\mathrm{MV}(\epsilon) = K + KQ = \widetilde{\mathcal{O}}(\kappa^4\epsilon^{-1})$.


Then, the proof is complete.

F Proof of Corollary 2
Based on the choices of $\alpha$, $\lambda$ and $\eta \le \frac{1}{\mu Q}$, recalling $r = \frac{C_Q^2}{(\rho M/\mu + L)^2}$ and using the inequality that $(1-x)^Q \ge 1-Qx$ for any $0 < x < 1$, we have
$$r \le \frac{\big(\frac{\rho M\eta Q}{\mu} + \eta^2Q^2\rho M + \eta QL\big)^2}{\big(\frac{\rho M}{\mu} + L\big)^2} \le 4\eta^2Q^2,$$
which, in conjunction with $\eta^2 \le \frac{\alpha\mu}{128Q^2L^2}$, yields
$$(1+\lambda)(1-\alpha\mu)^N\Big(1+4r\Big(1+\frac{1}{\eta\mu}\Big)L^2\Big) \le (1+\lambda)(1-\alpha\mu)^N\Big(1+16\Big(1+\frac{1}{\eta\mu}\Big)\eta^2Q^2L^2\Big) \le 1-\frac{\alpha\mu}{4} \le 1-\eta\mu,$$
and hence all requirements in Theorem 1 are satisfied. Also, similarly to the proof of Corollary 1, we have
r = Θ(1), which, combined with η = Θ(κ−2 ), yields w e = Θ(κ6 + κ9 ) = Θ(κ9 ), and hence
r
 1 ηµ
β = min , = Θ(κ−6 ).
12LΦ 18L2 we
Then, we have the following convergence result.
$$\frac{1}{K}\sum_{k=0}^{K-1}\|\nabla\Phi(x_k)\|^2 = O\Big(\frac{\kappa^6}{K} + \frac{\kappa^5\epsilon}{K}\Big).$$
Then, to achieve an $\epsilon$-accurate stationary point, we have $K = O(\kappa^6\epsilon^{-1})$, and hence we have the following complexity results.

• Gradient complexity: $\mathrm{Gc}(\epsilon) = 3K = \widetilde O(\kappa^6\epsilon^{-1})$.

• Matrix-vector product complexity: $\mathrm{MV}(\epsilon) = K + KQ = \widetilde O\big(\kappa^6\epsilon^{-1}\big)$.

Then, the proof is complete.
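The two bounds used above, the Bernoulli inequality $(1-x)^Q \ge 1 - Qx$ and the resulting estimate $r \le 4\eta^2 Q^2$, can be spot-checked numerically. The sketch below uses arbitrary illustrative constants and the reconstructed expression for $C_Q$, so it is a sanity check rather than a proof:

```python
mu, L, rho, M = 0.1, 1.0, 1.0, 1.0
Q = 5
eta = 0.5 / (mu * Q)              # satisfies eta <= 1/(mu*Q)

# Bernoulli inequality used above
x = eta * mu
assert (1 - x) ** Q >= 1 - Q * x

# C_Q as reconstructed in the proof of Corollary 1
C_Q = (Q * (1 - eta * mu) ** (Q - 1) * rho * M * eta / mu
       + (1 - (1 - eta * mu) ** Q * (1 + eta * Q * mu)) / mu ** 2 * rho * M
       + (L / mu) * (1 - (1 - eta * mu) ** Q))

r = C_Q ** 2 / (rho * M / mu + L) ** 2
assert r <= 4 * eta ** 2 * Q ** 2
print(r, 4 * eta ** 2 * Q ** 2)
```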

G Proof of Theorem 2
Using an approach similar to eq. (14) in Lemma 1, we have
$$\|v_k^Q - v_k^*\|^2 \le 2 C_Q^2\|y_k^* - y_k^N\|^2 + 2(1-\eta\mu)^{2Q}\|v_k^0 - v_k^*\|^2, \qquad (30)$$
where $C_Q$ is defined in Lemma 1. Using the zero initialization $v_k^0 = 0$ and based on the fact that $\|v_k^*\| \le \frac{M}{\mu}$, we obtain from eq. (30) that
$$\|v_k^Q - v_k^*\|^2 \le 2 C_Q^2\|y_k^* - y_k^N\|^2 + \frac{2(1-\eta\mu)^{2Q} M^2}{\mu^2},$$
which, in conjunction with eq. (23), yields
$$\|\widehat\nabla\Phi(x_k) - \nabla\Phi(x_k)\|^2 \le \Big(3L^2 + \frac{3\rho^2 M^2}{\mu^2} + 6L^2 C_Q^2\Big)\|y_k^N - y_k^*\|^2 + \frac{6L^2(1-\eta\mu)^{2Q} M^2}{\mu^2}. \qquad (31)$$
Then, substituting eq. (31) into Lemma 2, and using the definition of $\tau$ in Theorem 2, we have
$$\begin{aligned}
\|y_k^* - y_k^N\|^2 \le{}& (1-\alpha\mu)^N(1+\lambda)\|y_{k-1}^N - y_{k-1}^*\|^2 + 2(1-\alpha\mu)^N\Big(1+\frac{1}{\lambda}\Big)\frac{L^2}{\mu^2}\beta^2\|\nabla\Phi(x_{k-1})\|^2 \\
&+ 2(1-\alpha\mu)^N\Big(1+\frac{1}{\lambda}\Big)\frac{L^2}{\mu^2}\beta^2\|\nabla\Phi(x_{k-1}) - \widehat\nabla\Phi(x_{k-1})\|^2 \\
\le{}& \tau\|y_{k-1}^N - y_{k-1}^*\|^2 + 2(1-\alpha\mu)^N\Big(1+\frac{1}{\lambda}\Big)\frac{L^2}{\mu^2}\beta^2\|\nabla\Phi(x_{k-1})\|^2 \\
&+ 12(1-\alpha\mu)^N\Big(1+\frac{1}{\lambda}\Big)\frac{L^4 M^2}{\mu^4}\beta^2(1-\eta\mu)^{2Q}. \qquad (32)
\end{aligned}$$
Telescoping eq. (32) over $k$ yields
$$\begin{aligned}
\|y_k^* - y_k^N\|^2 \le{}& \tau^k\|y_0^* - y_0^N\|^2 + 2(1-\alpha\mu)^N\Big(1+\frac{1}{\lambda}\Big)\frac{L^2}{\mu^2}\beta^2\sum_{j=0}^{k-1}\tau^j\|\nabla\Phi(x_{k-1-j})\|^2 \\
&+ \frac{12}{1-\tau}(1-\alpha\mu)^N\Big(1+\frac{1}{\lambda}\Big)\frac{L^4 M^2}{\mu^4}\beta^2(1-\eta\mu)^{2Q},
\end{aligned}$$
which, in conjunction with eq. (31), $\|y_0^* - y_0^N\|^2 \le (1-\alpha\mu)^N\|y_0 - y_0^*\|^2$, the notation of $w$ in Theorem 2 and $\delta_0 = 3\big(L^2 + \frac{\rho^2 M^2}{\mu^2} + 2L^2 C_Q^2\big)(1-\alpha\mu)^N\|y_0^* - y_0\|^2$, yields
$$\|\widehat\nabla\Phi(x_k) - \nabla\Phi(x_k)\|^2 \le \delta_0\tau^k + \frac{6L^2(1-\eta\mu)^{2Q} M^2}{\mu^2} + w\beta^2\sum_{j=0}^{k-1}\tau^j\|\nabla\Phi(x_{k-1-j})\|^2 + \frac{6wL^2 M^2}{(1-\tau)\mu^2}(1-\eta\mu)^{2Q}\beta^2. \qquad (33)$$

Then, using an approach similar to eq. (24), we have
$$\begin{aligned}
\Phi(x_{k+1}) \le{}& \Phi(x_k) - \Big(\frac{\beta}{2} - \beta^2 L_\Phi\Big)\|\nabla\Phi(x_k)\|^2 + \Big(\frac{\beta}{2} + \beta^2 L_\Phi\Big)\|\nabla\Phi(x_k) - \widehat\nabla\Phi(x_k)\|^2 \\
\overset{(i)}{\le}{}& \Phi(x_k) - \Big(\frac{\beta}{2} - \beta^2 L_\Phi\Big)\|\nabla\Phi(x_k)\|^2 + \Big(\frac{\beta}{2} + \beta^2 L_\Phi\Big)\delta_0\tau^k \\
&+ w\beta^2\Big(\frac{\beta}{2} + \beta^2 L_\Phi\Big)\sum_{j=0}^{k-1}\tau^j\|\nabla\Phi(x_{k-1-j})\|^2 + \frac{6L^2 M^2}{\mu^2}\Big(\frac{\beta}{2} + \beta^2 L_\Phi\Big)(1-\eta\mu)^{2Q} \\
&+ \Big(\frac{\beta}{2} + \beta^2 L_\Phi\Big)\frac{6wL^2 M^2}{(1-\tau)\mu^2}(1-\eta\mu)^{2Q}\beta^2, \qquad (34)
\end{aligned}$$

where $(i)$ follows from eq. (33). Then, rearranging the above eq. (34), we have
$$\begin{aligned}
\Big(\frac12 - \beta L_\Phi\Big)\frac{1}{K}\sum_{k=0}^{K-1}\|\nabla\Phi(x_k)\|^2 \le{}& \frac{\Phi(x_0) - \Phi(x^*)}{\beta K} + \frac{1}{K}\Big(\frac12 + \beta L_\Phi\Big)\frac{\delta_0}{1-\tau} \\
&+ w\beta^2\Big(\frac12 + \beta L_\Phi\Big)\frac{1}{K}\sum_{k=0}^{K-1}\sum_{j=0}^{k-1}\tau^j\|\nabla\Phi(x_{k-1-j})\|^2 \\
&+ \frac{6L^2 M^2}{\mu^2}\Big(\frac12 + \beta L_\Phi\Big)(1-\eta\mu)^{2Q} + \Big(\frac12 + \beta L_\Phi\Big)\frac{6wL^2 M^2}{(1-\tau)\mu^2}(1-\eta\mu)^{2Q}\beta^2,
\end{aligned}$$
which, in conjunction with the inequality $\sum_{k=0}^{K-1}\sum_{j=0}^{k-1} a_j b_{k-1-j} \le \sum_{k=0}^{K-1} a_k \sum_{j=0}^{K-1} b_j$ (for nonnegative $a_j, b_j$), yields
$$\begin{aligned}
\Big(\frac12 - \beta L_\Phi - w\beta^2\Big(\frac12 + \beta L_\Phi\Big)\frac{1}{1-\tau}\Big)\frac{1}{K}\sum_{k=0}^{K-1}\|\nabla\Phi(x_k)\|^2 \le{}& \frac{\Phi(x_0) - \Phi(x^*)}{\beta K} + \frac{1}{K}\Big(\frac12 + \beta L_\Phi\Big)\frac{\delta_0}{1-\tau} \\
&+ \frac{6L^2 M^2}{\mu^2}\Big(\frac12 + \beta L_\Phi\Big)(1-\eta\mu)^{2Q} \\
&+ \Big(\frac12 + \beta L_\Phi\Big)\frac{6wL^2 M^2}{(1-\tau)\mu^2}(1-\eta\mu)^{2Q}\beta^2. \qquad (35)
\end{aligned}$$

Using $\beta L_\Phi + w\beta^2\big(\frac12 + \beta L_\Phi\big)\frac{1}{1-\tau} \le \frac14$ in the above eq. (35) yields
$$\frac{1}{K}\sum_{k=0}^{K-1}\|\nabla\Phi(x_k)\|^2 \le \frac{4(\Phi(x_0) - \Phi(x^*))}{\beta K} + \frac{3}{K}\,\frac{\delta_0}{1-\tau} + \frac{27 L^2 M^2}{\mu^2}(1-\eta\mu)^{2Q},$$
which finishes the proof.
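The exchange-of-summation inequality $\sum_{k=0}^{K-1}\sum_{j=0}^{k-1} a_j b_{k-1-j} \le \sum_{k=0}^{K-1} a_k \sum_{j=0}^{K-1} b_j$ invoked above (and again in the proof of Theorem 3) holds for arbitrary nonnegative sequences, since every product on the left occurs at most once among the products on the right. A minimal randomized check, with random data chosen purely for illustration:

```python
import random

random.seed(0)
K = 50
a = [random.random() for _ in range(K)]   # plays the role of tau^j
b = [random.random() for _ in range(K)]   # plays the role of ||grad Phi(x_j)||^2

lhs = sum(a[j] * b[k - 1 - j] for k in range(K) for j in range(k))
rhs = sum(a) * sum(b)
assert lhs <= rhs
print(lhs, rhs)
```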

H Proof of Corollary 3
Note that we choose $N = c_n \kappa \ln\frac{\kappa}{\epsilon}$ and $Q = c_q \kappa \ln\frac{\kappa}{\epsilon}$. Then, for proper constants $c_n$ and $c_q$, we have $\beta L_\Phi < \frac18$, $C_Q = \Theta(\kappa^2)$, $\tau = \Theta(1)$ and $w\beta^2\big(\frac12 + \beta L_\Phi\big)\frac{1}{1-\tau} < \frac18$. Then, we have
$$\frac{1}{K}\sum_{k=0}^{K-1}\|\nabla\Phi(x_k)\|^2 = O\Big(\frac{\kappa^3}{K} + \epsilon\Big).$$
To achieve an $\epsilon$-accurate stationary point, the complexity is given by

• Gradient complexity: $\mathrm{Gc}(\epsilon) = K(N+2) = \widetilde O(\kappa^4\epsilon^{-1})$.

• Matrix-vector product complexity: $\mathrm{MV}(\epsilon) = K + KQ = \widetilde O\big(\kappa^4\epsilon^{-1}\big)$.

The proof is then complete.

I Proof of Corollary 4
Choose $Q = c_q \kappa \ln\frac{\kappa}{\epsilon}$. Then, for a proper selection of the constant $c_q$, we have $C_Q = \Theta(\kappa^2)$. To guarantee $6\big(1+\frac{1}{\lambda}\big)\frac{L^2}{\mu^2}\big(L^2 + \frac{\rho^2 M^2}{\mu^2} + 2L^2 C_Q^2\big)\beta^2 \le \frac{\alpha\mu}{4}$, we choose $\beta = \Theta(\kappa^{-4})$, which implies $1 - \tau = \Theta(\alpha\mu)$. In addition, we have $w = \Theta(\kappa^7)$ and hence $\delta_0/(1-\tau) = O(\kappa^5)$. Then, we have
$$\frac{1}{K}\sum_{k=0}^{K-1}\|\nabla\Phi(x_k)\|^2 = O\Big(\frac{\kappa^5}{K} + \frac{\kappa^4\epsilon}{K} + \epsilon\Big).$$
Then, to achieve an $\epsilon$-accurate stationary point, the complexity is given by

• Gradient complexity: $\mathrm{Gc}(\epsilon) = K(N+2) = \widetilde O(\kappa^5\epsilon^{-1})$.

• Matrix-vector product complexity: $\mathrm{MV}(\epsilon) = K + KQ = \widetilde O\big(\kappa^6\epsilon^{-1}\big)$.

Then, the proof is complete.

J Proof of Theorem 3
We first provide two useful lemmas, which are then used to prove Theorem 3.

Lemma 4. Suppose Assumptions 1, 2 and 3 are satisfied. Choose the inner stepsize $\alpha < \frac{1}{L}$. Then, we have
$$\Big\|\frac{\partial y_k^N}{\partial x_k} - \frac{\partial y^*(x_k)}{\partial x_k}\Big\| \le (1-\alpha\mu)^N\Big\|\frac{\partial y^*(x_k)}{\partial x_k}\Big\| + w_N\|y_k^0 - y^*(x_k)\|,$$
where we define
$$w_N = \alpha\Big(\rho + \frac{\alpha\rho L\big(1-(1-\alpha\mu)^{\frac N2}\big)}{1-\sqrt{1-\alpha\mu}}\Big)(1-\alpha\mu)^{\frac N2 - 1}\,\frac{1-(1-\alpha\mu)^{\frac N2}}{1-\sqrt{1-\alpha\mu}}. \qquad (36)$$

Proof. Based on the updates of the ITD-based method in Algorithm 2, we have, for $j = 1, \dots, N$,
$$\frac{\partial y_k^j}{\partial x_k} = \frac{\partial y_k^{j-1}}{\partial x_k} - \alpha\nabla_x\nabla_y g(x_k, y_k^{j-1}) - \alpha\frac{\partial y_k^{j-1}}{\partial x_k}\nabla_y^2 g(x_k, y_k^{j-1}),$$
which, in conjunction with the fact that $\frac{\partial y_k^0}{\partial x_k} = 0$, yields
$$\frac{\partial y_k^N}{\partial x_k} = -\alpha\sum_{j=0}^{N-1}\nabla_x\nabla_y g(x_k, y_k^j)\prod_{i=j+1}^{N-1}\big(I - \alpha\nabla_y^2 g(x_k, y_k^i)\big). \qquad (37)$$
Then, based on the optimality condition of $y^*(x)$ and using the chain rule, we have
$$\nabla_x\nabla_y g(x_k, y^*(x_k)) + \frac{\partial y^*(x_k)}{\partial x_k}\nabla_y^2 g(x_k, y^*(x_k)) = 0,$$
which further yields
$$\begin{aligned}
\frac{\partial y^*(x_k)}{\partial x_k} ={}& \frac{\partial y^*(x_k)}{\partial x_k}\prod_{j=0}^{N-1}\big(I - \alpha\nabla_y^2 g(x_k, y^*(x_k))\big) \\
&- \alpha\sum_{j=0}^{N-1}\nabla_x\nabla_y g(x_k, y^*(x_k))\prod_{i=j+1}^{N-1}\big(I - \alpha\nabla_y^2 g(x_k, y^*(x_k))\big). \qquad (38)
\end{aligned}$$
For the case where $N = 1$, based on eq. (37) and eq. (38), we have
$$\Big\|\frac{\partial y_k^N}{\partial x_k} - \frac{\partial y^*(x_k)}{\partial x_k}\Big\| \le (1-\alpha\mu)\Big\|\frac{\partial y^*(x_k)}{\partial x_k}\Big\| + \alpha\rho\|y_k^0 - y^*(x_k)\|. \qquad (39)$$
Next, we prove the case where $N \ge 2$. By subtracting eq. (38) from eq. (37), we have
$$\Big\|\frac{\partial y_k^N}{\partial x_k} - \frac{\partial y^*(x_k)}{\partial x_k}\Big\| \le (1-\alpha\mu)^N\Big\|\frac{\partial y^*(x_k)}{\partial x_k}\Big\| + \alpha\sum_{j=0}^{N-1}\underbrace{\Big\|\nabla_x\nabla_y g(x_k, y_k^j)\prod_{i=j+1}^{N-1}\big(I - \alpha\nabla_y^2 g(x_k, y_k^i)\big) - \nabla_x\nabla_y g(x_k, y^*(x_k))\prod_{i=j+1}^{N-1}\big(I - \alpha\nabla_y^2 g(x_k, y^*(x_k))\big)\Big\|}_{\Delta_j}, \qquad (40)$$
where we define $\Delta_j$ for notational convenience. Note that $\Delta_j$ is upper-bounded by
$$\Delta_j \le (1-\alpha\mu)^{N-1-j}\rho\|y_k^j - y^*(x_k)\| + L\underbrace{\Big\|\prod_{i=j+1}^{N-1}\big(I - \alpha\nabla_y^2 g(x_k, y_k^i)\big) - \prod_{i=j+1}^{N-1}\big(I - \alpha\nabla_y^2 g(x_k, y^*(x_k))\big)\Big\|}_{M_{j+1}}. \qquad (41)$$

For notational simplicity, we define the quantity $M_{j+1}$ in eq. (41) for the case where the product index starts from $j+1$. Next we upper-bound $M_{j+1}$ via the following steps.
$$\begin{aligned}
M_{j+1} &\le (1-\alpha\mu)M_{j+2} + (1-\alpha\mu)^{N-j-2}\alpha\rho\|y_k^{j+1} - y^*(x_k)\| \\
&\overset{(i)}{\le} (1-\alpha\mu)M_{j+2} + (1-\alpha\mu)^{N-j-2}\alpha\rho(1-\alpha\mu)^{\frac{j+1}{2}}\|y_k^0 - y^*(x_k)\| \\
&\le (1-\alpha\mu)M_{j+2} + (1-\alpha\mu)^{N-\frac j2-\frac32}\alpha\rho\|y_k^0 - y^*(x_k)\|, \qquad (42)
\end{aligned}$$
where $(i)$ follows by applying gradient descent to the strongly-convex and smooth function $g(x_k, \cdot)$. Telescoping eq. (42) further yields
$$\begin{aligned}
M_{j+1} &\le (1-\alpha\mu)^{N-j-2}M_{N-1} + \sum_{i=j+2}^{N-1}(1-\alpha\mu)^{i-j-2}(1-\alpha\mu)^{N-\frac i2-\frac32}\alpha\rho\|y_k^0 - y^*(x_k)\| \\
&\le (1-\alpha\mu)^{N-j-2}M_{N-1} + \sum_{i=0}^{N-j-3}(1-\alpha\mu)^i(1-\alpha\mu)^{N-\frac j2-\frac i2-\frac32}\alpha\rho\|y_k^0 - y^*(x_k)\| \\
&\le (1-\alpha\mu)^{N-j-2}\alpha\rho(1-\alpha\mu)^{\frac{N-1}{2}}\|y_k^0 - y^*(x_k)\| + \sum_{i=0}^{N-j-3}(1-\alpha\mu)^{N-\frac j2+\frac i2-\frac32}\alpha\rho\|y_k^0 - y^*(x_k)\| \\
&\le \sum_{i=0}^{N-j-2}(1-\alpha\mu)^{N-\frac j2+\frac i2-\frac32}\alpha\rho\|y_k^0 - y^*(x_k)\|,
\end{aligned}$$
which, in conjunction with $\sum_{i=0}^{N-j-2}(1-\alpha\mu)^{\frac i2} \le \frac{1-(1-\alpha\mu)^{\frac N2}}{1-\sqrt{1-\alpha\mu}}$, yields
$$M_{j+1} \le \frac{\alpha\rho\big(1-(1-\alpha\mu)^{\frac N2}\big)}{1-\sqrt{1-\alpha\mu}}(1-\alpha\mu)^{N-\frac j2-\frac32}\|y_k^0 - y^*(x_k)\|. \qquad (43)$$
Then, substituting eq. (43) into eq. (41) yields
$$\Delta_j \le (1-\alpha\mu)^{N-1-\frac j2}\rho\|y_k^0 - y^*(x_k)\| + \frac{\alpha\rho L\big(1-(1-\alpha\mu)^{\frac N2}\big)}{1-\sqrt{1-\alpha\mu}}(1-\alpha\mu)^{N-\frac32-\frac j2}\|y_k^0 - y^*(x_k)\|. \qquad (44)$$
Summing up eq. (44) over $j$ from $0$ to $N-1$ yields
$$\sum_{j=0}^{N-1}\Delta_j \le \Big(\rho + \frac{\alpha\rho L\big(1-(1-\alpha\mu)^{\frac N2}\big)}{1-\sqrt{1-\alpha\mu}}\Big)\|y_k^0 - y^*(x_k)\|(1-\alpha\mu)^{\frac N2-1}\,\frac{1-(1-\alpha\mu)^{\frac N2}}{1-\sqrt{1-\alpha\mu}}. \qquad (45)$$
Then, substituting eq. (45) into eq. (40) and using the notation $w_N$ in eq. (36), we have
$$\Big\|\frac{\partial y_k^N}{\partial x_k} - \frac{\partial y^*(x_k)}{\partial x_k}\Big\| \le (1-\alpha\mu)^N\Big\|\frac{\partial y^*(x_k)}{\partial x_k}\Big\| + w_N\|y_k^0 - y^*(x_k)\|. \qquad (46)$$
Combining eq. (39) (i.e., the $N = 1$ case) and eq. (46) (i.e., the $N \ge 2$ case) completes the proof.
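For a quadratic inner objective the Hessian is constant, so $\rho = 0$ and Lemma 4 collapses to $\big\|\frac{\partial y_k^N}{\partial x_k} - \frac{\partial y^*(x_k)}{\partial x_k}\big\| \le (1-\alpha\mu)^N\big\|\frac{\partial y^*(x_k)}{\partial x_k}\big\|$. The sketch below verifies this special case on $g(x, y) = \frac12 y^\top A y - x^\top y$ with a diagonal $A$ (the numeric values are illustrative assumptions), where both Jacobians are available in closed form:

```python
# Inner objective g(x, y) = 0.5 * y^T A y - x^T y with diagonal A = diag(a):
# rho = 0, L = max(a), mu = min(a).  GD step: y^t = y^{t-1} - alpha (A y^{t-1} - x).
a = [2.0, 0.5]                    # eigenvalues of A: L = 2.0, mu = 0.5
alpha, N = 0.4, 15                # alpha < 1/L
mu = min(a)

# Closed forms (per coordinate, since everything is diagonal):
#   dy*/dx = A^{-1};  dy_N/dx = alpha * sum_{t<N} (1 - alpha a)^t = (1-(1-alpha a)^N)/a
Jstar = [1.0 / ai for ai in a]
JN = [(1.0 - (1.0 - alpha * ai) ** N) / ai for ai in a]

err = max(abs(jn - js) for jn, js in zip(JN, Jstar))   # operator norm of the difference
bound = (1.0 - alpha * mu) ** N * max(Jstar)           # (1-alpha*mu)^N * ||dy*/dx||
assert err <= bound + 1e-12
print(err, bound)
```

In this diagonal instance the bound is actually tight: the slow coordinate (eigenvalue $\mu$) attains it with equality.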

Lemma 5. Suppose Assumptions 1, 2, 3 and 4 hold. Define
$$\lambda_N = \frac{4M^2 w_N^2 + 4\big(1-\frac14\alpha\mu\big)L^2(1+\alpha L N)^2}{1 - \frac14\alpha\mu - (1-\alpha\mu)^N\big(1+\frac12\alpha\mu\big)}$$
and $w = \big(1+\frac{2}{\alpha\mu}\big)\frac{L^2}{\mu^2}(1-\alpha\mu)^N\lambda_N + \frac{4M^2 w_N^2 L^2}{\mu^2}$, where $w_N$ is given in eq. (36). Let $\delta_k = \|\widehat\nabla\Phi(x_k) - \nabla\Phi(x_k)\|^2 + \big(\lambda_N - 4L^2(1+\alpha L N)^2\big)\|y_k^N - y^*(x_k)\|^2$ denote the approximation error at the $k^{th}$ iteration. Choose stepsizes $\beta^2 \le \frac{1-\frac14\alpha\mu}{2w}$ and $\alpha \le \frac{1}{2L}$. Then, we have
$$\delta_k \le \Big(1-\frac14\alpha\mu\Big)^k\delta_0 + J_k(1-\alpha\mu)^{2N} + 2w\beta^2\sum_{j=0}^{k-1}\Big(1-\frac14\alpha\mu\Big)^{k-1-j}\|\nabla\Phi(x_j)\|^2,$$
where $J_k = \sum_{j=0}^{k-1}\big(1-\frac14\alpha\mu\big)^j 4M^2\big\|\frac{\partial y^*(x_{k-j})}{\partial x_{k-j}}\big\|^2$ is related to the Jacobian matrix of the response function.

Proof. First note that using the chain rule, $\widehat\nabla\Phi(x_k)$ and $\nabla\Phi(x_k)$ can be written as
$$\begin{aligned}
\widehat\nabla\Phi(x_k) &= \nabla_x f(x_k, y_k^N) + \frac{\partial y_k^N}{\partial x_k}\nabla_y f(x_k, y_k^N), \\
\nabla\Phi(x_k) &= \nabla_x f(x_k, y^*(x_k)) + \frac{\partial y^*(x_k)}{\partial x_k}\nabla_y f(x_k, y^*(x_k)). \qquad (47)
\end{aligned}$$
Subtracting the two equations in eq. (47), we have
$$\|\widehat\nabla\Phi(x_k) - \nabla\Phi(x_k)\| \le L\|y_k^N - y^*(x_k)\| + L\|y_k^N - y^*(x_k)\|\Big\|\frac{\partial y_k^N}{\partial x_k}\Big\| + M\Big\|\frac{\partial y_k^N}{\partial x_k} - \frac{\partial y^*(x_k)}{\partial x_k}\Big\|, \qquad (48)$$
which, in conjunction with $\big\|\frac{\partial y_k^N}{\partial x_k}\big\| = \big\|\alpha\sum_{j=0}^{N-1}\nabla_x\nabla_y g(x_k, y_k^j)\prod_{i=j+1}^{N-1}(I - \alpha\nabla_y^2 g(x_k, y_k^i))\big\| \le \alpha L\sum_{j=0}^{N-1}(1-\alpha\mu)^{N-1-j} \le \alpha L N$, yields
$$\begin{aligned}
\|\widehat\nabla\Phi(x_k) - \nabla\Phi(x_k)\| &\le L\big(1+\alpha L N\big)\|y_k^N - y^*(x_k)\| + M\Big\|\frac{\partial y^*(x_k)}{\partial x_k} - \frac{\partial y_k^N}{\partial x_k}\Big\| \\
&\overset{(i)}{\le} \big(L + \alpha L^2 N\big)\|y_k^N - y^*(x_k)\| + M(1-\alpha\mu)^N\Big\|\frac{\partial y^*(x_k)}{\partial x_k}\Big\| + M w_N\|y_k^0 - y^*(x_k)\|, \qquad (49)
\end{aligned}$$
where $(i)$ follows from Lemma 4. Using $\|y_k^0 - y^*(x_k)\| = \|y_{k-1}^N - y^*(x_k)\| \le \|y_{k-1}^N - y^*(x_{k-1})\| + \frac{L}{\mu}\|x_k - x_{k-1}\|$ and taking the square on both sides of eq. (49), we have
$$\begin{aligned}
\|\widehat\nabla\Phi(x_k) - \nabla\Phi(x_k)\|^2 \le{}& 4L^2\big(1+\alpha L N\big)^2\|y_k^N - y^*(x_k)\|^2 + 4M^2(1-\alpha\mu)^{2N}\Big\|\frac{\partial y^*(x_k)}{\partial x_k}\Big\|^2 \\
&+ 4M^2 w_N^2\|y_{k-1}^N - y^*(x_{k-1})\|^2 + 4M^2 w_N^2\frac{L^2}{\mu^2}\|x_k - x_{k-1}\|^2. \qquad (50)
\end{aligned}$$
In the meanwhile, based on Lemma 2, we have
$$\|y_k^N - y^*(x_k)\|^2 \le (1-\alpha\mu)^N\Big(1+\frac12\alpha\mu\Big)\|y_{k-1}^N - y^*(x_{k-1})\|^2 + \Big(1+\frac{2}{\alpha\mu}\Big)(1-\alpha\mu)^N\frac{L^2}{\mu^2}\|x_{k-1} - x_k\|^2. \qquad (51)$$
Based on $\alpha \le \frac{1}{2L}$ and the form of $\lambda_N$ in Lemma 5, we have $\lambda_N > 4L^2(1+\alpha L N)^2 > 0$. Then, multiplying eq. (51) by $\lambda_N$ and adding eq. (50), we have
$$\begin{aligned}
&\|\widehat\nabla\Phi(x_k) - \nabla\Phi(x_k)\|^2 + \big(\lambda_N - 4L^2(1+\alpha L N)^2\big)\|y_k^N - y^*(x_k)\|^2 \\
&\le \Big(1-\frac14\alpha\mu\Big)\big(\lambda_N - 4L^2(1+\alpha L N)^2\big)\|y_{k-1}^N - y^*(x_{k-1})\|^2 + 4M^2(1-\alpha\mu)^{2N}\Big\|\frac{\partial y^*(x_k)}{\partial x_k}\Big\|^2 \\
&\quad + \Big(\Big(1+\frac{2}{\alpha\mu}\Big)(1-\alpha\mu)^N\frac{L^2}{\mu^2}\lambda_N + 4M^2 w_N^2\frac{L^2}{\mu^2}\Big)\|x_k - x_{k-1}\|^2, \qquad (52)
\end{aligned}$$
which, in conjunction with $\|x_k - x_{k-1}\|^2 = \beta^2\|\widehat\nabla\Phi(x_{k-1})\|^2 \le 2\beta^2\|\widehat\nabla\Phi(x_{k-1}) - \nabla\Phi(x_{k-1})\|^2 + 2\beta^2\|\nabla\Phi(x_{k-1})\|^2$ and using the notation of $w$ in Lemma 5, yields
$$\begin{aligned}
&\|\widehat\nabla\Phi(x_k) - \nabla\Phi(x_k)\|^2 + \big(\lambda_N - 4L^2(1+\alpha L N)^2\big)\|y_k^N - y^*(x_k)\|^2 \\
&\le \Big(1-\frac14\alpha\mu\Big)\big(\lambda_N - 4L^2(1+\alpha L N)^2\big)\|y_{k-1}^N - y^*(x_{k-1})\|^2 + 4M^2(1-\alpha\mu)^{2N}\Big\|\frac{\partial y^*(x_k)}{\partial x_k}\Big\|^2 \\
&\quad + 2\beta^2 w\|\widehat\nabla\Phi(x_{k-1}) - \nabla\Phi(x_{k-1})\|^2 + 2\beta^2 w\|\nabla\Phi(x_{k-1})\|^2. \qquad (53)
\end{aligned}$$
Using $\beta^2 \le \frac{1-\frac14\alpha\mu}{2w}$ and the notation $\delta_k = \|\widehat\nabla\Phi(x_k) - \nabla\Phi(x_k)\|^2 + \big(\lambda_N - 4L^2(1+\alpha L N)^2\big)\|y_k^N - y^*(x_k)\|^2$ in the above eq. (53) yields
$$\delta_k \le 4M^2(1-\alpha\mu)^{2N}\Big\|\frac{\partial y^*(x_k)}{\partial x_k}\Big\|^2 + \Big(1-\frac14\alpha\mu\Big)\delta_{k-1} + 2w\beta^2\|\nabla\Phi(x_{k-1})\|^2. \qquad (54)$$
Telescoping the above eq. (54) over $k$ yields
$$\delta_k \le \Big(1-\frac14\alpha\mu\Big)^k\delta_0 + \sum_{j=0}^{k-1}\Big(1-\frac14\alpha\mu\Big)^j 4M^2(1-\alpha\mu)^{2N}\Big\|\frac{\partial y^*(x_{k-j})}{\partial x_{k-j}}\Big\|^2 + 2w\beta^2\sum_{j=0}^{k-1}\Big(1-\frac14\alpha\mu\Big)^{k-1-j}\|\nabla\Phi(x_j)\|^2,$$
which, in conjunction with the definition of $J_k$, finishes the proof.
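The telescoping step used in Lemma 5 (and earlier in Theorem 2) is the generic fact that a recursion $\delta_k \le (1-q)\delta_{k-1} + u_k$ unrolls to $\delta_k \le (1-q)^k\delta_0 + \sum_{j=0}^{k-1}(1-q)^{k-1-j}u_j$. A small check with arbitrary nonnegative inputs, run with equality so the unrolled form should match exactly:

```python
import random

random.seed(1)
q, k_max = 0.25, 40
u = [random.random() for _ in range(k_max)]     # arbitrary nonnegative "error" inputs

deltas = [1.0]                                  # delta_0 = 1
for k in range(k_max):
    deltas.append((1 - q) * deltas[-1] + u[k])  # the recursion, taken with equality

for k in range(k_max + 1):
    unrolled = (1 - q) ** k * deltas[0] + sum(
        (1 - q) ** (k - 1 - j) * u[j] for j in range(k))
    assert abs(deltas[k] - unrolled) < 1e-9
print("telescoped form matches for all", k_max + 1, "iterates")
```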

Proof of Theorem 3

Choose the same stepsizes $\alpha$ and $\beta$ as in Lemma 5. Then, based on the smoothness of $\Phi(\cdot)$ (i.e., Lemma 2 in Ji et al. (2021)), we have
$$\begin{aligned}
\Phi(x_{k+1}) \le{}& \Phi(x_k) - \Big(\frac{\beta}{2} - \beta^2 L_\Phi\Big)\|\nabla\Phi(x_k)\|^2 + \Big(\frac{\beta}{2} + \beta^2 L_\Phi\Big)\|\nabla\Phi(x_k) - \widehat\nabla\Phi(x_k)\|^2 \\
\overset{(i)}{\le}{}& \Phi(x_k) - \Big(\frac{\beta}{2} - \beta^2 L_\Phi\Big)\|\nabla\Phi(x_k)\|^2 + \Big(\frac{\beta}{2} + \beta^2 L_\Phi\Big)\delta_0\Big(1-\frac14\alpha\mu\Big)^k \\
&+ 2\Big(\frac{\beta}{2} + \beta^2 L_\Phi\Big)w\beta^2\sum_{j=0}^{k-1}\Big(1-\frac14\alpha\mu\Big)^{k-1-j}\|\nabla\Phi(x_j)\|^2 + \Big(\frac{\beta}{2} + \beta^2 L_\Phi\Big)J_k(1-\alpha\mu)^{2N}, \qquad (55)
\end{aligned}$$
where $(i)$ follows from Lemma 5 with $\delta_k \ge \|\widehat\nabla\Phi(x_k) - \nabla\Phi(x_k)\|^2$. Then, telescoping the above eq. (55) over $k$ from $0$ to $K-1$ yields

$$\begin{aligned}
\Big(\frac{\beta}{2} - \beta^2 L_\Phi\Big)\sum_{k=0}^{K-1}\|\nabla\Phi(x_k)\|^2 \le{}& \Phi(x_0) - \Phi(x^*) + \frac{4\beta\big(\frac12 + \beta L_\Phi\big)\delta_0}{\alpha\mu} + \sum_{k=0}^{K-1}J_k\,\beta\Big(\frac12 + \beta L_\Phi\Big)(1-\alpha\mu)^{2N} \\
&+ 2\Big(\frac{\beta}{2} + \beta^2 L_\Phi\Big)w\beta^2\sum_{k=0}^{K-1}\sum_{j=0}^{k-1}\Big(1-\frac14\alpha\mu\Big)^{k-1-j}\|\nabla\Phi(x_j)\|^2, \qquad (56)
\end{aligned}$$
which, combined with $\sum_{k=0}^{K-1}\sum_{j=0}^{k-1}\big(1-\frac14\alpha\mu\big)^{k-1-j}\|\nabla\Phi(x_j)\|^2 \le \frac{4}{\alpha\mu}\sum_{j=0}^{K-1}\|\nabla\Phi(x_j)\|^2$, yields
$$\begin{aligned}
\Big(\frac12 - \beta L_\Phi - \frac{8}{\alpha\mu}\Big(\frac12 + \beta L_\Phi\Big)w\beta^2\Big)\frac{1}{K}\sum_{k=0}^{K-1}\|\nabla\Phi(x_k)\|^2 \le{}& \frac{\Phi(x_0) - \Phi(x^*)}{\beta K} + \frac{4\big(\frac12 + \beta L_\Phi\big)\delta_0}{\alpha\mu K} \\
&+ \Big(\frac12 + \beta L_\Phi\Big)(1-\alpha\mu)^{2N}\frac{1}{K}\sum_{k=0}^{K-1}J_k. \qquad (57)
\end{aligned}$$

Based on the definition of $J_k$ in Lemma 5, we have
$$\sum_{k=0}^{K-1}J_k = \sum_{k=0}^{K-1}\sum_{j=0}^{k-1}\Big(1-\frac14\alpha\mu\Big)^j 4M^2\Big\|\frac{\partial y^*(x_{k-j})}{\partial x_{k-j}}\Big\|^2 \overset{(i)}{\le} \frac{16M^2}{\alpha\mu}\sum_{k=0}^{K-1}\Big\|\frac{\partial y^*(x_k)}{\partial x_k}\Big\|^2, \qquad (58)$$
where $(i)$ follows from the inequality $\sum_{k=0}^{K-1}\sum_{j=0}^{k-1}a_j b_{k-1-j} \le \sum_{k=0}^{K-1}a_k\sum_{j=0}^{K-1}b_j$. Choose $\beta$ such that $\beta L_\Phi + \frac{8}{\alpha\mu}\big(\frac12 + \beta L_\Phi\big)w\beta^2 < \frac14$. In addition, based on eq. (49), recalling the definition $\delta_0 = \|\widehat\nabla\Phi(x_0) - \nabla\Phi(x_0)\|^2 + \big(\lambda_N - 4L^2(1+\alpha L N)^2\big)\|y_0^N - y^*(x_0)\|^2$ and using the fact that $\big\|\frac{\partial y^*(x_0)}{\partial x_0}\big\| \le \frac{L}{\mu}$, we have
$$\delta_0 \le O\Big(\big(N^2(1-\alpha\mu)^N + w_N^2 + \lambda_N(1-\alpha\mu)^N\big)\|y_0 - y^*(x_0)\|^2 + \frac{L^2 M^2}{\mu^2}(1-\alpha\mu)^{2N}\Big). \qquad (59)$$
Recall the definition $\tau = N^2(1-\alpha\mu)^N + w_N^2 + \lambda_N(1-\alpha\mu)^N$. Then, substituting eq. (58) and eq. (59) into eq. (57) yields
$$\frac{1}{K}\sum_{k=0}^{K-1}\|\nabla\Phi(x_k)\|^2 \le O\Big(\frac{\Phi(x_0) - \Phi(x^*)}{\beta K} + \frac{\tau\|y_0 - y^*(x_0)\|^2}{\mu^2 K} + \frac{(1-\alpha\mu)^{2N}}{\mu^3 K} + \frac{M^2}{\alpha\mu}(1-\alpha\mu)^{2N}\frac{1}{K}\sum_{k=0}^{K-1}\Big\|\frac{\partial y^*(x_k)}{\partial x_k}\Big\|^2\Big), \qquad (60)$$
which, in conjunction with $\big\|\frac{\partial y^*(x)}{\partial x}\big\| \le \frac{L}{\mu}$, completes the proof.

K Proof of Corollary 5

Based on the choice of $\alpha$ and $N$ and using $\epsilon < 1$, we have $w = \Theta(\kappa^2)$ and
$$\tau = \frac{(\ln\frac{\kappa}{\epsilon})^2\sqrt{\epsilon}}{\kappa^2} + \sqrt{\epsilon} + \frac{\epsilon + \kappa^2(\ln\frac{\kappa}{\epsilon})^2}{\kappa^4} = O(1), \qquad (61)$$
which, in conjunction with $\beta = \min\big\{\sqrt{\frac{\alpha\mu}{40w}},\ \sqrt{\frac{1-\frac14\alpha\mu}{2w}},\ \frac{1}{8L_\Phi}\big\}$, yields $\beta = \Theta(\kappa^{-3})$. Substituting eq. (61) and $\beta = \Theta(\kappa^{-3})$ into eq. (5) yields
$$\frac{1}{K}\sum_{k=0}^{K-1}\|\nabla\Phi(x_k)\|^2 = O\Big(\frac{\kappa^3}{K} + \epsilon\Big).$$
Then, to achieve an $\epsilon$-accurate stationary point, we have $K = O(\kappa^3\epsilon^{-1})$, and hence we have the following complexity results.

• Gradient complexity: $\mathrm{Gc}(\epsilon) = K(N+2) = O\big(\kappa^4\epsilon^{-1}\ln\frac{\kappa}{\epsilon}\big)$.

• Matrix-vector product complexity: $\mathrm{MV}(\epsilon) = 2KN = O\big(\kappa^4\epsilon^{-1}\ln\frac{\kappa}{\epsilon}\big)$.

Then, the proof is complete.

L Proof of Corollary 6
Based on the choice of $\alpha$ and $N$, we have
$$w_N = \Theta\big(\alpha(\rho + \alpha\rho L N)N\big) = \Theta(1), \qquad \lambda_N = \frac{4M^2 w_N^2 + 4\big(1-\frac14\alpha\mu\big)L^2(1+\alpha L N)^2}{1-\frac14\alpha\mu - (1-\alpha\mu)^N\big(1+\frac12\alpha\mu\big)} = \Theta(\kappa),$$
and hence $w = \Theta(\kappa^4)$ and $\tau = \Theta(\kappa)$. Then, we have $\beta = \Theta(\kappa^{-3})$, and hence we obtain from eq. (5) that
$$\frac{1}{K}\sum_{k=0}^{K-1}\|\nabla\Phi(x_k)\|^2 = O\Big(\frac{\kappa^3}{K} + \frac{M^2 L^2}{\alpha\mu^3}(1-\alpha\mu)^{2N}\Big),$$
which finishes the proof.

M Proof of Theorem 4
We consider the following construction of loss functions:
$$\begin{aligned}
f(x, y) &= \frac12 x^\top Z_x x + M\mathbf 1^\top y, \\
g(x, y) &= \frac12 y^\top Z_y y - L x^\top y + \mathbf 1^\top y, \qquad (62)
\end{aligned}$$
where $Z_x = Z_y = \begin{bmatrix} L & 0 \\ 0 & \mu \end{bmatrix}$ and $M$ is a positive constant. First note that the minimizer of the inner-level function $g(x, \cdot)$ and the total gradient $\nabla\Phi(x)$ are given by
$$y^*(x) = Z_y^{-1}(Lx - \mathbf 1), \qquad \nabla\Phi(x) = Z_x x + L M Z_y^{-1}\mathbf 1. \qquad (63)$$
Based on the updates of the ITD-based method in Algorithm 2, we have, for $t = 1, \dots, N$,
$$y_k^t = y_k^{t-1} - \alpha\big(Z_y y_k^{t-1} - L x_k + \mathbf 1\big). \qquad (64)$$

Taking the derivative w.r.t. $x_k$ on both sides of eq. (64) yields
$$\frac{\partial y_k^t}{\partial x_k} = (I - \alpha Z_y)\frac{\partial y_k^{t-1}}{\partial x_k} + \alpha L I. \qquad (65)$$
Telescoping the above eq. (65) over $t$ from $1$ to $N$ and using the fact that $\frac{\partial y_k^0}{\partial x_k} = 0$ yields
$$\frac{\partial y_k^N}{\partial x_k} = \alpha L\sum_{t=0}^{N-1}(I - \alpha Z_y)^t,$$
which, in conjunction with the update $x_{k+1} = x_k - \beta\frac{\partial f(x_k, y_k^N)}{\partial x_k}$, yields
$$x_{k+1} = x_k - \beta\Big(Z_x x_k + \alpha L M\sum_{t=0}^{N-1}(I - \alpha Z_y)^t\mathbf 1\Big). \qquad (66)$$
For notational convenience, let $Z_N = \alpha\sum_{t=0}^{N-1}(I - \alpha Z_y)^t$ and $x_0 = \mathbf 1$. Telescoping eq. (66) over $k$ from $0$ to $K-1$ yields
$$\begin{aligned}
x_K &= (I - \beta Z_x)^K\mathbf 1 - L M\sum_{k=0}^{K-1}(I - \beta Z_x)^k\beta Z_N\mathbf 1 \\
&= (I - \beta Z_x)^K\mathbf 1 - L M Z_x^{-1}Z_N\mathbf 1 + L M\sum_{k=K}^{\infty}(I - \beta Z_x)^k\beta Z_N\mathbf 1 \\
&= (I - \beta Z_x)^K\mathbf 1 - L M Z_x^{-1}Z_N\mathbf 1 + L M(I - \beta Z_x)^K Z_x^{-1}Z_N\mathbf 1. \qquad (67)
\end{aligned}$$

Rearranging the above eq. (67) yields
$$\begin{aligned}
\big\|Z_x\big(x_K + L M Z_x^{-1}Z_y^{-1}\mathbf 1\big)\big\|^2 &= \big\|Z_x(I - \beta Z_x)^K\mathbf 1 + L M(I - \alpha Z_y)^N Z_y^{-1}\mathbf 1 + L M(I - \beta Z_x)^K Z_N\mathbf 1\big\|^2 \\
&\ge L^2 M^2\big\|(I - \alpha Z_y)^N Z_y^{-1}\mathbf 1\big\|^2 + \big\|Z_x(I - \beta Z_x)^K\mathbf 1\big\|^2 + L^2 M^2\big\|(I - \beta Z_x)^K Z_N\mathbf 1\big\|^2,
\end{aligned}$$
which, in conjunction with $\alpha \le \frac{1}{L}$, yields
$$\|\nabla\Phi(x_K)\|^2 \ge L^2 M^2\big\|(I - \alpha Z_y)^N Z_y^{-1}\mathbf 1\big\|^2 = \Theta\Big(\frac{L^2 M^2}{\mu^2}(1-\alpha\mu)^{2N}\Big), \qquad (68)$$
which holds for all $K$.
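The construction above is explicit enough to simulate. The sketch below runs the ITD update of eq. (66) on the two-dimensional instance of eq. (62) with illustrative values $L = 1$, $\mu = 0.1$, $M = 1$, $\alpha = 1/L$, $\beta = 0.1$, $N = 5$ (assumptions for the demo, not values fixed by the theorem) and confirms that $\|\nabla\Phi(x_K)\|^2$ never drops below the floor in eq. (68):

```python
L, mu, M = 1.0, 0.1, 1.0
alpha, beta, N, K = 1.0 / L, 0.1, 5, 2000
z = [L, mu]                       # diagonal of Zx = Zy

# Z_N = alpha * sum_{t<N} (I - alpha Zy)^t, computed per coordinate
ZN = [alpha * sum((1 - alpha * zi) ** t for t in range(N)) for zi in z]

x = [1.0, 1.0]                    # x_0 = 1
for _ in range(K):
    # eq. (66): x_{k+1} = x_k - beta (Zx x_k + L M Z_N 1)
    x = [xi - beta * (zi * xi + L * M * zni) for xi, zi, zni in zip(x, z, ZN)]

# eq. (63): grad Phi(x) = Zx x + L M Zy^{-1} 1
grad_sq = sum((zi * xi + L * M / zi) ** 2 for xi, zi in zip(x, z))
# eq. (68): lower bound L^2 M^2 ||(I - alpha Zy)^N Zy^{-1} 1||^2
floor = L ** 2 * M ** 2 * sum(((1 - alpha * zi) ** N / zi) ** 2 for zi in z)
assert grad_sq >= floor > 0
print(grad_sq, floor)
```

Even after many outer iterations the gradient norm stays pinned at the $(1-\alpha\mu)^{2N}$-level floor, illustrating why ITD-BiO needs $N$ to grow with the target accuracy.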
