
SIAM J. OPTIM.
Vol. 23, No. 3, pp. 1607–1633
© 2013 Society for Industrial and Applied Mathematics

ACCELERATED AND INEXACT FORWARD-BACKWARD ALGORITHMS∗

SILVIA VILLA†, SAVERIO SALZO‡, LUCA BALDASSARRE§, AND ALESSANDRO VERRI‡

Abstract. We propose a convergence analysis of accelerated forward-backward splitting methods for composite function minimization, when the proximity operator is not available in closed form and can only be computed up to a certain precision. We prove that the 1/k² convergence rate for the function values can be achieved if the admissible errors are of a certain type and satisfy a sufficiently fast decay condition. Our analysis is based on the machinery of estimate sequences first introduced by Nesterov for the study of accelerated gradient descent algorithms. Furthermore, we give a global complexity analysis, taking into account the cost of computing admissible approximations of the proximal point. An experimental analysis is also presented.

Key words. convex optimization, accelerated forward-backward splitting, inexact proximity operator, estimate sequences, total variation

AMS subject classifications. 90C25, 49M07, 65K10, 94A08

DOI. 10.1137/110844805

1. Introduction. Let H be a Hilbert space and consider the optimization problem

(P)    inf_{x∈H} F(x),    F(x) = f(x) + g(x),

where
(H1) g : H → R ∪ {+∞} is proper, lower semicontinuous (l.s.c.), and convex,
(H2) f : H → R is convex differentiable and ∇f is L-Lipschitz continuous on H with L > 0, namely,

‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖    ∀x, y ∈ H.
We denote by F∗ the infimum of F . We do not require in general the infimum to be


attained, nor to be finite. It is well known that problem (P) covers a wide range of
signal recovery problems (see [18] and references therein), including constrained and
regularized least-squares problems [27, 25, 51, 21], (sparse) regularization problems
in image processing, such as total variation denoising and deblurring (see, e.g., [50,
13, 12]), as well as machine learning tasks involving nondifferentiable penalties (see,
e.g., [4, 23, 42]).
The variety of applications to real-life problems stimulated the search for simple first-order methods to solve (P) that can be applied to large scale problems. In this area, a significant amount of research has been devoted to forward-backward splitting methods, which decouple the contributions of the functions f and g into a gradient descent step determined by f and a backward implicit step
∗Received by the editors August 17, 2011; accepted for publication (in revised form) May 13, 2013; published electronically August 6, 2013.
http://www.siam.org/journals/siopt/23-3/84480.html
†Laboratory for Computational and Statistical Learning, Istituto Italiano di Tecnologia and Massachusetts Institute of Technology, 16163, Genova, Italy ([email protected]).
‡DIBRIS, University of Genova, 16145, Genova, Italy ([email protected], [email protected]).
§Laboratory for Information and Inference Systems, EPFL STI IEL LIONS, ELD 243 (Batiment ELD), Station 11, CH-1015 Lausanne, Switzerland (luca.baldassarre@epfl.ch).



induced by g [17, 18, 35]. These schemes are also known under the name of proximal
gradient methods [61], since the implicit step relies on the computation of the so-
called proximity operator, introduced by Moreau in [39]. Though appealing for their
simplicity, gradient-based methods often exhibit a slow speed of convergence. For
this reason, resorting to the ideas contained in the work of Nesterov [44], there has
recently been an active interest in accelerations and modifications of the classical
forward-backward splitting algorithm [61, 45, 7]. We will study the following general
accelerated scheme

(1.1)    xk+1 = proxλk g(yk − λk∇f(yk)),
         yk+1 = c1,k xk+1 + c2,k xk + c3,k yk

for suitably chosen constants ci,k (i = 1, 2, 3, k ∈ N) and parameters λk > 0, where proxλk g : H → H denotes the proximity operator associated with λk g. In particular, choosing c3,k = 0, procedure (1.1) encompasses the popular fast iterative shrinkage-thresholding algorithm (FISTA), whose optimal (in the sense of [43]) 1/k² convergence rate for the objective values F(xk) − F∗ has been proved in [7]. Furthermore,
the effectiveness of such accelerations has been tested empirically on several relevant
problems (see, e.g., [6, 8]).
Unfortunately, the proximity operator is in general not available in exact form
or its computation may be very demanding. Just to mention some examples, this
happens when applying proximal methods to image deblurring with total variation
[12, 6, 26], or to structured sparsity regularization problems in machine learning and
inverse problems [67, 28, 33, 42, 49, 2]. In those cases, the proximity operator is
usually computed using ad hoc algorithms, and therefore inexactly. See [17] for a list of possible approaches. The entire procedure for solving problem (P) thus consists of two nested loops: an external one of type (1.1) and an internal one which serves to approximately compute the proximity operator occurring in the first row of (1.1). Hence, the problem arises of studying the convergence of accelerated forward-backward algorithms under possible perturbations of proximal points. In [6], FISTA is applied to the total variation (TV) image deblurring problem, and it is shown empirically that it may generate divergent sequences when the prox subproblem is solved inexactly. However, no theoretical analysis is carried out for the role of inexactness in the convergence and acceleration properties of the algorithm.
1.1. Main contributions. From a theoretical point of view, the contribution of this paper is threefold: first, we show that by considering a suitable notion of admissible approximation of the proximal point, it is possible to retain the 1/k² convergence rate for the inexact version of the accelerated forward-backward scheme (1.1). In particular, we prove that the proposed algorithm shares the 1/k² convergence rate in the objective values if the computation of the proximity operator at the kth step is performed up to a precision εk, with εk = O(1/k^q) and q > 3/2. This assumption clearly implies summability of the errors, which is a common requirement in similar contexts (see, e.g., [48, 18]). We underline, however, that, for slower convergence rates, summability can be avoided and the requirement εk = O(1/k^q) with q > 1/2 is sufficient. The second main contribution of the paper is the study of the global iteration complexity of (1.1), which also takes into account the cost of computing admissible approximations of the proximity operator. Furthermore, we show that the proposed inexactness criterion has an equivalent formulation in terms of the duality gap, which can be easily checked in practice. This allows us to handle most significant penalty terms and different algorithms to compute the proximal point, as, for instance, those in [12, 19, 14].
This resolves the issue of convergence and applicability of the two-loop algorithm for many real-life problems, in the same spirit as [15].


The third contribution concerns the techniques we employ to obtain the result. The algorithm derivation relies on the machinery of estimate sequences [44]. Leveraging the ideas developed in [52], we propose a flexible method to build estimate sequences, which can easily be adapted to deal with inexactness in accelerated forward-backward algorithms. It is worthwhile to mention that this framework includes the well-known FISTA [7].
Finally, we present numerical experiments investigating the impact of errors on the acceleration property. We also illustrate the effectiveness of the proposed notion of inexactness on two real-life problems, making performance comparisons with the nonaccelerated version and with a benchmark primal-dual algorithm [14].

1.2. Related work. Forward-backward algorithms belong to the wider class of proximal splitting methods [17]. All these methods require the computation of the proximity operator; consequently, approximations of proximal points have been studied in a number of papers, and the following list does not claim to be exhaustive. For nonaccelerated schemes, convergence in the presence of errors has been addressed in various contexts, ranging from proximal point algorithms [3, 48, 29, 34, 20, 19, 1, 59], hybrid extragradient-proximal point algorithms [55, 56, 57, 63], and generalized proximal algorithms using Bregman distances [24, 58, 11] to forward-backward splitting [18].
On the other hand, only very recently, accelerated proximal methods under inex-
act evaluation of the proximity operator have been studied. In [31, 52] the classical
proximal point algorithm is treated (f = 0 in (1.1)). Paper [38] considers inexact ac-
celerated hybrid extragradient-proximal methods, but actually the framework is shown
to include only the case of the exact accelerated forward-backward algorithm. In [22],
convergence rates for an accelerated projected-subgradient method are proved. The case of an exact projection step is considered, and the authors assume the availability of an
oracle that yields global lower and upper bounds on the function. Although interest-
ing, it leads to a slower convergence rate than proximal-gradient methods. Summariz-
ing, none of the studies above covers the case of accelerated inexact forward-backward
algorithms.
Finally, we mention the subsequent, but independent, work [54], where an analysis
of an accelerated proximal-gradient method with inexact proximity operator is given
too, and the same convergence rates are proved. While the accelerated scheme is very similar (though not exactly equal¹), the employed techniques are completely different. In particular, the estimate sequences framework which motivates the updating rules for the parameters and auxiliary sequences is not used in [54]. The inexactness notion is different as well: our choice is more demanding, but leads to a better (weaker) dependence on the errors decay. For instance, in [54] the authors obtain convergence of the algorithm for εk = O(1/k^{1+δ}), while we only need εk = O(1/k^{1/2+δ}), and the optimal convergence rate of the algorithm for εk = O(1/k^{2+δ}), while Theorem 4.4 requires only εk = O(1/k^{3/2+δ}). For a comparison between the two errors see
section 2. For completeness, in Appendix A we show that the framework of estimate
sequences can handle the type of errors considered in [54] as well, but only a 1/k
convergence rate can be obtained.

¹There, the sequence yk in (1.1) is updated by setting c3,k = 0, and the choice of the parameters c1,k, c2,k is different too.


Note also that none of the abovementioned papers studies the rate of convergence of the nested algorithm, as we do in section 5.
Downloaded 08/19/14 to 140.117.111.1. Redistribution subject to SIAM license or copyright; see http://www.siam.org/journals/ojsa.php

1.3. Outline of the paper. In section 2, we give a notion of admissible ap-


proximation of proximal points and discuss its applicability. Section 3 reviews the
framework of Nesterov’s estimate sequences and gives a general updating rule for
recursively constructing estimate sequences for convex problems. In section 4, we
present a new general accelerated scheme for forward-backward splitting algorithms
and a convergence theorem under admissible approximations of proximal points. In
section 5, we discuss the subproblem of computing inexact proximal points and the
complexity of the resulting global nested algorithm. Section 6 contains a numeri-
cal evaluation of the effect of errors in the computation of the proximal points on
the forward-backward algorithm (1.1). Finally, Appendix A discusses convergence of
accelerated forward-backward splitting algorithms for the error notion considered in
[54].
2. Inexact proximal points. The algorithms analyzed in this paper are based
on the computation of the proximity operator of a convex function, introduced by
Moreau [39, 40, 41], and then made popular in the optimization literature by Martinet
[36] and Rockafellar [48, 47].
Let R̄ = R ∪ {±∞} be the extended real line. For a proper, convex, and l.s.c. function g : H → R̄, λ > 0, and y ∈ H, the proximal point of y with respect to λg is defined by setting

(2.1)    proxλg(y) := argmin_{x∈H} { g(x) + ‖x − y‖²/(2λ) }
and the mapping proxλg : H → H is called the proximity operator of λg. If we let Φλ(x) = g(x) + ‖x − y‖²/(2λ), the first-order optimality condition for a convex minimum problem yields

(2.2)    z = proxλg(y) ⇐⇒ 0 ∈ ∂Φλ(z) ⇐⇒ (y − z)/λ ∈ ∂g(z),

where ∂ denotes the subdifferential operator.
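As a concrete illustration of (2.1)–(2.2), here is a minimal Python sketch for the standard case g = τ‖·‖₁, whose proximity operator is the componentwise soft-thresholding map (a classical fact, not specific to this paper); the function names are hypothetical.

```python
import numpy as np

def prox_l1(y, lam, tau=1.0):
    """Proximity operator of lam * tau * ||.||_1 (soft thresholding)."""
    t = lam * tau
    return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)

# Check the optimality condition (2.2): (y - z)/lam must lie in
# tau * subdifferential of ||.||_1 at z, i.e., componentwise in [-tau, tau],
# with equality to tau*sign(z_i) wherever z_i != 0.
y = np.array([1.5, -0.2, 0.7, -3.0])
lam, tau = 0.5, 1.0
z = prox_l1(y, lam, tau)
xi = (y - z) / lam
ok = np.all(np.abs(xi) <= tau + 1e-12) and np.allclose(
    xi[z != 0], tau * np.sign(z[z != 0]))
print(z, xi, ok)
```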
We already noted that, from a practical point of view, it is essential to replace
the proximal point with an approximate version of it.
2.1. The proposed notion. We employ here a concept of approximation of the proximal point based on the ε-subdifferential, which is indeed a relaxation of condition (2.2). We recall that, for ε ≥ 0, the ε-subdifferential of g at the point z ∈ dom g is the set ∂εg(z) = {ξ ∈ H : g(x) ≥ g(z) + ⟨x − z, ξ⟩ − ε, ∀x ∈ H}.
Definition 2.1. Let ε ≥ 0. We say that z ∈ H is an approximation of proxλg(y) with ε-precision, and we write z ≈ε proxλg(y), iff

(2.3)    (y − z)/λ ∈ ∂_{ε²/(2λ)} g(z).

Note that if z ≈ε proxλg(y), then necessarily z ∈ dom g, and hence the allowed approximations are always feasible. This notion was first proposed, in the context of the proximal point algorithm, in [34] and successfully used in, e.g., [1, 19, 52]. A relative version of criterion (2.3) has recently been proposed for nonaccelerated proximal methods in the preprint [37], which allows one to interpret the (exact) forward-backward splitting algorithm as an instance of an inexact proximal point algorithm.


Fig. 2.1. Admissible approximation of PC (y).

Example 1. We describe the case where g is the indicator function of a closed and convex set C, and the proximity operator is consequently the projection onto C, denoted by PC. Given y ∈ H, it holds

(2.4)    z ≈ε PC(y) ⇐⇒ z ∈ C and ⟨x − z, y − z⟩ ≤ ε²/2 ∀x ∈ C.
Recalling that the projection PC(y) of a point y is the unique point z ∈ C which satisfies ⟨x − z, y − z⟩ ≤ 0 for all x ∈ C, approximations of this type are therefore the points enjoying a relaxed formulation of this property. From a geometric point of view, the characterization of the projection ensures that the convex set C is entirely contained in the half-space determined by the tangent hyperplane at the point PC(y), namely, C ⊆ {x ∈ H : ⟨x − PC(y), y − PC(y)⟩ ≤ 0}. To check that z satisfies condition (2.4), it is enough to verify that C is entirely contained in the negative half-space determined by the (affine) hyperplane of equation

hε :    ⟨x − z, (y − z)/‖y − z‖⟩ = ε²/(2‖y − z‖),

which is normal to y − z and at distance ε²/(2‖y − z‖) from z. See Figure 2.1.
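To make Example 1 concrete, the following minimal Python sketch tests criterion (2.4) when C is a Euclidean ball, where the supremum over C is available in closed form; this is a hypothetical example, not from the paper, and the names are illustrative.

```python
import numpy as np

def is_eps_projection(z, y, eps, radius=1.0):
    """Test criterion (2.4) for C = Euclidean ball of given radius:
    z in C and <x - z, y - z> <= eps^2/2 for all x in C.
    The sup over x in C of <x, y - z> equals radius*||y - z||."""
    if np.linalg.norm(z) > radius + 1e-12:
        return False
    d = y - z
    nd = np.linalg.norm(d)
    if nd == 0:
        return True
    # sup_{x in C} <x - z, d> = radius*||d|| - <z, d>
    return radius * nd - z @ d <= eps**2 / 2 + 1e-12

y = np.array([2.0, 0.0])
z_exact = y / np.linalg.norm(y)                 # P_C(y) for the unit ball
z_pert = z_exact + np.array([0.0, 0.05])
z_pert /= max(1.0, np.linalg.norm(z_pert))      # keep the perturbed point in C
print(is_eps_projection(z_exact, y, 0.0))       # True: exact projection
print(is_eps_projection(z_pert, y, 0.5))        # True: admissible for eps = 0.5
```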
In the following we provide an analysis of the notion of inexactness given in Definition 2.1, which will clarify the nature of these approximations and the scope of applicability. To this purpose, we will make use of the duality technique, an approach that is quite common in signal recovery and image processing applications [18, 12, 16]. The starting point is the Moreau decomposition formula [41, 18], stating

(2.5)    y − λ proxg∗/λ(y/λ) = proxλg(y),

where g∗ : H → R̄, g∗(y) = sup_{x∈H}(⟨x, y⟩ − g(x)), is the conjugate functional of g.


When proxg∗/λ is easy to compute, formula (2.5) provides a convenient method to get the proximity operator of λg.
A remarkable property of inexact proximal points based on criterion (2.3) is that, in a sense, the Moreau decomposition still holds: if y, z ∈ H and ε, λ > 0, then, letting η = ε/λ, it is

(2.6)    z ≈η proxg∗/λ(y/λ) ⇐⇒ y − λz ≈ε proxλg(y).

This arises immediately from Definition 2.1 and the following equivalence (see Theorem 2.4.4, item (iv), in [65]):

y − λz ∈ ∂_{η²λ/2} g∗(z) ⇐⇒ z ∈ ∂_{ε²/(2λ)} g(y − λz).


Next, we prove that the proposed inexactness criterion can be formulated in terms of the duality gap. This leads to a very natural and simple test for assessing admissible approximations. Without loss of generality, we consider the case where g has the following structure:

(2.7)    g(x) = ω(Bx),

with B : H → G a bounded linear operator between Hilbert spaces and ω : G → R̄ a proper, l.s.c., convex function. The structure (2.7) often arises in regularization methods for ill-posed inverse problems [13, 10, 28, 53, 67, 16]. By (2.1), finding proxλg(y) requires the solution of the minimization problem

(2.8)    min_{x∈H} Φλ(x),    Φλ(x) = ω(Bx) + ‖x − y‖²/(2λ).

From now on, we assume ω is continuous at Bx0 for some x0 ∈ H. Then, the Fenchel–Moreau–Rockafellar duality formula (see Corollary 2.8.5 in [65]) states that

(2.9)    min_{x∈H} Φλ(x) = − min_{v∈G} Ψλ(v),

where

(2.10)    Ψλ(v) = ‖λB∗v − y‖²/(2λ) + ω∗(v) − ‖y‖²/(2λ),

or, equivalently, the minimum of the duality gap is zero:

(2.11)    0 = min_{(x,v)∈H×G} G(x, v),    G(x, v) := Φλ(x) + Ψλ(v).

Moreover, if v̄ is a solution of the dual problem min_v Ψλ(v), then z̄ = y − λB∗v̄ solves the primal problem (2.8). This also implies that min_v G(y − λB∗v, v) = 0. The next proposition shows that inexact proximal points have the same structure as the exact ones.
Proposition 2.2. If z ≈ε proxλg(y), then there exists v ∈ dom ω∗ such that z = y − λB∗v.
Proof. If z ≈ε proxλg(y), by definition (y − z)/λ ∈ ∂_{ε²/(2λ)}g(z). Then [65, Theorem 2.4.2, item (ii)] ensures that g(z) + g∗((y − z)/λ) ≤ ⟨z, (y − z)/λ⟩ + ε²/(2λ), hence g∗((y − z)/λ) < +∞. Using the Fenchel–Moreau–Rockafellar duality [65, Corollary 2.8.5], one can prove that g∗(w) = min_{B∗v=w} ω∗(v). Thus there exists v ∈ G such that B∗v = (y − z)/λ and ω∗(v) < +∞, and the statement follows.
Proposition 2.3. Let η = ε/λ, v ∈ G, and consider the following statements:
(a) G(y − λB∗v, v) ≤ ε²/(2λ);
(b) B∗v ≈η proxg∗/λ(y/λ);
(c) y − λB∗v ≈ε proxλg(y).
Then (a) ⇒ (b) and (b) ⇔ (c). Furthermore, if ω∗(v) = g∗(B∗v), they are all equivalent.
Proof. The equivalence of (b) and (c) comes directly from the inexact Moreau decomposition (2.6). Let us show that (a) ⇒ (b). From the definition of G and using the fact that ω∗(v) ≥ g∗(B∗v), it follows that

(2.12)
G(y − λB∗v, v)
  = (1/(2λ))[‖λB∗v‖² − 2⟨λB∗v, y⟩] + (1/(2λ))‖λB∗v‖² + sup_{w∈H} [⟨w, y − λB∗v⟩ − g∗(w)] + ω∗(v)
  = ⟨B∗v, λB∗v − y⟩ + sup_{w∈H} [⟨w, y − λB∗v⟩ − g∗(w)] + ω∗(v)
  ≥ sup_{w∈H} [⟨w − B∗v, y − λB∗v⟩ − g∗(w)] + g∗(B∗v)
  = sup_{w∈H} −[g∗(w) − g∗(B∗v) − ⟨w − B∗v, y − λB∗v⟩].

Therefore, if G(y − λB∗v, v) ≤ ε²/(2λ), setting η = ε/λ, it holds

(2.13)    ∀w ∈ H:    g∗(w) − g∗(B∗v) ≥ ⟨w − B∗v, y − λB∗v⟩ − η²λ/2,

which is equivalent to y − λB∗v ∈ ∂_{η²λ/2} g∗(B∗v) and hence to B∗v ≈η proxg∗/λ(y/λ). As regards the second part of the statement, assuming ω∗(v) = g∗(B∗v), the inequality in (2.12) becomes an equality and condition (a) is then equivalent to (2.13). Thus, the reverse implication (b) ⇒ (a) follows.
Remark 1. In Proposition 2.3 the assumption ω∗(v) = g∗(B∗v), guaranteeing the equivalence of statements (a), (b), (c), occurs in the following cases:
• ω is positively homogeneous. Indeed in that case ω∗ = δS with S = ∂ω(0) and g∗ = δK with K = ∂g(0) = B∗(S). Thus, if v ∈ S, it is ω∗(v) = δS(v) = δK(B∗v) = g∗(B∗v). This entails that

G(y − λB∗v, v) ≤ ε²/(2λ) ⇐⇒ λB∗v ≈ε PλK(y) ⇐⇒ y − λB∗v ≈ε proxλg(y).

• B is surjective. Indeed, in that case g∗(B∗v) = sup_{x∈H}(⟨Bx, v⟩ − ω(Bx)) = ω∗(v). For instance, for B = id, it holds

G(y − λv, v) ≤ ε²/(2λ) ⇐⇒ v ≈η proxg∗/λ(y/λ).

We underline that in the two cases above the proposed inexact notion of prox is fully characterized by means of the duality gap.
Summarizing, the implication (a) ⇒ (c) stated in Proposition 2.3 ensures that admissible approximations of proximal points, in the sense of Definition 2.1, can always be computed by approximately minimizing the duality gap G(y − λB∗v, v). In general, condition (a) of Proposition 2.3 is only a sufficient condition to get inexact proximal points with precision ε. However, as discussed in Remark 1, it becomes a full characterization of inexact proximal points for a relevant class of penalties. We finally highlight that condition (a) can be easily checked in practice, and will be at the basis of the analysis of the convergence rate for the nested procedure in section 5.2.
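Since condition (a) is the practically checkable test, here is a minimal Python sketch of the duality gap G(y − λB∗v, v) from (2.10)–(2.11), for a hypothetical instance with ω = τ‖·‖₁ (so ω∗ is the indicator of the ℓ∞-ball of radius τ); it is an illustration under these assumptions, not the paper's code.

```python
import numpy as np

def duality_gap(z, v, y, lam, B, omega, omega_conj):
    """G(z, v) = Phi_lam(z) + Psi_lam(v) for g = omega(B x), cf. (2.10)-(2.11).
    omega / omega_conj evaluate w and w* (returning np.inf outside the domain)."""
    phi = omega(B @ z) + np.sum((z - y) ** 2) / (2 * lam)
    psi = (np.sum((lam * (B.T @ v) - y) ** 2) - np.sum(y ** 2)) / (2 * lam) \
          + omega_conj(v)
    return phi + psi

# Hypothetical instance: omega = tau*||.||_1, so omega* = indicator of
# the ell-infinity ball of radius tau.
tau, lam, eps = 0.5, 1.0, 0.1
omega = lambda u: tau * np.sum(np.abs(u))
omega_conj = lambda v: 0.0 if np.max(np.abs(v)) <= tau + 1e-12 else np.inf

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 4))
y = rng.standard_normal(4)
v = np.clip(rng.standard_normal(5), -tau, tau)   # dual feasible point
z = y - lam * (B.T @ v)                          # primal point, cf. Prop. 2.2
gap = duality_gap(z, v, y, lam, B, omega, omega_conj)
print("gap:", gap, "criterion (a) satisfied:", gap <= eps**2 / (2 * lam))
```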
2.2. Comparison with other kinds of approximation. Other notions of
inexactness for the proximity operator have been considered in the literature. One of
the first is

(2.14)    d(0, ∂Φλ(z)) ≤ ε/λ,

which was proposed in [48] and treated also in [30].


Another notion, which we shall use in the appendix, replaces the exact minimum in (2.1) with ε²/(2λ)-minima, and is defined as follows:

(2.15)    z ≃ε proxλg(y)  :⇐⇒  Φλ(z) ≤ inf Φλ + ε²/(2λ).

The condition on the right-hand side of (2.15) is equivalent to 0 ∈ ∂_{ε²/(2λ)}Φλ(z) and implies, by the strong convexity of Φλ, ‖z − proxλg(y)‖ ≤ ε (see [52]). This type of error was first considered in [3] and then employed, for instance, in [19, 52, 66]. Lemma 1 in [52] shows that the criterion in (2.15) is more general than both the ones in (2.3) and (2.14). We also note that (again from Lemma 1 in [52]) the error criterion proposed in [38, 55] for the approximate hybrid extragradient-proximal point algorithm corresponds to a relative version of (2.15).
Here, to help position the proposed criterion, we give a proposition and a corollary that directly link approximations in the sense of (2.3) with those in the sense of (2.15), valid for a subclass of functions g.
Proposition 2.4. Let g : H → R̄ be proper, convex, and l.s.c. with dom g bounded, and y, z ∈ H. For every ε > 0, if 0 < δ ≤ diam(dom g) and diam(dom g)δ ≤ ε²/2, then

z ≃δ proxλg(y) =⇒ z ≈ε proxλg(y).
Proof. Let z ≃δ proxλg(y). Thanks to Lemma 1 in [52], there exist δ1, δ2 ≥ 0 with δ1² + δ2² ≤ δ² and e ∈ H, ‖e‖ ≤ δ2, such that (y + e − z)/λ ∈ ∂_{δ1²/(2λ)}g(z). Therefore, for every x ∈ dom g,

λg(x) − λg(z) ≥ ⟨x − z, y − z⟩ − diam(dom g)δ2 − δ1²/2.

Now it is easy to show that, if 0 < δ ≤ diam(dom g), then

sup_{δ1²+δ2²≤δ²} [ diam(dom g)δ2 + δ1²/2 ] = diam(dom g)δ.

Thus, if diam(dom g)δ ≤ ε²/2, it holds λg(x) − λg(z) ≥ ⟨x − z, y − z⟩ − ε²/2 for every x ∈ dom g, which proves that (y − z)/λ ∈ ∂_{ε²/(2λ)}g(z).
Proposition 2.4 states that for each ε > 0 one can get approximations of proximal
points in the sense of Definition 2.1 from approximations in the sense of (2.15) as
soon as the precision δ is chosen small enough.
Corollary 2.5. Let g : H → R̄ be proper, convex, and l.s.c. with dom g∗ bounded, and y, z ∈ H. For any ε > 0, if 0 < σ ≤ diam(dom g∗) and σλ² diam(dom g∗) ≤ ε²/2, then

(2.16)    z ≃σ proxg∗/λ(y/λ) =⇒ y − λz ≈ε proxλg(y).

In particular, suppose g is positively homogeneous (i.e., g(αx) = αg(x) for α ≥ 0). Then, setting K := ∂g(0), if 0 < σ ≤ λ diam K and σλ diam K ≤ ε²/2, it holds

(2.17)    z ≃σ PλK(y) =⇒ y − z ≈ε proxλg(y).
Proof. Set η = ε/λ. Then the condition σλ² diam(dom g∗) ≤ ε²/2 is equivalent to σ diam(dom g∗) ≤ η²/2. Therefore, by applying Proposition 2.4 to the function g∗, we obtain

z ≃σ proxg∗/λ(y/λ) =⇒ z ≈η proxg∗/λ(y/λ).

Then, the inexact Moreau decomposition (2.6) gives y − λz ≈ε proxλg(y).


In case g is positively homogeneous, λg is positively homogeneous too and (λg)∗ = δ_{∂(λg)(0)} = δλK, where K = ∂g(0). The hypotheses on σ, given in the second part of the statement, ensure that 0 < σ ≤ diam(dom(λg)∗) and σ diam(dom(λg)∗) ≤ ε²/2. Thus (2.16) can be applied to the function λg, obtaining (2.17).


Remark 2. The hypothesis that dom g∗ is bounded in Corollary 2.5 is satisfied (in finite dimension) for many significant regularization terms, like total variation, nuclear norm, and structured sparsity regularization, and it has been considered in similar contexts, for instance, in [14, 9].
3. Nesterov’s estimate sequences. In [44], Nesterov illustrates a flexible
mechanism to produce minimizing sequences for an optimization problem. The idea
is to recursively generate a sequence of simple functions that approximate F . In this
section, we briefly describe this method and review the general results obtained in [52]
for constructing quadratic estimate sequences when F is convex. We do not provide
proofs, referring to the mentioned works for details.
3.1. General framework. We start by providing the definition and motivation
of estimate sequences.
Definition 3.1. A pair of sequences (ϕk)k∈N, ϕk : H → R, and (βk)k∈N, βk ≥ 0, is called an estimate sequence of a proper function F : H → R̄ iff βk → 0 and

(3.1)    ∀x ∈ H, ∀k ∈ N :    ϕk(x) − F(x) ≤ βk(ϕ0(x) − F(x)).
The next statement represents the main result about estimate sequences and
explains how to use them to build minimizing sequences and get corresponding con-
vergence rates.
Theorem 3.2. Let ((ϕk )k∈N , (βk )k∈N ) be an estimate sequence of F and denote by
(ϕk )∗ the infimum of ϕk . If, for some sequences (xk )k∈N , xk ∈ H and (δk )k∈N , δk ≥ 0
(3.2) F (xk ) ≤ (ϕk )∗ + δk ,
then, for any x ∈ domF ,
(3.3) F (xk ) ≤ βk (ϕ0 (x) − F (x)) + δk + F (x).
Thus, if δk → 0 (recalling that βk → 0 as well), (xk)k∈N is a minimizing sequence for F, that is, lim_{k→∞} F(xk) = F∗. If in addition the infimum F∗ is attained at some point x∗ ∈ H, then the rate of convergence F(xk) − F∗ ≤ βk(ϕ0(x∗) − F∗) + δk holds true.
3.2. Construction of an estimate sequence. In this section, we review a general procedure, introduced in [52], for generating an estimate sequence of a proper, l.s.c., and convex function F : H → R̄. First of all, we deal with the generation of the sequence of functions (ϕk)k∈N.
For any sequence of parameters ((zk, ηk, ξk, αk))k∈N, (zk, ηk, ξk, αk) ∈ dom F × R₊ × H × [0, 1), and any function ϕ : H → R, we recursively define the sequence of functions (ϕk)k∈N by setting ϕ0 = ϕ and

(3.4)    ϕk+1(x) = (1 − αk)ϕk(x) + αk(F(zk+1) + ⟨x − zk+1, ξk+1⟩ − ηk).

One can prove that if ξk+1 ∈ ∂ηk F(zk+1), then

(3.5)    ϕk+1(x) − F(x) ≤ (1 − αk)(ϕk(x) − F(x)),

and, by induction, condition (3.1) is satisfied with βk = ∏_{i=0}^{k−1}(1 − αi). If ∑_{k∈N} αk = +∞, then βk → 0 and the pair ((ϕk)k∈N, (βk)k∈N) is an estimate sequence of F.


Moreover, if the starting ϕ0 = ϕ is a quadratic function, written in canonical form as ϕ0(x) = (ϕ0)∗ + (A0/2)‖x − ν0‖², with (ϕ0)∗ ∈ R, A0 > 0, ν0 ∈ H, then all the ϕk's, defined according to (3.4), are quadratic functions

(3.6)    ϕk(x) = (ϕk)∗ + (Ak/2)‖x − νk‖²,

and Ak, νk, and (ϕk)∗ can be recursively derived from the parameters (zk, ηk, ξk, αk)k∈N:

(3.7)    Ak+1 = (1 − αk)Ak,
         νk+1 = νk − (αk/((1 − αk)Ak)) ξk+1,
         (ϕk+1)∗ = (1 − αk)(ϕk)∗ + αk(F(zk+1) + ⟨νk − zk+1, ξk+1⟩ − ηk) − (αk²/(2Ak+1))‖ξk+1‖².

Next, it remains to generate a sequence (xk)k∈N satisfying inequality (3.2) and to study the asymptotic behavior of βk. To this aim we recall two lemmas, whose proofs are provided in [52], that will be essential in the derivation of the algorithm.
Lemma 3.3. Suppose for some k ∈ N, ϕk is defined as in (3.6) and ϕk+1 according to (3.4) with ξk+1 ∈ ∂ηk F(zk+1). If xk ∈ H satisfies F(xk) ≤ (ϕk)∗ + δk for some δk ≥ 0, then, setting yk = (1 − αk)xk + αkνk, for any λ > 0 it holds

(1 − αk)δk + ηk + (ϕk+1)∗ ≥ F(zk+1) + (λ/2)(2 − αk²/(Ak+1λ))‖ξk+1‖² + ⟨yk − (λξk+1 + zk+1), ξk+1⟩.

Lemma 3.4. Given the sequence (λk)k∈N, λk ≥ λ > 0, and A > 0, a, b > 0, a ≤ b, define (Ak)k∈N and (αk)k∈N recursively, such that A0 = A and for k ∈ N

αk ∈ [0, 1), with a ≤ αk²/((1 − αk)Akλk) ≤ b,
Ak+1 = (1 − αk)Ak.

Then, the sequence defined by setting βk := ∏_{i=0}^{k−1}(1 − αi) satisfies βk = O(1/k²). Moreover, if (λk)k∈N is also bounded from above, βk ∼ 1/k².


4. Derivation of the general algorithm. In this section, we show how the
mechanism of estimate sequences can be used to generate an inexact version of accel-
erated forward-backward algorithms. A general theorem of convergence will also be
provided.
We shall assume both the hypotheses (H1) and (H2), given in the introduction,
to be satisfied. The following lemma will enable us to build an appropriate estimate
sequence.
Lemma 4.1. For any x, y ∈ H, z ∈ dom g, ε ≥ 0, and ζ ∈ ∂εg(z) it holds

(4.1)    F(x) ≥ F(z) + ⟨x − z, ∇f(y) + ζ⟩ − (L/2)‖z − y‖² − ε.

In other words, ∇f(y) + ζ ∈ ∂ηF(z), with η = (L/2)‖z − y‖² + ε.
Proof. Fix x, y, z ∈ H. Since ∇f is L-Lipschitz continuous, it holds

(4.2)    f(y) ≥ f(z) − ⟨z − y, ∇f(y)⟩ − (L/2)‖z − y‖².

On the other hand, f being convex, we have f(x) ≥ f(y) + ⟨x − y, ∇f(y)⟩, which combined with (4.2) gives

(4.3)    f(x) ≥ f(z) + ⟨x − z, ∇f(y)⟩ − (L/2)‖z − y‖².

Since g is convex and ζ ∈ ∂εg(z), we have g(x) ≥ g(z) + ⟨x − z, ζ⟩ − ε, which summed with (4.3) gives the statement.
Combining Lemma 4.1 with Lemma 3.3, we derive the following result.
Lemma 4.2. Suppose for some k ∈ N, ϕk is defined as in (3.6) and xk ∈ H satisfies F(xk) ≤ (ϕk)∗ + δk for some δk ≥ 0. Set yk = (1 − αk)xk + αkνk. For any εk ≥ 0, λk > 0, let

xk+1 ≈εk proxλk g(yk − λk∇f(yk)),    ξk+1 = (yk − xk+1)/λk,    ηk = (L/2)‖yk − xk+1‖² + εk²/(2λk).

Then ξk+1 ∈ ∂ηk F(xk+1), and if ϕk+1 is defined according to (3.4) with zk+1 = xk+1,

(4.4)    (1 − αk)δk + εk²/(2λk) + (ϕk+1)∗ ≥ F(xk+1) + (λk/2)(2 − λkL − αk²/((1 − αk)Akλk))‖ξk+1‖².

Proof. Recalling Definition 2.1, since xk+1 ≈εk proxλk g(yk − λk∇f(yk)), we have

(4.5)    ζk+1 := (yk − xk+1)/λk − ∇f(yk) ∈ ∂_{εk²/(2λk)} g(xk+1).

Therefore Lemma 4.1 gives ξk+1 = ∇f(yk) + ζk+1 ∈ ∂ηk F(xk+1), and Lemma 3.3 gives

(4.6)    (1 − αk)δk + εk²/(2λk) + (ϕk+1)∗ ≥ F(xk+1) + (λk/2)(2 − αk²/((1 − αk)Akλk))‖ξk+1‖²
                                            + ⟨yk − (λkξk+1 + xk+1), ξk+1⟩ − (L/2)‖yk − xk+1‖².

Now, since yk = λkξk+1 + xk+1, the scalar product on the right-hand side of (4.6) is zero, and (4.4) follows.
We are now ready to define a general accelerated and inexact forward-backward splitting (AIFB) algorithm and to prove its convergence rate.
Theorem 4.3. For fixed numbers t0 > 1, a ∈ ]0, 2[, sequences of parameters (λk)k∈N, λk ∈ ]0, (2 − a)/L], and (ak)k∈N such that a ≤ ak ≤ 2 − λkL, and a sequence of errors (εk)k∈N with εk ≥ 0, we choose x0 = y0 ∈ dom g and for every k ∈ N we recursively define

(AIFB)    tk+1 = (1 + √(1 + 4(akλk)tk²/(ak+1λk+1)))/2,
          xk+1 ≈εk proxλk g(yk − λk∇f(yk)),
          yk+1 = xk+1 + ((tk − 1)/tk+1)(xk+1 − xk) + (1 − ak)(tk/tk+1)(yk − xk+1).

Then, setting zk+1 = xk+1, ξk+1 = (yk − xk+1)/λk, ηk = (L/2)‖yk − xk+1‖² + εk²/(2λk), and αk = 1/tk, the sequence (ϕk)k∈N defined according to (3.4) starting from ϕ0 = F(x0) + (A0/2)‖· − x0‖², with A0 = 1/(t0(t0 − 1)a0λ0), is an estimate sequence for F and it holds

(4.7)    δk+1 + (ϕk+1)∗ ≥ F(xk+1) + (ck/(2λk))‖yk − xk+1‖²,

with ck = 2 − λkL − ak ≥ 0, δ0 = 0, and δk+1 = (1 − αk)δk + εk²/(2λk).


Proof. Let k ∈ N. From the definition of tk+1, it follows that

(4.8)    tk+1² − tk+1 − (λkak/(λk+1ak+1)) tk² = 0.

Since αk = 1/tk ∈ (0, 1), then, from (4.8), we have (1 − αk+1)ak+1λk+1/αk+1² = akλk/αk², and hence

(4.9)    (1 − αk) αk²/((1 − αk)akλk) = αk+1²/((1 − αk+1)ak+1λk+1).

If we set Ak = αk²/[(1 − αk)akλk], (4.9) turns into Ak(1 − αk) = Ak+1 as in (3.7), and the inequality ak ≤ 2 − λkL gives

(4.10)    αk²/((1 − αk)Akλk) + λkL ≤ 2.

Next, the update of yk+1 in (AIFB) can be written as

(4.11)    yk+1 = (1 − αk+1)xk+1 + (αk+1/αk)[yk − (1 − αk)xk] − ak(αk+1/αk)(yk − xk+1).

Therefore, setting νk = (yk − (1 − αk)xk)/αk for every k ∈ N, we have

(4.12)    αkνk = yk − (1 − αk)xk,    αk+1νk+1 = yk+1 − (1 − αk+1)xk+1,

and hence, substituting into (4.11) and recalling the definition of Ak, we get

(4.13)    νk+1 = νk − (αk/((1 − αk)Akλk))(yk − xk+1).

If we set ξk+1 = (yk − xk+1)/λk, (4.13) becomes as in (3.7). Now define (ϕk)k∈N according to (3.4) using the parameters (xk, ηk, ξk, αk)k∈N and starting from ϕ0 = F(x0) + (A0/2)‖· − x0‖². Then ϕk = (ϕk)∗ + (Ak/2)‖· − νk‖² for every k ∈ N and we have δ0 + (ϕ0)∗ ≥ F(x0). Reasoning by induction, and using Lemma 4.2, we obtain ξk+1 ∈ ∂ηk F(xk+1) and (4.7). Finally note that, since by assumption and (4.10) a ≤ αk²/((1 − αk)Akλk) ≤ 2, Lemma 3.4 ensures that βk = ∏_{i=0}^{k−1}(1 − αi) tends to 0.
Remark 3 (retrieving FISTA [7]). In the initialization step of (AIFB), we are allowed to choose t0 = 1, provided a0 = 1. Indeed, as one can easily check, with these choices we get t1 > 1 and y1 = x1. Therefore the sequences continue as if they started from (t1, x1, y1). This shows that algorithm (AIFB) includes FISTA by choosing ak = 1 and λk = λ ≤ 1/L, starting with t0 = 1. Moreover, for f = 0 and ak = 2, we also obtain the proximal point algorithm given in the appendix of [30].


Remark 4. (AIFB) can be equivalently written in terms of αk = 1/tk. This leads to a generalization of the formulation given in [61, equations (34)–(36)]. In this case αk updates as follows:

(4.14)    αk+1 = (1/2)( √( ((ak+1λk+1)/(akλk))² αk⁴ + 4 ((ak+1λk+1)/(akλk)) αk² ) − ((ak+1λk+1)/(akλk)) αk² ).

We next consider convergence.
Theorem 4.4. Consider the AIFB algorithm for (λk)k∈N, λk ∈ [λ, (2 − a)/L], with λ ∈ ]0, 2/L[, a ∈ (0, 2 − λL], and (ak)k∈N, a ≤ ak ≤ 2 − λkL.
Then, if εk = O(1/k^q) with q > 1/2, the sequence (xk)k∈N is minimizing for F, and if the infimum of F is attained, the following bounds on the rate of convergence hold true:

F(xk) − F∗ = O(1/k²)                        if q > 3/2,
F(xk) − F∗ = O(1/k²) + O(log k / k²)        if q = 3/2,
F(xk) − F∗ = O(1/k²) + O(1/k^{2q−1})        if q < 3/2.

Proof. By Theorems 3.2 and 4.3, it is enough to study the asymptotic behavior of the sequences βk and δk. Since λk ∈ [λ, (2 − a)/L], by Lemma 3.4, βk ∼ 1/k². Concerning the structure of the error term δk, it is easy to prove (see Lemma 3.3 in [30]) that the solution of the difference equation δk+1 = (1 − αk)δk + εk²/(2λk), obtained in Theorem 4.3, is given by

(4.15)    δk = (βk/2) ∑_{i=0}^{k−1} εi²/(λiβi+1).

Hence the statement follows as in [52, Theorem 4.8].


The rates of convergence given in Theorem 4.4 hold for the function values and not for the iterates, as is usual for accelerated schemes [7, 61]. In particular, we proved that the proposed algorithm shares the convergence rate of the exact one if the errors εk in the computation of the proximity operator in (1.1) decay as 1/k^q with q > 3/2. We underline that summability of the errors is not required to get convergence, which is guaranteed for q > 1/2. If the infimum is not achieved, it is not possible to get a convergence rate for F(xk) − F∗, but inequality (3.3) ensures that a solution within accuracy σ requires O(1/√σ) iterations if q > 3/2 and O(1/σ^{1/(2q−1)}) if 1/2 < q < 3/2. We finally point out that the results given in Theorem 4.4 provide lower bounds for the convergence rates of the AIFB algorithm, meaning that faster empirical rates might be observed for particular instances of problem (P).
Remark 5 (backtracking step size rule). As in other forward-backward splitting schemes, the above procedure requires explicit knowledge of the Lipschitz constant of ∇f. Often in practice, especially for large scale problems, computing L might be too demanding. For this reason, variants of forward-backward splitting algorithms which avoid the computation of L have been proposed [45, 7]. They add a finite subroutine, called a backtracking procedure, without affecting the convergence rate. We remark that a proper backtracking can be added to AIFB as well.
5. Study of the global nested algorithm. In this section we consider the entire two-loop algorithm that results from the composition of AIFB with an inner algorithm which computes the proximity operator.


5.1. Computing admissible approximations. We first cope with the computation of solutions of the subproblem

(5.1)    z ≈ε proxλg(y)

required by the proposed algorithm at each iteration. There are various possibilities
for solving problem (5.1). In [20, 19] a bundle algorithm returning an element z ∈ H satisfying (5.1) is provided, and convergence in a finite number of steps is proved when g is Lipschitz continuous over bounded sets (see Algorithm 6.1 and Proposition 6.1 in [19]). As in section 2, we consider the case of g(x) = ω(Bx). Propositions 2.2 and 2.3 state that for finding solutions of problem (5.1) it is sufficient to minimize the duality gap. Indeed, if v ∈ G is such that

(5.2)    G(y − λB∗v, v) ≤ ε²/(2λ),

then z = y − λB∗v solves problem (5.1). It is evident that condition (5.2) can be explicitly checked in practice. In the following, using the same notation as section 2, we show that each algorithm that produces a minimizing sequence for the dual function Ψλ yields a corresponding convergent sequence for the primal and, if ω is continuous on the entire G, a minimizing sequence for the duality gap as well.
Theorem 5.1. Let dom ω = G, v be a solution of the dual problem min Ψλ, and (vn)n∈N be a minimizing sequence for Ψλ. Let z = y − λB∗v be the solution of the primal problem (2.8), and set zn = y − λB∗vn. Then it holds

zn → z,    G(zn, vn) → 0.

Moreover, if Ψλ(vn) − Ψλ(v) = O(1/n^{2p}) for some p > 0, we have

(5.3)    ‖zn − z‖ = O(1/n^p),    G(zn, vn) = O(1/n^p).
Proof. We claim that

(5.4)    (1/(2λ))‖zn − z‖² ≤ Ψλ(vn) − Ψλ(v).

To prove (5.4), first note that

(5.5)    (1/(2λ))‖λB∗vn − y‖² − (1/(2λ))‖λB∗v − y‖² + ⟨Bz, vn − v⟩
       = (1/(2λ))⟨λB∗(vn + v) − 2y, λB∗(vn − v)⟩ + (1/(2λ))⟨2(y − λB∗v), λB∗(vn − v)⟩
       = (1/(2λ))⟨λB∗(vn + v) − 2λB∗v, λB∗(vn − v)⟩
       = (1/(2λ))‖λB∗(vn − v)‖².

Writing the first-order optimality conditions for v, we have that 0 ∈ B(λB∗v − y) + ∂ω∗(v) or, equivalently, that the primal solution z satisfies Bz ∈ ∂ω∗(v), which implies ω∗(vn) − ω∗(v) − ⟨Bz, vn − v⟩ ≥ 0. Summing the last inequality with (5.5), we get

Ψλ(vn) − Ψλ(v) = (1/(2λ))‖λB∗vn − y‖² − (1/(2λ))‖λB∗v − y‖² + ω∗(vn) − ω∗(v)
              ≥ (1/(2λ))‖λB∗(vn − v)‖²
              = (1/(2λ))‖zn − z‖².

Since dom ω = G, ω is continuous on G and hence Φλ is continuous on H. Therefore Φλ(zn) → Φλ(z). This implies, since Φλ(z) = −Ψλ(v),

G(zn, vn) = Φλ(zn) + Ψλ(vn) → Φλ(z) + Ψλ(v) = 0.

Now suppose that Ψλ(vn) − Ψλ(v) = O(1/n^{2p}). Then, the first part of statement (5.3) directly follows from (5.4). Regarding the rate on the duality gap, note that the function Φλ is Lipschitz continuous on bounded sets, being convex and continuous. Thus there exists L1 > 0 such that

Φλ(zn) − Φλ(z) ≤ L1‖zn − z‖ ≤ L1 √(2λ) (Ψλ(vn) − Ψλ(v))^{1/2}.

This shows that the convergence rate stated for the duality gap in (5.3) holds.
In order to compute admissible approximations of the proximal point, we can choose any minimizing algorithm for the dual problem. A simple choice is the forward-backward splitting algorithm (also called ISTA [7]). Since for this choice Ψλ(vn) − Ψλ(v) = O(1/n), this gives the rate G(zn, vn) = O(1/√n) for the duality gap. We remark that the pair of sequences (y − λB∗vn, vn) corresponds exactly to the pair (xn, yn) generated by the primal-dual Algorithm 1 proposed in [14] when applied to the minimization of Φλ(x) = g(x) + ‖x − y‖²/(2λ) (τ = λ, θ = 1).
A more efficient choice is FISTA, resulting in the rate G(zn, vn) = O(1/n). The latter will be our choice in the numerical section. For the case of ω positively homogeneous (e.g., total variation), it holds ω∗ = δS, with S = ∂ω(0), and the corresponding dual minimization problem min Ψλ becomes a constrained smooth optimization problem. Then, FISTA reduces to an accelerated projected gradient descent algorithm

(5.6)    vn+1 = PS(un − (γn/λ)B(λB∗un − y)),    0 < γn ≤ 1/‖B‖²,
         un+1 = vn+1 + ((tn − 1)/tn+1)(vn+1 − vn),

with the usual choices for tn (see Remark 3). We note that in this case Propositions 2.2 and 2.3 ensure that problem (5.1) is equivalent to (5.2).
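Here is a minimal Python sketch of the inner loop (5.6), stopped with the duality-gap test (5.2); B, Bt (the adjoint), the projection P_S, and omega are supplied by the caller, and 0 ∈ S is assumed (true when ω is a norm). This is an illustration, not the authors' MATLAB code.

```python
import numpy as np

def inexact_prox(y, lam, eps, B, Bt, P_S, omega, normB2, max_iter=10000):
    """Sketch of (5.6): accelerated projected gradient on the dual of (2.8),
    stopped by criterion (5.2), so the returned point is an eps-approximation
    of prox_{lam g}(y), g = omega(B .).  Assumes omega positively homogeneous,
    so omega* = indicator of S and omega*(v) = 0 for the projected iterates."""
    v = np.zeros_like(B(y))       # dual variable, feasible since 0 in S
    u = v.copy()
    t = 1.0
    gamma = 1.0 / normB2          # step size bound from (5.6)
    y2 = np.sum(y ** 2)
    for _ in range(max_iter):
        v_new = P_S(u - (gamma / lam) * B(lam * Bt(u) - y))
        t_new = (1 + np.sqrt(1 + 4 * t ** 2)) / 2
        u = v_new + ((t - 1) / t_new) * (v_new - v)
        v, t = v_new, t_new
        z = y - lam * Bt(v)       # primal point, cf. Proposition 2.2
        gap = (omega(B(z)) + np.sum((z - y) ** 2) / (2 * lam)          # Phi
               + (np.sum((lam * Bt(v) - y) ** 2) - y2) / (2 * lam))    # Psi
        if gap <= eps ** 2 / (2 * lam):   # criterion (5.2)
            break
    return y - lam * Bt(v)
```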
Remark 6. We highlight that the results in Theorem 5.1 hold for the more general setting of a minimization problem of the form

(5.7)    min_{x∈X} ω(Bx) + ϕ(x),

where dom ϕ = X and ϕ is c-strongly convex and differentiable with L-Lipschitz continuous gradient.² Indeed, in this case one has z = ∇ϕ∗(−B∗v), zn = ∇ϕ∗(−B∗vn), and the strong convexity of ϕ∗ allows one to get the analogous bound of (5.4),

(c²/(2L))‖zn − z‖² ≤ Ψ(vn) − Ψ(v),

where Ψ is the dual of (5.7).

²This is equivalent to requiring ϕ∗ strongly convex and differentiable with Lipschitz continuous gradient. See Theorems 4.2.1 and 4.2.2 in Chapter 4 of [32].


5.2. Global iteration complexity of the algorithm. Each iteration of AIFB consists of a gradient descent step, which we refer to as an external iteration, and an inner loop, which approximates the proximity operator of g up to a precision εk. Theorem 5.1 proves that using FISTA to solve the dual problem guarantees G(zn, vn) ≤ D/n for a constant D > 0. This shows that ⌈2λD/ε²⌉ iterations suffice to get a solution of problem (5.1). We note that, under the additional hypotheses ω∗(v)/‖v‖ → +∞ and γn constant, the same number of iterations is sufficient to get the same convergence rate for the gap, using the sequences of ergodic means computed via Algorithm 1 proposed in [14]. On the other hand, the algorithm provided in [19] reaches the same goal in O(1/ε⁴) iterations.
In general, given an (internal) algorithm that solves problem (5.1) in at most

(5.8)    Dλ/ε^{2/p},    p > 0,
iterations,³ we can bound the total iteration complexity of the AIFB algorithm. From Theorem 4.4, if we let εk := 1/k^q and take k ≥ Ne, with

Ne := ⌈(C/ε)^{1/(2q−1)}⌉ if 1/2 < q < 3/2,    Ne := ⌈(C/ε)^{1/2}⌉ if q > 3/2,

we have F(xk) − F∗ ≤ ε, where C > 0 is the constant masked in the rates given in Theorem 4.4. Now for each k ≤ Ne, from the hypothesis (5.8) on the complexity of the internal algorithm, one needs at most Dλk/εk^{2/p} = Dλk k^{2q/p} internal iterations to get an approximate proximal point xk+1 in (AIFB) with precision εk = 1/k^q. Summing all the internal iterations from 1 to Ne, and if λk ≤ λ, we have

Ni = ∑_{k=1}^{Ne} Dλk k^{2q/p} ≤ Dλ ∫₀^{Ne} t^{2q/p} dt = (Dλ/(2q/p + 1)) Ne^{2q/p+1},

and hence

Ni = O(1/ε^{(2q/p+1)/(2q−1)}) if 1/2 < q < 3/2,    Ni = O(1/ε^{(2q/p+1)/2}) if q > 3/2.

Adding the costs of internal and external iterations together, we derive the following proposition.
Proposition 5.2. Suppose problem (5.1) is solved in at most Dλ/ε^{2/p} iterations, for some constants p > 0 and D > 0. Then, the global iteration complexity Cg of (AIFB) plus the inner algorithm is

(5.9)    Cg = ciNi + ceNe = O(1/ε^{(2q/p+1)/(2q−1)}) + O(1/ε^{1/(2q−1)}) if 1/2 < q < 3/2,
         Cg = ciNi + ceNe = O(1/ε^{(2q/p+1)/2}) + O(1/ε^{1/2}) if q > 3/2,

where ci and ce denote the unitary costs of each type of iteration.


³The constant D in general depends on the starting point and the problem solution set, and in the end on y. If dom ω∗ is bounded, D can be chosen independently of y, since for most algorithms it is majorized by diam(dom ω∗).


From the estimates above, one can easily see that, in each case, the lowest global complexity is reached for q → 3/2 and it is

Cg = O(1/ε^{(p+3)/(2p)+δ})

for arbitrarily small δ > 0. For p = 1, as in the case of algorithm (5.6), one obtains a complexity of O(1/ε^{2+δ}). For p = 1/2, which corresponds to the rate of the algorithm studied in [19], we have a global complexity of O(1/ε^{7/2+δ}). We finally note that for p → +∞ we have a complexity of O(1/ε^{1/2+δ}), with δ arbitrarily small. In other words, the algorithm behaves as an accelerated method.
We remark that the analysis of the global complexity given above is valid only asymptotically, since we did not estimate any of the constants hidden in the O symbols. However, in real situations constants do matter and, in practice, the most effective accuracy rate q is problem dependent and might be different from 3/2, as we illustrate in the experiments of subsection 6.3.
6. Numerical experiments. In this section, we present two types of experiments. The first is designed to illustrate the influence of the errors on the behavior of AIFB and on its nonaccelerated counterpart IFB (called ISTA in [7]). The second is meant to measure the performance of the two-loop algorithm AIFB + algorithm (5.6), in comparison with IFB + algorithm (5.6) and with the primal-dual algorithm proposed in [14].
6.1. Experimental set-up. In all the following cases, we consider the regularized least-squares functional

(6.1)    F(x) := (1/2)‖Ax − y‖²_Y + g(x),

where H, Y are Euclidean spaces, x ∈ H, y ∈ Y, A : H → Y is a linear operator, and g : H → R̄ is of type (2.7). In all cases ω will be a norm, and the projection onto S = ∂ω(0) will be explicitly computable.
We minimize F using AIFB, with λk = λ = 1/L, where L = ‖A∗A‖. We use ak = 1 (corresponding to FISTA), since we empirically observed that the choice of ak, if independent of k, does not significantly influence the speed of convergence of the algorithm (although preliminary tests revealed a slightly better performance for ak = 0.8). At each iteration, we employ algorithm (5.6) to approximate the proximity operator of g up to a precision εk. The stopping rule for the inner algorithm is given by the duality gap, according to Proposition 2.3, item (a). Following Theorem 4.4, we consider sequences of errors of type εk = C/k^q, with q, hereafter referred to as the accuracy rate, chosen between 0.1 and 1.7. The coefficient C should be comparable to the magnitude of the duality gap. In fact, it determines the practical constraint on the duality gap at the first iterations: the constraint should be active, but not too demanding, to avoid requiring unnecessary precision. We choose C by solving the equation G(y0 − λ∇f(y0), 0) = C²/(2λ), where G is the duality gap corresponding to the first proximal subproblem encountered in AIFB for k = 0, evaluated at v0 = 0. We finally consider an "exact" version, obtained by solving the proximal subproblems at machine precision.
We analyze two well-known problems: deblurring with total variation regulariza-
tion and learning a linear estimator via regularized empirical risk minimization with
the overlapping group lasso penalty. The numerical experiments are divided into two
parts. In the first one, we evaluate the impact of the errors on the convergence rate

of AIFB and the (nonaccelerated) forward-backward splitting (here denoted IFB). We plot the relative objective values (F(xk) − F∗)/F∗ against the number of external iterations for different accuracy rates on the error. We underline that this study is independent of the algorithm chosen to produce an admissible approximation of the proximal points.
In the second part, we assess the overall behavior of the two-loop algorithm, as described in section 5, using algorithm (5.6) to solve the proximal subproblems. We compare it with the nonaccelerated version (IFB) and with the primal-dual (PRIDU) algorithm proposed by [14] for image deconvolution. For all algorithms we provide the CPU time and the number of external and internal iterations for different precisions. Note that the cost of each external iteration lies mainly in the evaluation of the gradient of the quadratic part of the objective function (6.1). The internal iteration has a similar form but, since the matrix B is sparse and structured in both experiments, it can be implemented in a fast way. All the numerical experiments have been performed in the MATLAB environment,⁴ on a desktop iMac with an Intel Core i5 CPU, 2.5 GHz, 6 MB L3 cache, and 6 GB of RAM.
6.1.1. Deblurring with total variation. Regularization with total variation [50, 12, 6] is a widely used technique for deblurring and denoising images that preserves sharp edges.
In this problem, H = Y = R^{N×N} is the space of (discrete two-dimensional) images on the grid [1, N]², A is a linear map representing some blurring operator [6], and y is the observed noisy and blurred datum. The (discrete) total variation regularizer is defined as

g = ω ∘ ∇,    g(x) = τ ∑_{i,j=1}^{N} ‖(∇x)_{i,j}‖₂,

where ∇ : H → H² is the (discrete) gradient operator (see [12] for the precise definition) and ω : H² → R, ω(p) = τ ∑_{i,j=1}^{N} ‖p_{i,j}‖₂, with τ > 0 a regularization parameter and ‖·‖₂ the Euclidean norm in R². Note that the matrix corresponding to ∇ is highly sparse (it is bidiagonal). This feature has been taken into account to get an efficient implementation.
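A small Python sketch of this regularizer, assuming the usual forward-difference discretization with Neumann boundary conditions (an assumption on our part; see [12] for the exact definition used in the paper), together with the projection onto S = ∂ω(0) needed by the inner loop (5.6):

```python
import numpy as np

def grad2d(x):
    """Forward differences with Neumann boundary: returns stacked (dx, dy)."""
    dx = np.zeros_like(x); dy = np.zeros_like(x)
    dx[:-1, :] = x[1:, :] - x[:-1, :]
    dy[:, :-1] = x[:, 1:] - x[:, :-1]
    return np.stack([dx, dy])

def tv(x, tau):
    """Isotropic total variation: tau * sum_ij ||(grad x)_ij||_2."""
    p = grad2d(x)
    return tau * np.sqrt(p[0] ** 2 + p[1] ** 2).sum()

def P_S(p, tau):
    """Projection onto S = {p : ||p_ij||_2 <= tau}, used by algorithm (5.6)."""
    norms = np.maximum(np.sqrt(p[0] ** 2 + p[1] ** 2) / tau, 1.0)
    return p / norms
```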
We followed the same experimental setup as in [6]. We considered the 256 × 256
Lena test image, blurred by a 9 × 9 Gaussian blur with standard deviation 4, followed
by additive normal noise with zero mean and standard deviation 10−3 . The regular-
ization parameter τ was set to 10−3 . Since the blurring operator A is a convolution
operator, in the implementation it is common to evaluate it by an FFT-based method
(see, e.g., [6, 14]).
6.1.2. Overlapping group lasso. The group lasso penalty is a regularization term for ill-posed inverse problems arising in statistical learning [64, 33] and in image processing and compressed sensing [46], enforcing structured sparsity in the solutions. Regularization with this penalty consists in solving a problem of the form (6.1), where H = R^p, Y = R^m, A is a data or design matrix, and y is a vector of outputs or measurements. Following [33], the overlapping group lasso (OGL) penalty is

(6.2)    g(x) = τ ∑_{i=1}^{r} ( ∑_{j∈Ji} (w_j^i)² x_j² )^{1/2},

⁴The code is available upon request to the authors.


where J = {J1, . . . , Jr} is a collection of overlapping groups of indices such that ∪_{i=1}^{r} Ji = {1, . . . , p}. The weights w_j^i are defined as

w_j^i = (1/2)^{a_j^i},    with a_j^i = #{J ∈ J : j ∈ J, J ⊂ Ji, J ≠ Ji}.

This penalty can be written as ω ∘ B, with B = (B1, . . . , Br) : R^p → R^{J1} × · · · × R^{Jr},

Bi : R^p → R^{Ji},    Bi x = (w_j^i x_j)_{j∈Ji},

and ω : R^{J1} × · · · × R^{Jr} → R, ω(v1, . . . , vr) = τ ∑_{i=1}^{r} ‖vi‖₂, where ‖·‖₂ is the Euclidean norm in R^{Ji}.
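A minimal Python sketch assembling the (sparse) block operator B and evaluating the OGL penalty (6.2); the helper names are hypothetical, and the weight computation simply follows the definition above.

```python
import numpy as np
from scipy.sparse import lil_matrix

def build_ogl_operator(groups, p):
    """Assemble sparse blocks B_1..B_r of (6.2) for index groups J_1..J_r.
    Weight w_j^i = (1/2)**a_j^i, where a_j^i counts the groups strictly
    contained in J_i that contain j (per the definition above)."""
    sets = [frozenset(J) for J in groups]
    blocks = []
    for i, Ji in enumerate(sets):
        Bi = lil_matrix((len(groups[i]), p))
        for k, j in enumerate(groups[i]):
            a = sum(1 for J in sets if j in J and J < Ji)  # proper subsets
            Bi[k, j] = 0.5 ** a
        blocks.append(Bi.tocsr())
    return blocks  # apply blockwise: B x = [Bi @ x for Bi in blocks]

def ogl_penalty(blocks, x, tau):
    """g(x) = tau * sum_i ||B_i x||_2, cf. (6.2)."""
    return tau * sum(np.linalg.norm(Bi @ x) for Bi in blocks)
```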
The matrix A and the datum y are generated from the breast cancer dataset provided by [62]. The dataset consists of expression data for 8,141 genes in 295 breast cancer tumors (78 metastatic and 217 nonmetastatic). The groups are defined according to the canonical pathways from MSigDB [60], which contains 639 groups of genes, 637 of which involve genes from the breast cancer dataset. We restrict the analysis to the 3,510 genes that are contained in at least one group. Hence, our data matrix A consists of 295 different expression levels of 3,510 genes. The output vector y contains the labels (±1, metastatic or nonmetastatic) for each sample. The structure of the overlapping groups gives rise to a matrix B of size 15,126 × 3,510. Despite the high dimensionality, one can take advantage of its sparseness. We analyze two choices of the regularization parameter: τ = 0.01 and τ = 0.1.
6.2. Results—Part I. We run AIFB and its nonaccelerated counterpart, IFB,
for up to 2,000 external iterations. With the aim of maximizing the effect of inexactness,
we require algorithm (5.6) to produce solutions with errors close to the upper bounds
$\varepsilon_k^2/2\lambda$ prescribed by the theory. We achieve this by reducing the internal step-size
length $\gamma_n$ and using cold restart, i.e., initializing algorithm (5.6) at each step with
$v_0 = 0$.
As a reference optimal value, $F_*$, we use the value found after 10,000 iterations
of AIFB with error rate q = 1.7.
As shown in Figure 6.1, the empirical convergence rate of $(F(x_k) - F_*)/F_*$ is
indeed affected by the accuracy rate q: smaller values of q correspond to slower
convergence rates, both for AIFB and for the inexact (nonaccelerated) forward-backward
algorithm. When the errors in the computation of the proximity operator do not
decay fast enough, the convergence rates deteriorate markedly and the algorithms
may even fail to converge to the infimum. If the errors decay sufficiently fast, AIFB
converges faster than IFB in both experiments. In contrast, this is not
true for accuracy rates q < 1, where IFB behaves practically the same as AIFB.
Moreover, it turns out that AIFB is more sensitive to errors than IFB. This is
more evident in the TV deblurring experiment. Indeed, for AIFB most curves
corresponding to the different accuracy rates are well separated, while for IFB they
are closer to each other, and often completely overlapped. Yet, the overlapping
phenomenon in general starts earlier (at lower q) for IFB than for AIFB, indicating that no gain
is obtained by increasing the accuracy rate beyond a certain level, in accordance
with the theoretical results.
Fig. 6.1. Impact of the errors on AIFB and IFB. Log-log plots of relative objective value versus
external iterations k, obtained for TV deblurring (upper row) and the OGL problem with regularization
parameter τ = 10−1 (bottom row). AIFB and inexact IFB, for different accuracy rates q in the
computation of the proximity operator, are shown in the left and right columns, respectively. For
larger values of the parameter q the curves overlap. Visual inspection shows that the errors affect
the acceleration.

6.3. Results—Part II. This section is the empirical counterpart of subsection 5.2. Here, we test the global iteration complexity of AIFB and inexact IFB
combined with algorithm (5.6) on the two problems described above. We provide the
number of external iterations and the total number of inner iterations.
When taking into account the cost of computing the proximity operator, there is a trade-off between
the number of external and internal iterations. Since internal and external iterations
in general have different computational costs, which depend on the specific problem
considered and on the machine CPU, the total number of iterations is not a good
measure of the algorithm's performance. For instance, on our computer, the ratio between
the cost of an external and an internal iteration is about 2.15 in the TV deblurring problem and
2.5 in the OGL problem. Therefore, we also report the CPU time needed to reach
a desired accuracy for the relative difference from the optimal value. In this part,
we use the warm-restart procedure, which consists in initializing algorithm (5.6) with the
solution obtained at the previous step. We empirically observed that this initialization
strategy drastically reduces the total number of iterations and speeds up the
algorithm.
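To illustrate the two restart strategies, the following self-contained toy sketch runs a nonaccelerated inexact forward-backward loop in which the prox of a simple penalty $\tau\|Bx\|_1$ is approximated by projected-gradient steps on a dual variable. The problem instance, the penalty, and the schedule linking the number of inner steps to the accuracy are illustrative assumptions only; warm restart amounts to reusing the dual variable v across outer iterations, while cold restart resets it to zero.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 50
A = rng.standard_normal((40, p))
yobs = rng.standard_normal(40)
B = np.diff(np.eye(p), axis=0)             # a simple sparse-structured operator
tau = 0.1
lam = 1.0 / np.linalg.norm(A, 2) ** 2      # forward step size 1/L
nB2 = np.linalg.norm(B, 2) ** 2

def grad_f(x):
    return A.T @ (A @ x - yobs)

def inexact_prox(z, v0, n_inner):
    # Approximate prox of lam*tau*||B . ||_1 at z by projected-gradient
    # steps on the dual variable v, then return the primal point and v.
    v = v0.copy()
    for _ in range(n_inner):
        v = np.clip(v + (B @ (z - B.T @ v)) / nB2, -lam * tau, lam * tau)
    return z - B.T @ v, v

x = np.zeros(p)
v = np.zeros(B.shape[0])                   # warm-restart memory
warm = True
for k in range(1, 201):
    z = x - lam * grad_f(x)
    v0 = v if warm else np.zeros_like(v)   # warm vs. cold restart of the inner solver
    x, v = inexact_prox(z, v0, int(np.ceil(k ** 1.3)))  # inner work grows with k
```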
We compare AIFB and IFB with PRIDU, taken as a benchmark since it often
outperforms state-of-the-art methods, in particular for TV regularization (see the
numerical section in [14]).
Algorithm PRIDU depends on two parameters5 $\sigma, \rho > 0$. In our experiments, we
tested two choices, indicated by the authors (in the paper and in the code) for the
image deblurring and denoising problem: $\sigma = 10$ and $\rho = 1/(\sigma\|B\|^2)$, and $\rho = 0.01$
(corresponding to $\sigma = 1/(\rho\|B\|^2) = 12.5$ for the TV problem and $\sigma \approx 1.07$ for the
OGL problem). We also implemented the algorithm for the OGL problem and, after
preliminary tests, the same choices of the parameters turned out to be appropriate
there too.
5 Denoted σ and τ in [14].
On the other hand, AIFB and IFB depend on the accuracy rate q. We verified
that the best empirical results are obtained by choosing q in the range [1, 1.5] for AIFB
and [0.1, 0.5] for IFB. This once more confirms that the accelerated version is more
sensitive to errors than the basic one. In Tables 6.1–6.3, we detail the results
only for the most significant choices of q. We remark that the "exact" version of AIFB
(and IFB), where the prox is computed at machine precision at each step, is not even
comparable to the results reported here.

Table 6.1
Deblurring with TV regularization, τ = 10−3. Performance evaluation of AIFB, IFB, and
PRIDU, corresponding to different choices of the parameters q and σ, respectively. Concerning
AIFB and IFB, the results are reported only for the q's giving the best results. The entries in the
table refer to the CPU time (in seconds) needed to reach a relative difference w.r.t. the optimal
value below the thresholds 10−4, 10−6, and 10−8, the number of external iterations (# Ext), and
the total number of internal iterations (# Int).

Precision   |       10−4              |       10−6              |       10−8
Algo        |  Time  # Ext   # Int    |  Time   # Ext   # Int   |   Time    # Ext    # Int
AIFB        |                         |                         |
 q = 1      |  11.8    137    1062    | 124.2     905   12313   |   1750     8776   182006
 q = 1.3    |  16.2    118    1600    |  63.6     387    6437   |  272.1     1300    28350
 q = 1.5    |  26.0    117    2734    |  98.7     373   10540   |  414.5     1085    45297
IFB         |                         |                         |
 q = 0.1    |  36.9   1341    1341    | 147.2    5346    5346   |  635.4    23031    23031
 q = 0.8    |  36.9   1341    1341    | 147.2    5346    5346   |  635.4    23031    23031
 q = 1.0    |  63.2   1337    4533    | 189.9    5226   11126   |  745.1    18224    48333
PRIDU       |                         |                         |
 σ = 10     |   7.4    362       -    | 165.7    8186       -   |   4684   231848        -
 σ = 12.5   |   6.2    310       -    | 132.2    6609       -   |   3715   185588        -

Table 6.2
Breast cancer dataset: OGL, τ = 10−1. Performance evaluation of AIFB, IFB, and PRIDU,
corresponding to different choices of the parameters q and σ, respectively. Concerning AIFB and
IFB, the results are reported only for the q's giving the best results. The entries in the table refer to
the CPU time (in seconds) needed to reach a relative difference w.r.t. the optimal value below the
thresholds 10−4, 10−6, and 10−8, the number of external iterations (# Ext), and the total number
of internal iterations (# Int).

Precision   |       10−4              |       10−6              |       10−8
Algo        |  Time  # Ext   # Int    |  Time   # Ext   # Int   |   Time    # Ext    # Int
AIFB        |                         |                         |
 q = 1      |   3.9    104    3985    |  41.5     983   42239   |  414.1     9748   421769
 q = 1.3    |   2.1     51    2103    |  11.2     247   11389   |   60.4     1179    61915
 q = 1.5    |   2.8     50    2857    |  16.2     199   16945   |   61.3      548    64518
IFB         |                         |                         |
 q = 0.1    |   5.3   1675    1682    |  10.7    3421    3428   |   16.0     5124     5131
 q = 0.3    |   5.2   1613    1730    |  10.3    3246    3363   |   15.9     5065     5182
 q = 0.5    |   4.4   1217    1827    |   9.5    2850    3460   |   14.9     4603     5213
 q = 0.8    |   7.0    585    6092    |  15.5    2218   11264   |   19.8     3599    12645
 q = 1      |  12.4    535   12031    |  26.6    1236   25547   |   42.1     3606    36508
PRIDU       |                         |                         |
 σ = 10     |  10.5   2901       -    |  25.4    7040       -   |   47.4    13141        -
 σ = 1.07   |   5.8   1602       -    |  11.0    3026       -   |   16.1     4452        -

Table 6.3
Breast cancer dataset: OGL, τ = 10−2. See caption of Table 6.2.

Precision   |       10−4               |       10−6               |       10−8
Algo        |  Time   # Ext   # Int    |  Time   # Ext    # Int   |   Time    # Ext     # Int
AIFB        |                          |                          |
 q = 0.8    |  11.8     443   11392    |  74.4    2651    72109   |   1124    39699   1089732
 q = 1      |  12.1     432   11616    |  44.8    1581    43191   |  170.9     6004    164849
 q = 1.3    |  27.0     431   27311    | 126.9    1572   129708   |  502.9     4687    518492
 q = 1.5    |  62.0     431   64351    | 312.5    1572   325868   |   1303     4686   1362149
IFB         |                          |                          |
 q = 0.1    |  34.9   11125   11125    |  69.4   22111    22111   |  112.3    35782     35782
 q = 0.3    |  34.9   11125   11125    |  69.4   22111    22111   |  112.3    35782     35782
 q = 0.5    |  35.6   11124   11946    |  70.1   22109    22931   |  113.0    35781     36603
 q = 0.8    | 133.7   11095  114686    | 218.3   21883   178405   |  273.2    35781    203992
 q = 1      | 335.7   11093  348408    | 659.7   21818   643374   |  882.9    33075    851890
PRIDU       |                          |                          |
 σ = 10     |  21.8    5625       -    |  44.6   11529        -   |   82.5    21346         -
 σ = 1.07   |   4.6    1178       -    |  24.7    6407        -   |  827.5   214558         -
As concerns the TV problem, AIFB (q = 1.3 or q = 1.5) outperforms both PRIDU
and IFB at high precisions. PRIDU exhibits fast convergence at the beginning, but
its cost blows up at the higher precisions, for both choices of σ. This is
a known drawback of primal-dual algorithms with fixed step size (see, e.g., [9]).
The behavior on the OGL problem is presented for two choices of the regularization
parameter, since this choice heavily influences the results. For τ = 0.1 and precision
10−4, AIFB is the fastest. At the middle precision, all the algorithms perform
comparably. At the highest precision, PRIDU and IFB perform better. We note
the very good behavior of IFB, which is probably due to the warm-restart strategy
combined with the greater stability of IFB against errors. Finally, for the OGL problem
with τ = 0.01, AIFB still accelerates IFB at the lower precisions if q is properly tuned,
although in the end IFB wins. The PRIDU algorithm suffers from the same drawbacks
noted in the TV experiment for σ = 1.07, but exhibits an overall good performance
with σ = 10.
Summarizing, the performance of algorithm AIFB combined with (5.6) and warm
restart is comparable with that of state-of-the-art algorithms, and is sometimes better.
To this purpose, the experiments also give some guidelines for choosing the parameter q.
We also show situations where the acceleration is lost, in particular at high
precision.
Appendix A. Accelerated FB algorithms under error criterion (2.15).
We give here a discussion of the behavior of AIFB when the notion of inexactness
(2.3) is replaced with the one given in (2.15) (denoted by $\approx_{\varepsilon}$): this is the kind of error
considered in [54]. More precisely, the following theorem holds true.

Theorem A.1. Under the same hypotheses as Theorem 4.4, replace at each step the
update $x_{k+1}$ in AIFB with $x_{k+1} \approx_{\varepsilon_k} \operatorname{prox}_{\lambda_k g}(y_k - \lambda_k \nabla f(y_k))$, where $\varepsilon_k = O(1/k^q)$
and $q > 3/2$. Assuming in addition that $a_k \le 1$ and $\lambda_k L \le 1$, the following
convergence rates on the objective values hold:


$$F(x_k) - F_* = \begin{cases} O(1/k) & \text{if } q > 2, \\ O(\log^2 k / k) & \text{if } q = 2, \\ O(1/k^{2q-3}) & \text{if } q < 2. \end{cases}$$
The above rates can be obtained by relying on our techniques, and they are slower than the
ones given in Theorem 4.4. This is in line with what was obtained in [52, section 4.1].
In [54], using different techniques, the convergence rate $O(1/k^2)$ is proved.
Lemma A.2. Let $x \approx_\varepsilon \operatorname{prox}_{\lambda g}(y - \lambda \nabla f(y))$, $\lambda L \le 1$, and let $w \in H$ be such that
$\|w - y\| \le \alpha\gamma$ for some $\gamma > 0$ and $\alpha \in [0, 1)$. Then there exist $\varepsilon_1, \varepsilon_2 \ge 0$ with
$\varepsilon_1^2 + \varepsilon_2^2 \le \varepsilon^2$, and $e \in H$ with $\|e\| \le \varepsilon_2 + \alpha\gamma$ such that, if $\zeta = (w - \lambda \nabla f(w) - x - e)/\lambda$,
then $\zeta \in \partial_{\varepsilon_1^2/2\lambda}\, g(x)$.
Proof. By Lemma 1 in [52], there exist $\varepsilon_1, \varepsilon_2 \ge 0$ with $\varepsilon_1^2 + \varepsilon_2^2 \le \varepsilon^2$ and $\bar{e} \in H$
such that $\|\bar{e}\| \le \varepsilon_2$ and

$$\zeta := \frac{y - \lambda \nabla f(y) - x - \bar{e}}{\lambda} \in \partial_{\varepsilon_1^2/2\lambda}\, g(x).$$

By adding and subtracting $w - \lambda \nabla f(w)$ in the numerator, we get

$$(A.1) \qquad \zeta = \frac{w - \lambda \nabla f(w) - x - e}{\lambda} \quad \text{with} \quad e = \bar{e} + w - y + \lambda \nabla f(y) - \lambda \nabla f(w).$$

Since $\lambda L \le 1$, the Baillon–Haddad theorem [5] implies that $I - \lambda \nabla f$ is nonexpansive, and hence

$$\|e\| \le \|\bar{e}\| + \|(I - \lambda \nabla f)(w) - (I - \lambda \nabla f)(y)\| \le \varepsilon_2 + \alpha\gamma.$$
Remark 7. In the proof of Theorem 4.3, the setup of the parameters defining
the estimate sequence for AIFB does not depend on the notion of inexactness for
the proximal point. More precisely, starting from the AIFB algorithm, but with the
notion of inexact prox (2.15), the same auxiliary sequences $(\alpha_k)_{k\in\mathbb{N}}$, $(A_k)_{k\in\mathbb{N}}$, $(\nu_k)_{k\in\mathbb{N}}$
can be introduced, and all the equations (4.9)–(4.13) remain true. In particular,

$$(A.2) \qquad y_k = (1 - \alpha_k)x_k + \alpha_k\nu_k,$$
$$(A.3) \qquad \nu_{k+1} = \nu_k - \frac{\alpha_k}{(1 - \alpha_k)A_k\lambda_k}(y_k - x_{k+1}).$$

The critical point is that now we cannot argue that $\xi_{k+1} = (y_k - x_{k+1})/\lambda_k \in \partial_{\eta_k} F(x_{k+1})$
anymore, since Lemma 4.2 requires $x_{k+1}$ to be an inexact prox in the sense of (2.3).
Hence the construction of the estimate sequence cannot be finalized.
The following lemma overcomes this obstacle by introducing an estimate sequence
centered at new points $u_k$, which are "close" to the $\nu_k$. It is the analogue of
Lemma 4.2 for errors of type (2.15).
Lemma A.3. Suppose that for some $k \in \mathbb{N}$, $x_k, u_k, \nu_k \in H$, $A_k > 0$, and $\varphi_k = (\varphi_k)_* +
\frac{A_k}{2}\|\cdot - u_k\|^2$ are such that $F(x_k) \le (\varphi_k)_* + \delta_k$ and $\|\nu_k - u_k\| \le \gamma_k$ for some
$\gamma_k, \delta_k \ge 0$. Let $\lambda_k > 0$, $\alpha_k \in [0, 1)$, and assume $\frac{\alpha_k^2}{(1-\alpha_k)A_k\lambda_k} \le 1$ and $\lambda_k L \le 1$. Set
$y_k = (1-\alpha_k)x_k + \alpha_k\nu_k$, $w_k = (1-\alpha_k)x_k + \alpha_k u_k$, and $x_{k+1} \approx_{\varepsilon_k} \operatorname{prox}_{\lambda_k g}(y_k - \lambda_k\nabla f(y_k))$
for some $\varepsilon_k \ge 0$. Then there exist $e_k \in H$ and $\varepsilon_{1k}, \varepsilon_{2k} > 0$ with $\varepsilon_{1k}^2 + \varepsilon_{2k}^2 \le \varepsilon_k^2$ and $\|e_k\| \le
\varepsilon_{2k} + \alpha_k\gamma_k$ such that, if $\varphi_{k+1}$ is defined according to (3.4) with $z_{k+1} = x_{k+1}$ and

$$\xi_{k+1} = \frac{w_k - x_{k+1} - e_k}{\lambda_k}, \qquad \eta_k = \frac{L}{2}\|w_k - x_{k+1}\|^2 + \frac{\varepsilon_{1k}^2}{2\lambda_k},$$

we have $\xi_{k+1} \in \partial_{\eta_k} F(x_{k+1})$ and

$$(A.4) \qquad (1 - \alpha_k)\delta_k + \frac{(\varepsilon_k + \alpha_k\gamma_k)^2}{2\lambda_k} + (\varphi_{k+1})_* \ge F(x_{k+1}).$$

Moreover, if $\nu_k$ is updated according to (A.3) and $u_{k+1}$ is the center of $\varphi_{k+1}$ (which
is defined according to the second equation in (3.7)), it holds that

$$(A.5) \qquad \|u_{k+1} - \nu_{k+1}\| \le \gamma_{k+1}$$

with $\gamma_{k+1} = \gamma_k + \varepsilon_k/\alpha_k$.
Proof. Clearly $\|y_k - w_k\| \le \alpha_k \gamma_k$. Then, from Lemma A.2, we get

$$(A.6) \qquad \zeta_{k+1} := \frac{w_k - \lambda_k \nabla f(w_k) - x_{k+1} - e_k}{\lambda_k} \in \partial_{\varepsilon_{1k}^2/2\lambda_k}\, g(x_{k+1})$$

for some $e_k \in H$ and $\varepsilon_{1k}, \varepsilon_{2k} > 0$ with $\varepsilon_{1k}^2 + \varepsilon_{2k}^2 \le \varepsilon_k^2$ and $\|e_k\| \le \varepsilon_{2k} + \alpha_k \gamma_k$. From Lemma 4.1
it follows that $\xi_{k+1} := (w_k - x_{k+1} - e_k)/\lambda_k = \nabla f(w_k) + \zeta_{k+1} \in \partial_{\eta_k} F(x_{k+1})$. Then, applying
Lemma 3.3, we have

$$\begin{aligned}
(1 - \alpha_k)\delta_k + \frac{\varepsilon_{1k}^2}{2\lambda_k} + (\varphi_{k+1})_*
&\ge F(x_{k+1}) + \frac{\lambda_k}{2}\Big(2 - \frac{\alpha_k^2}{A_{k+1}\lambda_k}\Big)\|\xi_{k+1}\|^2 \\
&\qquad + \langle w_k - (\lambda_k \xi_{k+1} + x_{k+1}), \xi_{k+1} \rangle - \frac{L}{2}\|w_k - x_{k+1}\|^2 \\
&= F(x_{k+1}) - \frac{1}{2\lambda_k}\Big( \frac{\alpha_k^2}{A_{k+1}\lambda_k}\|\lambda_k \xi_{k+1}\|^2 - 2\langle w_k - x_{k+1}, \lambda_k \xi_{k+1} \rangle \\
&\qquad + \lambda_k L \|w_k - x_{k+1}\|^2 \Big) \\
&\ge F(x_{k+1}) - \frac{1}{2\lambda_k}\|w_k - x_{k+1} - \lambda_k \xi_{k+1}\|^2 \\
&= F(x_{k+1}) - \frac{1}{2\lambda_k}\|e_k\|^2,
\end{aligned}$$

where in the last inequality we used the assumptions $\alpha_k^2/(A_{k+1}\lambda_k) \le 1$ and $\lambda_k L \le 1$.
Moreover, from the definitions of $\varepsilon_{1k}$, $\varepsilon_{2k}$, and $e_k$ it holds that

$$(\varepsilon_k + \alpha_k\gamma_k)^2 \ge \varepsilon_{1k}^2 + \varepsilon_{2k}^2 + 2\varepsilon_k\alpha_k\gamma_k + (\alpha_k\gamma_k)^2 \ge \varepsilon_{1k}^2 + (\varepsilon_{2k} + \alpha_k\gamma_k)^2 \ge \varepsilon_{1k}^2 + \|e_k\|^2,$$

and (A.4) follows. To prove (A.5), first note that from the definitions of $y_k$ and $w_k$ we derive

$$(A.7) \qquad u_k = \nu_k + \frac{1}{\alpha_k}(w_k - y_k).$$

Next, by (3.7), (A.7), (A.3), and the definition of $e_k$ in (A.1), we get

$$\begin{aligned}
u_{k+1} &= u_k - \frac{\alpha_k}{(1-\alpha_k)A_k\lambda_k}(w_k - x_{k+1} - e_k) \\
&= \nu_k - \frac{\alpha_k}{(1-\alpha_k)A_k\lambda_k}(y_k - x_{k+1}) + \frac{1}{\alpha_k}(w_k - y_k) - \frac{\alpha_k}{(1-\alpha_k)A_k\lambda_k}(w_k - y_k - e_k) \\
&= \nu_{k+1} + \frac{1}{\alpha_k}\Big( (w_k - y_k) - \frac{\alpha_k^2}{(1-\alpha_k)A_k\lambda_k}\,\lambda_k\big(\nabla f(w_k) - \nabla f(y_k)\big) + \frac{\alpha_k^2}{(1-\alpha_k)A_k\lambda_k}\,\bar{e}_k \Big).
\end{aligned}$$

Therefore, recalling that by assumption $\alpha_k^2/((1-\alpha_k)A_k\lambda_k) \le 1$ and $\|\bar{e}_k\| \le \varepsilon_{2k}$, the
Baillon–Haddad theorem implies

$$\|u_{k+1} - \nu_{k+1}\| \le \frac{1}{\alpha_k}\big( \|w_k - y_k\| + \varepsilon_{2k} \big) \le \gamma_k + \frac{\varepsilon_k}{\alpha_k}.$$
Proof of Theorem A.1. Taking into account Remark 7 and reasoning by induction,
Lemma A.3 ensures that there exist sequences $(\xi_k)_{k\in\mathbb{N}}$ and $(\eta_k)_{k\in\mathbb{N}}$ such that $\xi_{k+1} \in
\partial_{\eta_k} F(x_{k+1})$ and the sequence $(\varphi_k)_{k\in\mathbb{N}}$ constructed according to (3.4) with $z_{k+1} =
x_{k+1}$, starting from $\varphi_0 = F(x_0) + \frac{A_0}{2}\|\cdot - u_0\|^2$ and $u_0 = x_0$, satisfies

$$\delta_k + (\varphi_k)_* \ge F(x_k)$$

with $\delta_0 = 0$ and

$$\delta_{k+1} = (1 - \alpha_k)\delta_k + \frac{(\alpha_k\gamma_{k+1})^2}{2\lambda_k}, \qquad \gamma_{k+1} = \gamma_k + \frac{\varepsilon_k}{\alpha_k}, \qquad \gamma_0 = 0.$$

This shows that the sequence $(\delta_k)_{k\in\mathbb{N}}$ is exactly the one studied for IAPPA1 in [52].
The statement now follows from Theorem 4.5 in [52].
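As a quick numerical sanity check, one can simulate the recursion for $(\delta_k)_{k\in\mathbb{N}}$ above and estimate its empirical decay exponent; the choices $\alpha_k = 2/(k+2)$, $\lambda_k \equiv 1$, and $\varepsilon_k = k^{-q}$ are assumptions made for this sketch, under which the exponent should be close to 1 for q > 2 and close to 2q − 3 for 3/2 < q < 2.

```python
import numpy as np

def delta_at(K, q, lam=1.0):
    # delta_{k+1} = (1 - a_k) delta_k + (a_k g_{k+1})^2 / (2 lam),
    # with g_{k+1} = g_k + eps_k / a_k, eps_k = k**(-q), a_k = 2/(k + 2).
    delta, g = 0.0, 0.0
    for k in range(1, K + 1):
        a = 2.0 / (k + 2)
        g += k ** (-q) / a
        delta = (1 - a) * delta + (a * g) ** 2 / (2 * lam)
    return delta

for q in (2.5, 1.75):
    d1, d2 = delta_at(200_000, q), delta_at(400_000, q)
    # If delta_k ~ k**(-e), doubling k scales delta by 2**(-e).
    print(f"q = {q}: empirical exponent ~ {np.log2(d1 / d2):.2f}")
```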

REFERENCES

[1] Y. I. Alber, R. S. Burachik, and A. N. Iusem, A proximal point method for nonsmooth
convex optimization problems in Banach spaces, Abstr. Appl. Anal., 2 (1997), pp. 97–120.
[2] A. Argyriou, C. A. Micchelli, M. Pontil, L. Shen, and Y. Xu, Efficient First Order
Methods for Linear Composite Regularizers, preprint, arXiv:1104.1436v1, 2011.
[3] A. Auslender, Numerical methods for nondifferentiable convex optimization, Nonlinear Anal-
ysis and Optimization, Math. Programming Stud., (1987), pp. 102–126.
[4] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski, Optimization with sparsity-inducing
penalties, Found. Trends Mach. Learn., 4 (2012), pp. 1–106.
[5] H. H. Bauschke and P. L. Combettes, The Baillon-Haddad theorem revisited, J. Convex
Anal., 17 (2010), pp. 781–787.
[6] A. Beck and M. Teboulle, Fast gradient-based algorithms for constrained total variation
image denoising and deblurring, IEEE Trans. Image Process., 18 (2009), pp. 2419–2434.
[7] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse
problems, SIAM J. Imaging Sci., 2 (2009), pp. 183–202.
[8] S. Becker, J. Bobin, and E. Candès, NESTA: A fast and accurate first-order method for
sparse recovery, SIAM J. Imaging Sci., 4 (2011), pp. 1–39.
[9] S. Bonettini and V. Ruggiero, On the convergence of primal-dual hybrid gradient algorithms
for total variation image restoration, J. Math. Imaging Vision, 44 (2012), pp. 1–18.
[10] K. Bredies, A forward-backward splitting algorithm for the minimization of non-smooth convex
functionals in Banach space, Inverse Problems, 25 (2009), 015005.
[11] R. S. Burachik and B. F. Svaiter, A relative error tolerance for a family of generalized
proximal point methods, Math. Oper. Res., 26 (2001), pp. 816–831.
[12] A. Chambolle, An algorithm for total variation minimization and applications, J. Math.
Imaging Vision, 20 (2004), pp. 89–97.
[13] A. Chambolle and P.-L. Lions, Image recovery via total variation minimization and related
problems, Numer. Math., 76 (1997), pp. 167–188.
[14] A. Chambolle and T. Pock, A first-order primal-dual algorithm for convex problems with
applications to imaging, J. Math. Imaging Vision, 40 (2011), pp. 120–145.
[15] C. Chaux, J.-C. Pesquet, and N. Pustelnik, Nested iterative algorithms for convex con-
strained image recovery problems, SIAM J. Imaging Sci., 2 (2009), pp. 730–762.
[16] P. L. Combettes, D. Dũng, and B. C. Vũ, Dualization of signal recovery problems, Set-
Valued Var. Anal., 18 (2010), pp. 373–404.
[17] P. L. Combettes and J.-C. Pesquet, Proximal splitting methods in signal processing, in
Fixed-Point Algorithms for Inverse Problems in Science and Engineering, H. H. Bauschke,
R. Burachik, P. L. Combettes, V. Elser, D. R. Luke, and H. Wolkowicz, eds., Springer-
Verlag, New York, 2011, pp. 185–212.

[18] P. L. Combettes and V. R. Wajs, Signal recovery by proximal forward-backward splitting,
Multiscale Model. Simul., 4 (2005), pp. 1168–1200.
[19] R. Cominetti, Coupling the proximal point algorithm with approximation methods, J. Optim.
Theory Appl., 95 (1997), pp. 581–600.


[20] R. Correa and C. Lemarechal, Convergence of some algorithms of convex minimization,
Math. Program., 62 (1993), pp. 261–275.
[21] I. Daubechies, G. Teschke, and L. Vese, Iteratively solving linear inverse problems under
general convex constraints, Inverse Problems Imaging, 1 (2007), pp. 29–46.
[22] O. Devolder, F. Glineur, and Y. Nesterov, First-order methods of smooth convex opti-
mization with inexact oracle, Math. Program., (2011), pp. 1–39.
[23] J. Duchi and Y. Singer, Efficient online and batch learning using forward backward splitting,
J. Mach. Learn. Res., 10 (2009), pp. 2899–2934.
[24] J. Eckstein, Approximate iterations in Bregman-function-based proximal algorithms, Math.
Program., 83 (1998), pp. 113–123.
[25] B. Eicke, Iteration methods for convexly constrained ill-posed problems in Hilbert space, Nu-
mer. Funct. Anal. Optim., 13 (1992), pp. 413–429.
[26] E. Esser, X. Zhang, and T. F. Chan, A general framework for a class of first order primal-
dual algorithms for convex optimization in imaging science, SIAM J. Imaging Sci., 3 (2010),
pp. 1015–1046.
[27] M. A. T. Figueiredo, R. D. Nowak, and S. J. Wright, Gradient projection for sparse
reconstruction: Application to compressed sensing and other inverse problems, IEEE J.
Sel. Top. Signal Process., 1 (2007), pp. 586–597.
[28] M. Fornasier, ed., Theoretical Foundations and Numerical Methods for Sparse Recovery,
Radon Ser. Comput. Appl. Math. 9, De Gruyter, Berlin, 2010.
[29] O. Güler, On the convergence of the proximal point algorithm for convex minimization, SIAM
J. Control Optim., 29 (1991), pp. 403–419.
[30] O. Güler, New proximal point algorithms for convex minimization, SIAM J. Optim., 2 (1992),
pp. 649–664.
[31] B. He and X. Yuan, An accelerated inexact proximal point algorithm for convex minimization,
J. Optim. Theory Appl., 154 (2012), pp. 536–548.
[32] J.-B. Hiriart-Urruty and C. Lemaréchal, Convex Analysis and Minimization Algorithms.
II, Grundlehren Math. Wiss. 306, Springer-Verlag, Berlin, 1993.
[33] R. Jenatton, J.-Y. Audibert, and F. Bach, Structured variable selection with sparsity-
inducing norms, J. Mach. Learn. Res., 12 (2011), pp. 2777–2824.
[34] B. Lemaire, About the convergence of the proximal method, in Advances in Optimization
(Lambrecht, 1991), Lecture Notes in Econom. and Math. Systems 382, Springer, Berlin,
1992, pp. 39–51.
[35] P. L. Lions and B. Mercier, Splitting algorithms for the sum of two nonlinear operators,
SIAM J. Numer. Anal., 16 (1979), pp. 964–979.
[36] B. Martinet, Régularisation d’inéquations variationnelles par approximations successives,
Rev. Française Inform. Rech. Oper., 4 (1970), pp. 154–158.
[37] R. D. C. Monteiro and B. F. Svaiter, Convergence Rate of Inexact Proximal Point Methods
with Relative Error Criteria for Convex Optimization, http://www.optimization-online.org/DB_HTML/2010/08/2714.html (2010).
[38] R. Monteiro and B. Svaiter, An accelerated hybrid proximal extragradient method for convex
optimization and its implications to second-order methods, SIAM J. Optim., 23 (2013),
pp. 1092–1125.
[39] J.-J. Moreau, Fonctions convexes duales et points proximaux dans un espace hilbertien, C. R.
Acad. Sci. Paris Ser. I Math., 255 (1962), pp. 2897–2899.
[40] J.-J. Moreau, Propriétés des applications “prox,” C. R. Acad. Sci. Paris Ser. I Math., 256
(1963), pp. 1069–1071.
[41] J.-J. Moreau, Proximité et dualité dans un espace hilbertien, Bull. Soc. Math. France, 93
(1965), pp. 273–299.
[42] S. Mosci, L. Rosasco, M. Santoro, A. Verri, and S. Villa, Solving structured sparsity
regularization with proximal methods, in Machine Learning and Knowledge Discovery in
Databases, Lecture Notes in Comput. Sci. 6322, J. Balcázar, F. Bonchi, A. Gionis, and
M. Sebag, eds., Springer, Berlin, 2010, pp. 418–433.
[43] A. S. Nemirovsky and D. B. Yudin, Problem Complexity and Method Efficiency in Optimiza-
tion, Wiley-Intersci. Ser. Discrete Math., Wiley, New York, 1983.
[44] Y. Nesterov, Introductory Lectures on Convex Optimization. A Basic Course, Appl. Optim.
87, Kluwer Academic, Boston, MA, 2004.


[45] Y. Nesterov, Gradient Methods for Minimizing Composite Objective Function, Technical re-
port, CORE Discussion Papers from Université Catholique de Louvain, Center for Opera-
tions Research and Econometrics No 2007/076, 2009.

[46] G. Peyré and J. Fadili, Group sparsity with overlapping partition functions, in Proc. EU-
SIPCO 2011, Barcelona, 2011, pp. 303–307.
[47] R. T. Rockafellar, Augmented Lagrangians and applications of the proximal point algorithm
in convex programming, Math. Oper. Res., 1 (1976), pp. 97–116.
[48] R. T. Rockafellar, Monotone operators and the proximal point algorithm, SIAM J. Control
Optim., 14 (1976), pp. 877–898.
[49] L. Rosasco, S. Mosci, M. S. Santoro, A. Verri, and S. Villa, A regularization approach
to nonlinear variable selection, JMLR Workshop Conf. Proc., 9 (2010), pp. 653–660.
[50] L. I. Rudin, S. Osher, and E. Fatemi, Nonlinear total variation based noise removal algo-
rithms, Phys. D, 60 (1992), pp. 259–268.
[51] A. Sabharwal and L. C. Potter, Convexly constrained linear inverse problems: Iterative
least-squares and regularization, IEEE Trans. Signal Process., 46 (1998), pp. 2345–2352.
[52] S. Salzo and S. Villa, Inexact and accelerated proximal point algorithm, J. Convex Anal., 19
(2012).
[53] O. Scherzer, M. Grasmair, H. Grossauer, M. Haltmeier, and F. Lenzen, Variational
Methods in Imaging, Appl. Math. Sci. 167, Springer, New York, 2009.
[54] M. Schmidt, N. Le Roux, and F. Bach, Convergence rates of inexact proximal-gradient
methods for convex optimization, in Advances in Neural Information Processing Systems
24, 2011.
[55] M. V. Solodov and B. F. Svaiter, A hybrid approximate extragradient-proximal point algo-
rithm using the enlargement of a maximal monotone operator, Set-Valued Anal., 7 (1999),
pp. 323–345.
[56] M. V. Solodov and B. F. Svaiter, A comparison of rates of convergence of two inexact
proximal point algorithms, in Nonlinear Optimization and Related Topics (Erice, 1998),
Appl. Optim. 36, Kluwer Academic, Dordrecht, 2000, pp. 415–427.
[57] M. V. Solodov and B. F. Svaiter, Error bounds for proximal point subproblems and associ-
ated inexact proximal point algorithms, Math. Program., 88 (2000), pp. 371–389.
[58] M. V. Solodov and B. F. Svaiter, An inexact hybrid generalized proximal point algorithm
and some new results on the theory of Bregman functions, Math. Oper. Res., 25 (2000),
pp. 214–230.
[59] M. V. Solodov and B. F. Svaiter, A unified framework for some inexact proximal point
algorithms, Numer. Funct. Anal. Optim., 22 (2001), pp. 1013–1035.
[60] A. Subramanian et al., Gene set enrichment analysis: A knowledge-based approach for inter-
preting genome-wide expression profiles, Proc. Natl. Acad. Sci. USA, 102 (2005), p. 15545.
[61] P. Tseng, Approximation accuracy, gradient methods, and error bound for structured convex
optimization, Math. Program., 125 (2010), pp. 263–295.
[62] M. J. Van De Vijver et al., A gene-expression signature as a predictor of survival in breast
cancer, New England J. Med., 347 (2002), pp. 1999–2009.
[63] Y. Yao and N. Shahzad, Strong convergence of a proximal point algorithm with general errors,
Optim. Lett., 6 (2012), pp. 621–628.
[64] M. Yuan and Y. Lin, Model selection and estimation in regression with grouped variables, J.
R. Stat. Soc. Ser. B Stat. Method, 68 (2006), pp. 49–67.
[65] C. Zălinescu, Convex Analysis in General Vector Spaces, World Scientific Publishing, River
Edge, NJ, 2002.
[66] A. J. Zaslavski, Convergence of a proximal point method in the presence of computational
errors in Hilbert spaces, SIAM J. Optim., 20 (2010), pp. 2413–2421.
[67] P. Zhao, G. Rocha, and B. Yu, The composite absolute penalties family for grouped and
hierarchical variable selection, Ann. Statist., 37 (2009), pp. 3468–3497.
