Accelerated and Inexact Forward-Backward Algorithms
DOI: 10.1137/110844805
We consider the convex minimization problem

min_{x∈H} F(x),  F = f + g,

where
(H1) g : H → ]−∞, +∞] is proper, lower semicontinuous (l.s.c.), and convex,
(H2) f : H → R is convex and differentiable, and ∇f is L-Lipschitz continuous on H with L > 0, namely

‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖  for all x, y ∈ H.
induced by g [17, 18, 35]. These schemes are also known under the name of proximal
gradient methods [61], since the implicit step relies on the computation of the so-
called proximity operator, introduced by Moreau in [39]. Though appealing for their
simplicity, gradient-based methods often exhibit a slow speed of convergence. For
this reason, resorting to the ideas contained in the work of Nesterov [44], there has
recently been an active interest in accelerations and modifications of the classical
forward-backward splitting algorithm [61, 45, 7]. We will study a general accelerated scheme of this type, formalized as (AIFB) in section 4.

One can rely on different algorithms to compute the proximal point, as, for instance, those in [12, 19, 14]. This resolves the issue of convergence and applicability of the two-loops algorithm.
Note also that none of the abovementioned papers study the rate of convergence
of the nested algorithm, as we do in section 5.
Note that if z ≈_ε prox_{λg}(y), then necessarily z ∈ dom g, and hence the allowed approximations are always feasible. This notion was first proposed, in the context of the proximal point algorithm, in [34] and successfully used in, e.g., [1, 19, 52]. A relative version of criterion (2.3) has recently been proposed for nonaccelerated proximal methods in the preprint [37], which allows one to interpret the (exact) forward-backward splitting algorithm as an instance of an inexact proximal point algorithm.
Geometrically, the admissible approximations z ≈_ε prox_{λg}(y) lie on one side of a hyperplane which is normal to y − z and at distance ε²/(2‖y − z‖) from z. See Figure 2.1.
In the following we provide an analysis of the notion of inexactness given in Definition 2.1, which will clarify the nature of these approximations and their scope of applicability. To this purpose, we will make use of a duality technique, an approach that is quite common in signal recovery and image processing applications [18, 12, 16]. The starting point is the Moreau decomposition formula [41, 18], stating that

y = prox_{λg}(y) + λ prox_{g*/λ}(y/λ).
This arises immediately from Definition 2.1 and the following equivalence (see Theorem 2.4.4, item (iv), in [65]):

ξ ∈ ∂_ε g(z) ⟺ g(z) + g*(ξ) − ⟨ξ, z⟩ ≤ ε.

When g = ω ∘ B as in (2.7), the computation of prox_{λg}(y) amounts to solving

(2.8) min_{x∈H} Φ_λ(x),  Φ_λ(x) = ω(Bx) + (1/(2λ))‖x − y‖².
From now on, we assume ω is continuous at Bx₀ for some x₀ ∈ H. Then, the Fenchel–Moreau–Rockafellar duality formula (see Corollary 2.8.5 in [65]) states that

min_{x∈H} Φ_λ(x) = −min_v Ψ_λ(v),

where

(2.10) Ψ_λ(v) = (1/(2λ))‖λB*v − y‖² + ω*(v) − (1/(2λ))‖y‖²,
or, equivalently, the minimum of the duality gap

G(x, v) = Φ_λ(x) + Ψ_λ(v)

is zero. In particular, for x = y − λB*v we can estimate

(2.12)
G(y − λB*v, v) = Φ_λ(y − λB*v) + Ψ_λ(v)
  = (1/(2λ))(‖λB*v‖² − 2⟨λB*v, y⟩) + (1/(2λ))‖λB*v‖² + sup_{w∈H} (⟨w, y − λB*v⟩ − g*(w)) + ω*(v)
  = ⟨B*v, λB*v − y⟩ + sup_{w∈H} (⟨w, y − λB*v⟩ − g*(w)) + ω*(v)
  ≥ sup_{w∈H} (⟨w − B*v, y − λB*v⟩ − g*(w)) + g*(B*v)
  = sup_{w∈H} −[g*(w) − g*(B*v) − ⟨w − B*v, y − λB*v⟩],

where the inequality uses ω*(v) ≥ g*(B*v).
Therefore, if G(y − λB*v, v) ≤ ε²/(2λ), setting η = ε/λ it holds

(2.13) g*(w) − g*(B*v) ≥ ⟨w − B*v, y − λB*v⟩ − η²λ/2  for all w ∈ H,

which is equivalent to y − λB*v ∈ ∂_{η²λ/2} g*(B*v) and hence to B*v ≈_η prox_{g*/λ}(y/λ).
As regards the second part of the statement, assuming ω ∗ (v) = g ∗ (B ∗ v), the inequality
in (2.12) becomes an equality and condition (a) is then equivalent to (2.13). Thus,
the reverse implication (b) ⇒ (a) follows.
Remark 1. In Proposition 2.3 the assumption ω*(v) = g*(B*v), guaranteeing the equivalence of statements (a), (b), (c), holds in the following cases:
• ω is positively homogeneous. Indeed, in that case ω* = δ_S with S = ∂ω(0), and g* = δ_K with K = ∂g(0) = B*(S). Thus, if v ∈ S, we have ω*(v) = δ_S(v) = δ_K(B*v) = g*(B*v). This entails that

G(y − λB*v, v) ≤ ε²/(2λ) ⟺ λB*v ≈_ε P_{λK}(y) ⟺ y − λB*v ≈_ε prox_{λg}(y).
• B is surjective. Indeed, in that case g*(B*v) = sup_{x∈H} (⟨Bx, v⟩ − ω(Bx)) = ω*(v). For instance, for B = id, it holds

G(y − λv, v) ≤ ε²/(2λ) ⟺ v ≈_η prox_{g*/λ}(y/λ).
We underline that in the two cases above the proposed inexact notion of prox is fully
characterized by means of the duality gap.
Summarizing, the implication (a) ⇒ (c) stated in Proposition 2.3 ensures that
admissible approximations of proximal points, in the sense of Definition 2.1, can
always be computed by approximately minimizing the duality gap G(y − λB ∗ v, v). In
general, condition (a) of Proposition 2.3 is only a sufficient condition to get inexact
proximal points with precision ε. However, as discussed in Remark 1, it becomes a
full characterization of inexact proximal points for a relevant class of penalties. We
finally highlight that condition (a) can be easily checked in practice and will be the basis of the analysis of the convergence rate for the nested procedure in section 5.2.
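For illustration, condition (a) is straightforward to evaluate numerically. The following is a minimal NumPy sketch of the criterion (all identifiers are ours and purely illustrative; ω and ω* are assumed to be available as callables, and B as a dense array):

    import numpy as np

    def duality_gap(v, y, lam, B, omega, omega_conj):
        """G(y - lam*B^T v, v) = Phi_lam(y - lam*B^T v) + Psi_lam(v), cf. (2.8), (2.10)."""
        z = y - lam * (B.T @ v)              # primal point associated with the dual variable v
        primal = omega(B @ z) + np.sum((z - y) ** 2) / (2 * lam)
        dual = (np.sum((lam * (B.T @ v) - y) ** 2) / (2 * lam)
                + omega_conj(v) - np.sum(y ** 2) / (2 * lam))
        return primal + dual

    def admissible(v, y, lam, B, omega, omega_conj, eps):
        """Condition (a) of Proposition 2.3: if True, z = y - lam*B^T v is an
        eps-approximation of prox_{lam g}(y) in the sense of Definition 2.1."""
        return duality_gap(v, y, lam, B, omega, omega_conj) <= eps ** 2 / (2 * lam)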
2.2. Comparison with other kinds of approximation. Other notions of
inexactness for the proximity operator have been considered in the literature. One of
the first is

(2.14) d(0, ∂Φ_λ(z)) ≤ ε/λ,

which was proposed in [48] and treated also in [30].
Another notion, which we shall use in the appendix, replaces the exact minimum in (2.1) with ε²/(2λ)-minima, and is defined as follows:

(2.15) z ≃_ε prox_{λg}(y) :⟺ Φ_λ(z) ≤ inf Φ_λ + ε²/(2λ)

(we write ≃ to distinguish this notion from the one of Definition 2.1, denoted by ≈).
The condition on the right-hand side of (2.15) is equivalent to 0 ∈ ∂_{ε²/(2λ)} Φ_λ(z) and implies, by the strong convexity of Φ_λ, ‖z − prox_{λg}(y)‖ ≤ ε (see [52]). This type of error was first considered in [3] and then employed, for instance, in [19, 52, 66]. Lemma 1 in [52] shows that the criterion in (2.15) is more general than both those in (2.3) and (2.14). We also note that (again from Lemma 1 in [52]) the error criterion proposed in [38, 55] for the approximate hybrid extragradient-proximal point algorithm corresponds to a relative version of (2.15).
Here, to help position the proposed criterion, we give a proposition and a corollary that directly link approximations in the sense of (2.3) with those in the sense of (2.15), valid for a subclass of functions g.
Proposition 2.4. Let g : H → ]−∞, +∞] be proper, convex, and l.s.c. with dom g bounded, and y, z ∈ H. For every ε > 0, if 0 < δ ≤ diam(dom g) and diam(dom g)δ ≤ ε²/2, then

z ≃_δ prox_{λg}(y) ⟹ z ≈_ε prox_{λg}(y).
Proof. Let z ≃_δ prox_{λg}(y). Thanks to Lemma 1 in [52], there exist δ₁, δ₂ ≥ 0 with δ₁² + δ₂² ≤ δ² and e ∈ H, ‖e‖ ≤ δ₂, such that (y + e − z)/λ ∈ ∂_{δ₁²/(2λ)} g(z). Therefore, for every x ∈ dom g,

λg(x) − λg(z) ≥ ⟨x − z, y − z⟩ − diam(dom g)δ₂ − δ₁²/2.

Now it is easy to show that, if 0 < δ ≤ diam(dom g), then

sup_{δ₁²+δ₂²≤δ²} (diam(dom g)δ₂ + δ₁²/2) = diam(dom g)δ.

Thus, if diam(dom g)δ ≤ ε²/2, it holds λg(x) − λg(z) ≥ ⟨x − z, y − z⟩ − ε²/2 for every x ∈ dom g, which proves that (y − z)/λ ∈ ∂_{ε²/(2λ)} g(z).
Proposition 2.4 states that for each ε > 0 one can get approximations of proximal
points in the sense of Definition 2.1 from approximations in the sense of (2.15) as
soon as the precision δ is chosen small enough.
Corollary 2.5. Let g : H → R be proper, convex, and l.s.c. with dom g* bounded, and y, z ∈ H. For any ε > 0, if 0 < σ ≤ diam(dom g*) and σλ² diam(dom g*) ≤ ε²/2, then

(2.16) z ≃_σ prox_{g*/λ}(y/λ) ⟹ y − λz ≈_ε prox_{λg}(y).

In particular, suppose g is positively homogeneous (i.e., g(αx) = αg(x) for α ≥ 0). Then, setting K := ∂g(0), if 0 < σ ≤ λ diam K and σλ diam K ≤ ε²/2, it holds

(2.17) z ≃_σ P_{λK}(y) ⟹ y − z ≈_ε prox_{λg}(y).
Proof. Set η = ε/λ. Then the condition σλ² diam(dom g*) ≤ ε²/2 is equivalent to σ diam(dom g*) ≤ η²/2. Therefore, by applying Proposition 2.4 to the function g*, we obtain

z ≃_σ prox_{g*/λ}(y/λ) ⟹ z ≈_η prox_{g*/λ}(y/λ).

Then, the inexact Moreau decomposition (2.6) gives y − λz ≈_ε prox_{λg}(y).
(3.6) ϕ_k(x) = (ϕ_k)_* + (A_k/2)‖x − ν_k‖²,

and A_k, ν_k, and (ϕ_k)_* can be recursively derived from the parameters (z_k, η_k, ξ_k, α_k)_{k∈N},

(3.7)
A_{k+1} = (1 − α_k)A_k,
ν_{k+1} = ν_k − (α_k/((1 − α_k)A_k)) ξ_{k+1},
(ϕ_{k+1})_* = (1 − α_k)(ϕ_k)_* + α_k(F(z_{k+1}) + ⟨ν_k − z_{k+1}, ξ_{k+1}⟩ − η_k) − (α_k²/(2A_{k+1}))‖ξ_{k+1}‖².
Next, it remains to generate a sequence (xk )k∈N satisfying inequality (3.2) and to
study the asymptotic behavior of βk . To this aim we recall two lemmas, whose proofs
are provided in [52], that will be essential in the derivation of the algorithm.
Lemma 3.3. Suppose for some k ∈ N, ϕ_k is defined as in (3.6) and ϕ_{k+1} according to (3.4) with ξ_{k+1} ∈ ∂_{η_k} F(z_{k+1}). If x_k ∈ H satisfies F(x_k) ≤ (ϕ_k)_* + δ_k for some δ_k ≥ 0, then, setting y_k = (1 − α_k)x_k + α_k ν_k, for any λ > 0 it holds

(1 − α_k)δ_k + η_k + (ϕ_{k+1})_* ≥ F(z_{k+1}) + (λ/2)(2 − α_k²/(A_{k+1}λ))‖ξ_{k+1}‖² + ⟨y_k − (λξ_{k+1} + z_{k+1}), ξ_{k+1}⟩.
Lemma 3.4. Given a sequence (λ_k)_{k∈N} with λ_k ≥ λ > 0, and constants A > 0 and b ≥ a > 0, define (A_k)_{k∈N} and (α_k)_{k∈N} recursively, such that A₀ = A and, for k ∈ N,

α_k ∈ [0, 1), with a ≤ α_k²/((1 − α_k)A_k λ_k) ≤ b,
A_{k+1} = (1 − α_k)A_k.

Then β_k := ∏_{i=0}^{k−1}(1 − α_i) ∼ 1/k²; in particular, β_k → 0.
Lemma 4.1. Let y, z ∈ H, ε ≥ 0, and ζ ∈ ∂_ε g(z). Then, for every x ∈ H,

(4.1) F(x) ≥ F(z) + ⟨x − z, ∇f(y) + ζ⟩ − (L/2)‖z − y‖² − ε.

In other words, ∇f(y) + ζ ∈ ∂_η F(z), with η = (L/2)‖z − y‖² + ε.
Proof. Fix x, y, z ∈ H. Since ∇f is L-Lipschitz continuous, it holds

(4.2) f(y) ≥ f(z) − ⟨z − y, ∇f(y)⟩ − (L/2)‖z − y‖².

On the other hand, f being convex, we have f(x) ≥ f(y) + ⟨x − y, ∇f(y)⟩, which combined with (4.2) gives

(4.3) f(x) ≥ f(z) + ⟨x − z, ∇f(y)⟩ − (L/2)‖z − y‖².

Since g is convex and ζ ∈ ∂_ε g(z), we have g(x) ≥ g(z) + ⟨x − z, ζ⟩ − ε, which, summed with (4.3), gives the statement.
Combining Lemma 4.1 with Lemma 3.3, we derive the following result.
Lemma 4.2. Suppose for some k ∈ N, ϕ_k is defined as in (3.6) and x_k ∈ H satisfies F(x_k) ≤ (ϕ_k)_* + δ_k for some δ_k ≥ 0. Set y_k = (1 − α_k)x_k + α_k ν_k. For any ε_k ≥ 0, λ_k > 0, let

x_{k+1} ≈_{ε_k} prox_{λ_k g}(y_k − λ_k ∇f(y_k)),  ξ_{k+1} = (y_k − x_{k+1})/λ_k,  η_k = (L/2)‖y_k − x_{k+1}‖² + ε_k²/(2λ_k).

Then ξ_{k+1} ∈ ∂_{η_k} F(x_{k+1}), and if ϕ_{k+1} is defined according to (3.4) with z_{k+1} = x_{k+1},

(4.4) (1 − α_k)δ_k + ε_k²/(2λ_k) + (ϕ_{k+1})_* ≥ F(x_{k+1}) + (λ_k/2)(2 − λ_k L − α_k²/((1 − α_k)A_k λ_k))‖ξ_{k+1}‖².
Proof. Recalling Definition 2.1, since x_{k+1} ≈_{ε_k} prox_{λ_k g}(y_k − λ_k ∇f(y_k)), we have

(4.5) ζ_{k+1} := (y_k − x_{k+1})/λ_k − ∇f(y_k) ∈ ∂_{ε_k²/(2λ_k)} g(x_{k+1}).

Therefore Lemma 4.1 gives ξ_{k+1} = ∇f(y_k) + ζ_{k+1} ∈ ∂_{η_k} F(x_{k+1}) and Lemma 3.3 gives

(4.6) (1 − α_k)δ_k + ε_k²/(2λ_k) + (ϕ_{k+1})_* ≥ F(x_{k+1}) + (λ_k/2)(2 − α_k²/((1 − α_k)A_k λ_k))‖ξ_{k+1}‖² + ⟨y_k − (λ_k ξ_{k+1} + x_{k+1}), ξ_{k+1}⟩ − (L/2)‖y_k − x_{k+1}‖².

Now, since y_k = λ_k ξ_{k+1} + x_{k+1}, the scalar product on the right-hand side of (4.6) is zero, and (4.4) follows.
We are now ready to define a general accelerated and inexact forward-backward
splitting (AIFB) algorithm and to prove its convergence rate.
Theorem 4.3. For fixed numbers t₀ > 1, a ∈ ]0, 2[, sequences of parameters (λ_k)_{k∈N}, λ_k ∈ ]0, (2 − a)/L], and (a_k)_{k∈N} such that a ≤ a_k ≤ 2 − λ_k L, and a sequence of errors (ε_k)_{k∈N} with ε_k ≥ 0, we choose x₀ = y₀ ∈ dom g and, for every k ∈ N, we recursively define

t_{k+1} = (1 + √(1 + 4(a_k λ_k)t_k²/(a_{k+1} λ_{k+1})))/2,

(AIFB)
x_{k+1} ≈_{ε_k} prox_{λ_k g}(y_k − λ_k ∇f(y_k)),
y_{k+1} = x_{k+1} + ((t_k − 1)/t_{k+1})(x_{k+1} − x_k) + (1 − a_k)(t_k/t_{k+1})(y_k − x_{k+1}).

Then, setting z_{k+1} = x_{k+1}, ξ_{k+1} = (y_k − x_{k+1})/λ_k, η_k = (L/2)‖y_k − x_{k+1}‖² + ε_k²/(2λ_k), and α_k = t_k^{−1}, the sequence (ϕ_k)_{k∈N} defined according to (3.4) starting from ϕ₀ = F(x₀) + (A₀/2)‖· − x₀‖², with A₀ = 1/(t₀(t₀ − 1)a₀λ₀), is an estimate sequence for F and it holds

(4.7) δ_{k+1} + (ϕ_{k+1})_* ≥ F(x_{k+1}) + (c_k/(2λ_k))‖y_k − x_{k+1}‖²,

with δ_{k+1} = (1 − α_k)δ_k + ε_k²/(2λ_k), δ₀ = 0, and c_k = 2 − a_k − λ_k L ≥ 0.

Proof. By the definition of t_{k+1}, it holds

(4.8) t_{k+1}² − t_{k+1} − (λ_k a_k/(λ_{k+1} a_{k+1})) t_k² = 0.
Since α_k = t_k^{−1} ∈ (0, 1), from (4.8) we have (1 − α_{k+1})a_{k+1}λ_{k+1}/α_{k+1}² = a_k λ_k/α_k², and hence

(4.9) (1 − α_k) (α_k²/((1 − α_k)a_k λ_k)) = α_{k+1}²/((1 − α_{k+1})a_{k+1}λ_{k+1}).

If we set A_k = α_k²/[(1 − α_k)a_k λ_k], (4.9) turns into A_{k+1} = (1 − α_k)A_k as in (3.7), and the inequality a_k ≤ 2 − λ_k L gives

(4.10) α_k²/((1 − α_k)A_k λ_k) + λ_k L ≤ 2.
Now, set

(4.12) α_k ν_k = y_k − (1 − α_k)x_k,  α_{k+1} ν_{k+1} = y_{k+1} − (1 − α_{k+1})x_{k+1},

and hence, substituting into (4.11) and recalling the definition of A_k, we get

(4.13) ν_{k+1} = ν_k − (α_k/((1 − α_k)A_k λ_k))(y_k − x_{k+1}).
If we set ξ_{k+1} = (y_k − x_{k+1})/λ_k, (4.13) becomes the second equation in (3.7). Now define (ϕ_k)_{k∈N} according to (3.4) using the parameters (x_k, η_k, ξ_k, α_k)_{k∈N} and starting from ϕ₀ = F(x₀) + (A₀/2)‖· − x₀‖². Then ϕ_k = (ϕ_k)_* + (A_k/2)‖· − ν_k‖² for every k ∈ N and we have δ₀ + (ϕ₀)_* ≥ F(x₀). Reasoning by induction, and using Lemma 4.2, we obtain ξ_{k+1} ∈ ∂_{η_k} F(x_{k+1}) and (4.7). Finally note that, since by assumption and (4.10), a ≤ α_k²/((1 − α_k)A_k λ_k) ≤ 2, Lemma 3.4 ensures that β_k = ∏_{i=0}^{k−1}(1 − α_i) tends to 0.
Remark 3 (retrieving FISTA [7]). In the initialization step of (AIFB), we are allowed to choose t₀ = 1, as soon as a₀ = 1. Indeed, as one can easily check, with these choices we get t₁ > 1 and y₁ = x₁. Therefore the sequences continue as if they started from (t₁, x₁, y₁). This shows that algorithm (AIFB) includes FISTA, by choosing a_k = 1 and λ_k = λ ≤ 1/L and starting with t₀ = 1. Moreover, for f = 0 and a_k = 2, we also obtain the proximal point algorithm given in the appendix of [30].
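For illustration, a minimal NumPy sketch of the (AIFB) iteration with constant a_k = a and λ_k = λ follows (identifiers are illustrative; this is not the implementation of section 6). Here inexact_prox is assumed to return an ε-approximation in the sense of Definition 2.1, e.g., via the duality-gap criterion of Proposition 2.3; with a = 1, λ ≤ 1/L, t₀ = 1, and exact proximal steps it reduces to FISTA, as discussed in Remark 3:

    import numpy as np

    def aifb(x0, grad_f, inexact_prox, lam, a=1.0, t0=1.0, q=1.5, C=1.0, n_iter=500):
        """Sketch of (AIFB) with constant a_k = a and lam_k = lam <= (2 - a)/L.
        inexact_prox(w, lam, eps) ~ prox_{lam g}(w) up to precision eps (Definition 2.1);
        eps_k = C/k^q follows the error schedule of Theorem 4.4."""
        x, y, t = x0.copy(), x0.copy(), t0
        for k in range(1, n_iter + 1):
            eps_k = C / k ** q
            x_new = inexact_prox(y - lam * grad_f(y), lam, eps_k)
            t_new = (1 + np.sqrt(1 + 4 * t ** 2)) / 2   # (4.8) with a_k*lam_k constant
            y = x_new + (t - 1) / t_new * (x_new - x) + (1 - a) * t / t_new * (y - x_new)
            x, t = x_new, t_new
        return x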
In terms of α_k = 1/t_k, the update (4.8) reads

(4.14) α_{k+1} = (1/2)[√(((a_{k+1}λ_{k+1})/(a_kλ_k))² α_k⁴ + 4((a_{k+1}λ_{k+1})/(a_kλ_k)) α_k²) − ((a_{k+1}λ_{k+1})/(a_kλ_k)) α_k²].
Proof. By Theorems 3.2 and 4.3, it is enough to study the asymptotic behavior of the sequences β_k and δ_k. Since λ_k ∈ [λ, (2 − a)/L], by Lemma 3.4, β_k ∼ 1/k². Concerning the structure of the error term δ_k, it is easy to prove (see Lemma 3.3 in [30]) that the solution of the difference equation δ_{k+1} = (1 − α_k)δ_k + ε_k²/(2λ_k), obtained in Theorem 4.3, is given by

(4.15) δ_k = (β_k/2) Σ_{i=0}^{k−1} ε_i²/(λ_i β_{i+1}).
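To probe how the accuracy rate q enters the bound, one can evaluate (4.15) numerically. A small sketch (illustrative names; alphas, lams, and eps are the sequences α_k, λ_k, ε_k):

    import numpy as np

    def delta_from_errors(alphas, lams, eps, K):
        """delta_k = (beta_k/2) * sum_{i<k} eps_i^2/(lam_i*beta_{i+1}), cf. (4.15),
        with beta_k = prod_{i<k}(1 - alpha_i) and beta_0 = 1."""
        beta = np.cumprod(np.concatenate(([1.0], 1.0 - np.asarray(alphas[:K]))))
        return [beta[k] / 2 * sum(eps[i] ** 2 / (lams[i] * beta[i + 1]) for i in range(k))
                for k in range(K + 1)]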
Now suppose that Ψ_λ(v_n) − Ψ_λ(v) = O(1/n^{2p}). Then, the first part of statement
(5.3) directly follows from (5.4). Regarding the rate on the duality gap, note that the
function Φλ is Lipschitz continuous on bounded sets, being convex and continuous.
Thus there exists L₁ > 0 such that

Φ_λ(z_n) − Φ_λ(z) ≤ L₁‖z_n − z‖ ≤ L₁√(2λ) (Ψ_λ(v_n) − Ψ_λ(v))^{1/2}.
This shows that the convergence rate stated for the duality gap in (5.3) holds.
In order to compute admissible approximations of the proximal point, we can
choose any minimizing algorithm for the dual problem. A simple choice is the forward-
backward splitting algorithm (also called ISTA [7]). Since for this choice Ψ_λ(v_n) − Ψ_λ(v) = O(1/n), this gives the rate G(z_n, v_n) = O(1/√n) for the duality gap. We remark that the pair of sequences (y − λB*v_n, v_n) corresponds exactly to the pair (x_n, y_n) generated by the primal-dual Algorithm 1 proposed in [14] when applied to the minimization of Φ_λ(x) = g(x) + (1/(2λ))‖x − y‖² (with τ = λ, θ = 1).
A more efficient choice is FISTA, resulting in the rate G(zn , vn ) = O(1/n). The
latter will be our choice in the numerical section. For the case of ω positively homoge-
neous (e.g., total variation), it holds ω ∗ = δS , with S = ∂ω(0) and the corresponding
dual minimization problem min Ψλ becomes a constrained smooth optimization prob-
lem. Then, FISTA reduces to an accelerated projected gradient descent algorithm
(5.6)
v_{n+1} = P_S(u_n − (γ_n/λ) B(λB*u_n − y)),  0 < γ_n ≤ 1/‖B‖²,
u_{n+1} = v_{n+1} + ((t_n − 1)/t_{n+1})(v_{n+1} − v_n),
with the usual choices for tn (see Remark 3). We note that in this case Propositions 2.2
and 2.3 ensure that problem (5.1) is equivalent to (5.2).
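A self-contained NumPy sketch of (5.6) with the duality-gap stopping rule follows, for a positively homogeneous ω given as a callable together with the (assumed available) projection project_S onto S = ∂ω(0); all names are illustrative, and ω*(v) = 0 for the feasible iterates v ∈ S:

    import numpy as np

    def dual_apgd_prox(y, lam, B, omega, project_S, eps, max_iter=1000):
        """Accelerated projected gradient (5.6) on the dual problem; returns z such
        that z ~eps prox_{lam g}(y) once the duality gap drops below eps^2/(2*lam),
        i.e., condition (a) of Proposition 2.3."""
        gamma = 1.0 / np.linalg.norm(B, 2) ** 2      # 0 < gamma_n <= 1/||B||^2
        v = np.zeros(B.shape[0]); u = v.copy(); t = 1.0
        z = y.copy()
        for _ in range(max_iter):
            v_new = project_S(u - (gamma / lam) * (B @ (lam * (B.T @ u) - y)))
            t_new = (1 + np.sqrt(1 + 4 * t ** 2)) / 2
            u = v_new + (t - 1) / t_new * (v_new - v)
            v, t = v_new, t_new
            z = y - lam * (B.T @ v)                  # candidate primal point
            gap = (omega(B @ z) + np.sum((z - y) ** 2) / (2 * lam)
                   + np.sum((lam * (B.T @ v) - y) ** 2) / (2 * lam)
                   - np.sum(y ** 2) / (2 * lam))
            if gap <= eps ** 2 / (2 * lam):
                break
        return z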
Remark 6. We highlight that the results in Theorem 5.1 hold in the more general setting of a minimization problem of the form

min_{x∈X} ϕ(x) + ω(Bx),

where dom ϕ = X and ϕ is c-strongly convex and differentiable with L-Lipschitz continuous gradient.² Indeed, in this case one has z = ∇ϕ*(−B*v), z_n = ∇ϕ*(−B*v_n), and the strong convexity of ϕ* allows one to get the analogue of the bound (5.4):

(c²/(2L))‖z_n − z‖² ≤ Ψ(v_n) − Ψ(v).
²This is equivalent to requiring ϕ* strongly convex and differentiable with Lipschitz continuous gradient.
Supposing that each proximal subproblem (5.1) can be solved, with precision ε, in at most

(5.8) Dλ/ε^{2/p},  p > 0,

iterations,³ we can bound the total iteration complexity of the AIFB algorithm. From
Theorem 4.4, if we let ε_k := 1/k^q and take k ≥ N_e, with

N_e := (C/ε)^{1/(2q−1)} if 1/2 < q < 3/2,
N_e := (C/ε)^{1/2} if q > 3/2,

we have F(x_k) − F_* ≤ ε, where C > 0 is the constant hidden in the rates given in
Theorem 4.4. Now for each k ≤ Ne , from the hypothesis (5.8) on the complexity of the
internal algorithm, one needs at most Dλ_k/ε_k^{2/p} = Dλ_k k^{2q/p} internal iterations to get an approximate proximal point x_{k+1} in (AIFB) with precision ε_k = 1/k^q. Summing all the internal iterations from 1 to N_e, and using λ_k ≤ λ, we have

N_i = Σ_{k=1}^{N_e} Dλ_k k^{2q/p} ≤ Dλ ∫_0^{N_e} t^{2q/p} dt = (Dλ/(2q/p + 1)) N_e^{2q/p+1},
and hence

N_i = O(1/ε^{(2q/p+1)/(2q−1)}) if 1/2 < q < 3/2,
N_i = O(1/ε^{(2q/p+1)/2}) if q > 3/2.
Adding the costs of internal and external iterations together, we derive the following
proposition.
Proposition 5.2. Suppose problem (5.1) is solved in at most Dλ/ε^{2/p} iterations, for some constants p > 0 and D > 0. Then, the global iteration complexity C_g of (AIFB) plus the inner algorithm is

(5.9) C_g = c_i N_i + c_e N_e = O(1/ε^{(2q/p+1)/(2q−1)}) + O(1/ε^{1/(2q−1)}) if 1/2 < q < 3/2,
      C_g = c_i N_i + c_e N_e = O(1/ε^{(2q/p+1)/2}) + O(1/ε^{1/2}) if q > 3/2,

where c_i and c_e denote the costs of one internal and one external iteration, respectively.

³The constant D depends in the end on y. If dom ω* is bounded, D can be chosen independently of y, since for most algorithms it is majorized by diam(dom ω*).
From the estimates above, one can easily see that, in each case, the lowest global complexity is reached for q → 3/2 and it is

C_g = O(1/ε^{(p+3)/(2p)+δ})
for arbitrarily small δ > 0. For p = 1, as is the case for algorithm (5.6), one obtains a complexity of O(1/ε^{2+δ}). For p = 1/2, which corresponds to the rate of the algorithm studied in [19], we have a global complexity of O(1/ε^{7/2+δ}). We finally note that for p → +∞ we have a complexity of O(1/ε^{1/2+δ}). In other words, the algorithm behaves as an accelerated method.
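These exponents are easy to tabulate. A small sketch (a hypothetical helper of ours) makes the trade-off in q explicit:

    def global_complexity_exponent(p, q):
        """Exponent r such that C_g = O(1/eps^r) in Proposition 5.2, for an inner
        method of rate p and error schedule eps_k = 1/k^q (q != 3/2)."""
        if 0.5 < q < 1.5:
            return max((2 * q / p + 1) / (2 * q - 1), 1 / (2 * q - 1))
        return max((2 * q / p + 1) / 2, 0.5)

    # e.g., for p = 1 the exponents at q = 1.1, 1.3, 1.49 are approximately
    # 2.67, 2.25, 2.01, approaching (p + 3)/(2p) = 2 as q -> 3/2.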
We remark that the analysis of the global complexity given above is valid only
asymptotically, since we did not estimate any of the constants hidden in the O sym-
bols. However, in real situations constants do matter and, in practice, the most
effective accuracy rate q is problem dependent and might be different from 3/2, as we
illustrate in the experiments of subsection 6.3.
6. Numerical experiments. In this section, we present two types of experi-
ments. The first one is designed to illustrate the influence of the errors on the behavior
of AIFB and on its nonaccelerated counterpart IFB (called ISTA in [7]). The second
one is meant to measure the performance of the two-loops algorithm AIFB+algorithm
(5.6), in comparison with IFB+algorithm (5.6), and with the primal-dual algorithm
proposed in [14].
6.1. Experimental set-up. In all the following cases, we consider the regular-
ized least-squares functional
(6.1) F(x) := (1/2)‖Ax − y‖²_Y + g(x),
where H, Y are Euclidean spaces, x ∈ H, y ∈ Y, A : H → Y is a linear operator, and
g : H → R is of type (2.7). In all cases ω will be a norm and the projection onto
S = ∂ω(0) will be explicitly computable.
We minimize F using AIFB, with λ_k = λ = 1/L, where L = ‖A*A‖. We use
ak = 1 (corresponding to FISTA), since we empirically observed that the choice of
ak , if independent of k, does not significantly influence the speed of convergence of
the algorithm (although preliminary tests revealed a slightly better performance for
ak = 0.8). At each iteration, we employ algorithm (5.6) to approximate the proximity
operator of g up to a precision εk . The stopping rule for the inner algorithm is
given by the duality gap, according to Proposition 2.3, item (a). Following Theorem
4.4, we consider sequences of errors of type ε_k = C/k^q, with q, hereafter referred to as the accuracy rate, chosen between 0.1 and 1.7. The coefficient C should be comparable
to the magnitude of the duality gap. In fact, it determines the practical constraint
on the duality gap at the first iterations: the constraint should be active, but not
too demanding to avoid unnecessary precision. We choose C by solving the equation
G(y₀ − λ∇f(y₀), 0) = C²/(2λ), where G is the duality gap corresponding to the first
proximal subproblem encountered in AIFB for k = 0, evaluated at v0 = 0. We finally
consider an “exact” version, obtained by solving the proximal subproblems at the
machine precision.
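In code, this calibration of C amounts to one evaluation of the duality gap of the first subproblem at v₀ = 0 (a sketch reusing the illustrative duality_gap helper from subsection 2.1; names are ours):

    import numpy as np

    def calibrate_C(y0, grad_f, lam, B, omega, omega_conj):
        """Solve G(y0 - lam*grad_f(y0), 0) = C^2/(2*lam) for C, so that the error
        budget eps_0 matches the initial duality gap of the proximal subproblem."""
        w = y0 - lam * grad_f(y0)               # center of the first prox subproblem
        v0 = np.zeros(B.shape[0])
        return np.sqrt(2 * lam * duality_gap(v0, w, lam, B, omega, omega_conj))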
We analyze two well-known problems: deblurring with total variation regulariza-
tion and learning a linear estimator via regularized empirical risk minimization with
the overlapping group lasso penalty. The numerical experiments are divided into two
parts. In the first one, we evaluate the impact of the errors on the convergence rate: in Figure 6.1, the relative objective value versus the number of external iterations for different accuracy rates on the error is shown. We underline that this
study is independent of the algorithm chosen to produce an admissible approximation
of the proximal points.
In the second part, we assess the overall behavior of the two-loops algorithm, as
described in section 5, using algorithm (5.6) to solve the proximal subproblems. We
compare it with the nonaccelerated version (IFB) and the primal-dual (PRIDU) algo-
rithm proposed by [14] for image deconvolution. For all algorithms we provide CPU
time, and the number of external and internal iterations for different precisions. Note
that the cost of each external iteration lies mainly in the evaluation of the gradient of the quadratic part of the objective function (6.1). The internal iteration has a similar form but, since the matrix B is sparse and structured in both experiments, it can be implemented in a fast way. All the numerical experiments have been performed in the MATLAB environment,⁴ on a desktop iMac with an Intel Core i5 CPU (2.5 GHz, 6 MB L3 cache) and 6 GB of RAM.
6.1.1. Deblurring with total variation. Regularization with total variation
[50, 12, 6] is a widely used technique for deblurring and denoising images that preserves sharp edges.
In this problem, H = Y = R^{N×N} is the space of (discrete two-dimensional) images on the grid [1, N]², A is a linear map representing some blurring operator [6], and y is the observed noisy and blurred datum. The (discrete) total variation regularizer is defined as

g = ω ∘ ∇,  g(x) = τ Σ_{i,j=1}^N ‖(∇x)_{i,j}‖₂,

where ∇ : H → H² is the (discrete) gradient operator (see [12] for the precise definition) and ω : H² → R, ω(p) = τ Σ_{i,j=1}^N ‖p_{i,j}‖₂, with τ > 0 a regularization parameter and ‖·‖₂ the Euclidean norm in R². Note that the matrix corresponding to ∇ is highly sparse (it is bidiagonal). This feature has been taken into account to get an efficient implementation.
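For completeness, a sketch of the discrete gradient (one common forward-difference convention; the exact boundary handling of [12] may differ) and of the pixelwise projection onto S = ∂ω(0) used in (5.6):

    import numpy as np

    def grad2d(x):
        """Forward-difference discrete gradient: N x N image -> N x N x 2 field."""
        g = np.zeros(x.shape + (2,))
        g[:-1, :, 0] = x[1:, :] - x[:-1, :]   # vertical differences
        g[:, :-1, 1] = x[:, 1:] - x[:, :-1]   # horizontal differences
        return g

    def project_S_tv(p, tau):
        """Projection onto S = {v : ||v_ij||_2 <= tau for all i,j}: since
        omega(p) = tau * sum_ij ||p_ij||_2, its subdifferential at 0 is S."""
        scale = np.maximum(np.sqrt(np.sum(p ** 2, axis=-1, keepdims=True)) / tau, 1.0)
        return p / scale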
We followed the same experimental setup as in [6]. We considered the 256 × 256
Lena test image, blurred by a 9 × 9 Gaussian blur with standard deviation 4, followed
by additive normal noise with zero mean and standard deviation 10⁻³. The regularization parameter τ was set to 10⁻³. Since the blurring operator A is a convolution
operator, in the implementation it is common to evaluate it by an FFT-based method
(see, e.g., [6, 14]).
6.1.2. Overlapping group lasso. The group lasso penalty is a regularization
term for ill-posed inverse problems arising in statistical learning [64, 33], image processing, and compressed sensing [46], which enforces structured sparsity in the solutions. Regularization with this penalty consists in solving a problem of the form (6.1), where H = R^p, Y = R^m, A is a data or design matrix, and y is a vector of outputs or measurements. Following [33], the overlapping group lasso (OGL) penalty is

(6.2) g(x) = τ Σ_{i=1}^r (Σ_{j∈J_i} (w_j^i)² x_j²)^{1/2},

where J = {J₁, ..., J_r} is a collection of (possibly overlapping) groups of indices and the weights are given by

w_j^i = (1/2)^{a_{ij}},  with a_{ij} = #{J ∈ J : j ∈ J, J ⊂ J_i, J ≠ J_i}.

The OGL penalty is of the form (2.7), with

B_i : R^p → R^{J_i},  B_i x = (w_j^i x_j)_{j∈J_i},

B = (B₁, ..., B_r), and ω : R^{J₁} × ··· × R^{J_r} → R, ω(v₁, ..., v_r) = τ Σ_{i=1}^r ‖v_i‖₂, where ‖·‖₂ is the Euclidean norm in R^{J_i}.
The matrix A and the datum y are generated from the breast cancer dataset
provided by [62]. The dataset consists of expression data for 8,141 genes in 295 breast cancer tumors (78 metastatic and 217 nonmetastatic). The groups are defined
according to the canonical pathways from MSigDB [60], that contains 639 groups of
genes, 637 of which involve genes from the breast cancer dataset. We restrict the
analysis to the 3510 genes that are contained in at least one group. Hence, our data
matrix A consists of 295 different expression levels of 3510 genes. The output vector y
contains the labels (±1, metastatic, or nonmetastatic) for each sample. The structure
of the overlapping groups gives rise to a matrix B of size 15126 × 3510. Despite the
high dimensionality, one can take advantage of its sparseness. We analyze two choices
of the regularization parameter: τ = 0.01 and τ = 0.1.
6.2. Results—Part I. We run AIFB and its nonaccelerated counterpart, IFB, up to 2,000 external iterations. With the aim of maximizing the effect of inexactness, we require algorithm (5.6) to produce solutions with errors close to the upper bounds ε_k²/(2λ) prescribed by the theory. We achieve this by reducing the internal step-size length γ_n and using cold restart, i.e., initializing algorithm (5.6) at each step with v₀ = 0.

As a reference optimal value, F_*, we use the value found after 10,000 iterations of AIFB with error rate q = 1.7.
As shown in Figure 6.1, the empirical convergence rate of (F (xk ) − F∗ )/F∗ is
indeed affected by the accuracy rate q: to smaller values of q correspond slower
convergence rates both for AIFB and the inexact (nonaccelerated) forward-backward
algorithm. When the errors in the computation of the proximity operator do not decay fast enough, the convergence rates deteriorate significantly and the algorithms may even fail to converge to the infimum. If the errors decay sufficiently fast, AIFB shows faster convergence w.r.t. IFB in both experiments. In contrast, this is not true for accuracy rates q < 1, where IFB has practically the same behavior as AIFB.
Moreover, it turns out that AIFB is more sensitive to errors than IFB. This is
more evident in the experiment on TV deblurring. Indeed, for AIFB most curves
corresponding to the different accuracy rates are well separated, while for IFB they
are closer to each other, and often completely overlapped. Yet, the overlapping phe-
nomenon in general starts earlier (lower q) for IFB than AIFB, indicating that no gain
is obtained in increasing the accuracy error rates over a certain level, in accordance
with the theoretical results.
6.3. Results—Part II. This section is the empirical counterpart of subsec-
tion 5.2. Here, we test the global iteration complexity of AIFB and inexact IFB
combined with algorithm (5.6) on the two problems described above. We provide the
Fig. 6.1. Impact of the errors on AIFB and IFB. Log-log plots of relative objective value versus external iterations, k, obtained for TV deblurring (upper row) and the OGL problem with regularization parameter τ = 10⁻¹ (bottom row). AIFB and inexact IFB for different accuracy rates q in the computation of the proximity operator are shown in the left and right columns, respectively. For larger values of the parameter q the curves overlap. It can be seen by visual inspection that the errors affect the acceleration.
number of external iterations and the total number of inner iterations. When taking
into account the cost of computing the proximity operator, there is a trade-off between
the number of external and internal iterations. Since internal and external iterations
in general have different computational costs—which depend on the specific problem
considered and the machine CPU—the total number of iterations is not a good mea-
sure of the algorithm’s performance. For instance, on our computer, the ratio between
the cost of the external and internal iteration is about 2.15 in the TV deblurring and
2.5 in the OGL problem. Therefore, we also report the CPU time needed to reach
a desired accuracy for the relative difference from the optimal value. In this part,
we use the warm-restart procedure, consisting in initializing algorithm (5.6) with the
solution obtained at the previous step. We empirically observed that this initializa-
tion strategy drastically reduces the total number of iterations and speeds up the
algorithm.
We compare AIFB and IFB with PRIDU taken as a benchmark, since it often
outperforms state-of-the-art methods, in particular for TV regularization (see the
numerical section in [14]).
Algorithm PRIDU depends on two parameters,⁵ σ, ρ > 0. In our experiments, we tested two choices, indicated by the authors (in the paper and in the code as well) for the image deblurring and denoising problem: σ = 10 and ρ = 1/(σ‖B‖²), and ρ = 0.01 (corresponding to σ = 1/(ρ‖B‖²) = 12.5 for the TV problem and σ ≈ 1.07 for the OGL problem). We also implemented the algorithm for the OGL problem.
On the other hand, AIFB and IFB depend on the accuracy rate q. We verified
that the best empirical results are obtained choosing q in the range [1, 1.5] for AIFB
and [0.1, 0.5] for IFB. This once more confirms the higher sensitivity to the errors of
the accelerated version w.r.t. the basic one. In Tables 6.1–6.3, we detail the results
only for the most significant choices of q. We remark that the “exact” version of AIFB
(and IFB), where the prox is computed at machine precision at each step, is not even
comparable to the results we reported here.
Table 6.1
Deblurring with TV regularization, τ = 10−3 . Performance evaluation of AIFB, IFB, and
PRIDU, corresponding to different choices of the parameters q and σ, respectively. Concerning
AIFB and IFB, the results are reported only for the q’s giving the best results. The entries in the
table refer to the CPU time (in seconds) needed to reach a relative difference w.r.t. the optimal value below the thresholds 10⁻⁴, 10⁻⁶, and 10⁻⁸, the number of external iterations (# Ext), and
the total number of internal iterations (# Int).
Table 6.2
Breast cancer dataset: OGL τ = 10−1 . Performance evaluation of AIFB, IFB, and PRIDU,
corresponding to different choices of the parameters q and σ, respectively. Concerning AIFB and
IFB, the results are reported only for the q’s giving the best results. The entries in the table refer to
the CPU time (in seconds) needed to reach a relative difference w.r.t. the optimal value below the thresholds 10⁻⁴, 10⁻⁶, and 10⁻⁸, the number of external iterations (# Ext), and the total number
of internal iterations (# Int).
Table 6.3
Breast cancer dataset: OGL, τ = 10−2 . See caption of Table 6.2.
The above rates can be obtained relying on our techniques, and are slower than the
ones given in Theorem 4.4. This is in line with what was obtained in [52, section 4.1].
In [54], using different techniques, the convergence rate O(1/k²) is proved.
ζ := (y − λ∇f(y) − x − ē)/λ ∈ ∂_{ε₁²/(2λ)} g(x).
Remark 7. In the proof of Theorem 4.3, the setup of the parameters defining the estimate sequence for AIFB does not depend on the notion of inexactness for the proximal point. More precisely, starting from the AIFB algorithm, but with the notion of inexact prox (2.15), the same auxiliary sequences (α_k)_{k∈N}, (A_k)_{k∈N}, (ν_k)_{k∈N} can be introduced and all of equations (4.9)–(4.13) remain true. In particular,
(A.2) y_k = (1 − α_k)x_k + α_k ν_k,
(A.3) ν_{k+1} = ν_k − (α_k/((1 − α_k)A_k λ_k))(y_k − x_{k+1}).
The critical point is that now we cannot argue ξ_{k+1} = (y_k − x_{k+1})/λ_k ∈ ∂_{η_k} F(x_{k+1}) anymore, since Lemma 4.2 requires x_{k+1} to be an inexact prox in the sense of (2.3). Hence the construction of the estimate sequence cannot be finalized.
The following lemma overcomes this situation by introducing an estimate sequence
centered on new points uk ’s, which are “close” to the νk ’s. It is the analogue of
Lemma 4.2 for errors of type (2.15).
Lemma A.3. Suppose for some k ∈ N, x_k, u_k, ν_k ∈ H, A_k > 0, and ϕ_k = (ϕ_k)_* + (A_k/2)‖· − u_k‖² are such that F(x_k) ≤ (ϕ_k)_* + δ_k and ‖ν_k − u_k‖ ≤ γ_k for some γ_k, δ_k ≥ 0. Let λ_k > 0, α_k ∈ [0, 1), and assume α_k²/((1 − α_k)A_k λ_k) ≤ 1 and λ_k L ≤ 1. Set y_k = (1 − α_k)x_k + α_k ν_k, w_k = (1 − α_k)x_k + α_k u_k, and x_{k+1} ≃_{ε_k} prox_{λ_k g}(y_k − λ_k ∇f(y_k)) for some ε_k ≥ 0. Then there exist e_k ∈ H and ε_{1k}, ε_{2k} > 0 with ε_{1k}² + ε_{2k}² ≤ ε_k², ‖e_k‖ ≤ ε_{2k} + α_k γ_k, such that, if ϕ_{k+1} is defined according to (3.4) with z_{k+1} = x_{k+1} and

ξ_{k+1} = (w_k − x_{k+1} − e_k)/λ_k,  η_k = (L/2)‖w_k − x_{k+1}‖² + ε_{1k}²/(2λ_k),

we have ξ_{k+1} ∈ ∂_{η_k} F(x_{k+1}) and

(A.4) (1 − α_k)δ_k + (ε_k + α_k γ_k)²/(2λ_k) + (ϕ_{k+1})_* ≥ F(x_{k+1}).
Moreover, if ν_k is updated according to (A.3) and u_{k+1} is the center of ϕ_{k+1} (which is defined according to the second equation in (3.7)), it holds

(A.5) ‖u_{k+1} − ν_{k+1}‖ ≤ γ_k + ε_k/α_k.
for some e_k ∈ H and ε_{1k}, ε_{2k} > 0 with ε_{1k}² + ε_{2k}² ≤ ε_k², ‖e_k‖ ≤ ε_{2k} + α_k γ_k. From Lemma 4.1 it follows that ξ_{k+1} := (w_k − x_{k+1} − e_k)/λ_k = ∇f(w_k) + ζ_{k+1} ∈ ∂_{η_k} F(x_{k+1}). Then, applying
Lemma 3.3, we have
(1 − α_k)δ_k + ε_{1k}²/(2λ_k) + (ϕ_{k+1})_*
  ≥ F(x_{k+1}) + (λ_k/2)(2 − α_k²/(A_{k+1}λ_k))‖ξ_{k+1}‖² + ⟨w_k − (λ_k ξ_{k+1} + x_{k+1}), ξ_{k+1}⟩ − (L/2)‖w_k − x_{k+1}‖²
  = F(x_{k+1}) − (1/(2λ_k))[(α_k²/(A_{k+1}λ_k))‖λ_k ξ_{k+1}‖² − 2⟨w_k − x_{k+1}, λ_k ξ_{k+1}⟩ + λ_k L‖w_k − x_{k+1}‖²]
  ≥ F(x_{k+1}) − (1/(2λ_k))‖w_k − x_{k+1} − λ_k ξ_{k+1}‖²
  = F(x_{k+1}) − (1/(2λ_k))‖e_k‖²,

where in the last inequality we use the assumptions α_k²/(A_{k+1}λ_k) ≤ 1 and λ_k L ≤ 1. Moreover, from the definitions of ε_{1k}, ε_{2k}, and e_k it holds

(ε_k + α_k γ_k)² ≥ ε_{1k}² + ε_{2k}² + 2ε_k α_k γ_k + (α_k γ_k)² ≥ ε_{1k}² + (ε_{2k} + α_k γ_k)² ≥ ε_{1k}² + ‖e_k‖²,
and (A.4) follows. To prove (A.5), first note that from the definition we derive
(A.7) u_k = ν_k + (1/α_k)(w_k − y_k).
Next, by (3.7), (A.7), (A.3), and the definition of e_k in (A.1), we get

u_{k+1} = u_k − (α_k/((1 − α_k)A_k λ_k))(w_k − x_{k+1} − e_k)
  = ν_k − (α_k/((1 − α_k)A_k λ_k))(y_k − x_{k+1}) + (1/α_k)(w_k − y_k) − (α_k/((1 − α_k)A_k λ_k))(w_k − y_k − e_k)
  = ν_{k+1} + (1/α_k)(w_k − y_k) − (α_k²/((1 − α_k)A_k λ_k)) λ_k(∇f(w_k) − ∇f(y_k)) + (α_k²/((1 − α_k)A_k λ_k)) ē_k.
Therefore, recalling that by assumption α_k²/((1 − α_k)A_k λ_k) ≤ 1 and ‖ē_k‖ ≤ ε_{2k}, the Baillon–Haddad theorem implies

‖u_{k+1} − ν_{k+1}‖ ≤ (1/α_k)‖w_k − y_k‖ + ε_{2k} ≤ γ_k + ε_k/α_k.
Proof of Theorem A.1. Taking into account Remark 7 and reasoning by induction,
Lemma A.3 ensures that there exist sequences (ξk )k∈N , (ηk )k∈N , such that ξk+1 ∈
∂ηk F (xk+1 ) and the sequence (ϕk )k∈N constructed according to (3.4) with zk+1 =
x_{k+1}, starting from ϕ₀ = F(x₀) + (A₀/2)‖· − u₀‖² and u₀ = x₀, satisfies
δ_k + (ϕ_k)_* ≥ F(x_k),

with δ₀ = 0 and

δ_{k+1} = (1 − α_k)δ_k + (α_k γ_{k+1})²/(2λ_k),  γ_{k+1} = γ_k + ε_k/α_k,  γ₀ = 0.
This shows that the sequence (δk )k∈N is actually the same studied in [52, IAPPA1].
The statement now follows from the subsequent Theorem 4.5 in [52].
REFERENCES
[1] Y. I. Alber, R. S. Burachik, and A. N. Iusem, A proximal point method for nonsmooth
convex optimization problems in Banach spaces, Abstr. Appl. Anal., 2 (1997), pp. 97–120.
[2] A. Argyriou, C. A. Micchelli, M. Pontil, L. Shen, and Y. Xu, Efficient First Order
Methods for Linear Composite Regularizers, preprint, arXiv:1104.1436v1, 2011.
[3] A. Auslender, Numerical methods for nondifferentiable convex optimization, Nonlinear Anal-
ysis and Optimization, Math. Programming Stud., (1987), pp. 102–126.
[4] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski, Optimization with sparsity-inducing
penalties, Found. Trends Mach. Learn., 4 (2012), pp. 1–106.
[5] H. H. Bauschke and P. L. Combettes, The Baillon-Haddad theorem revisited, J. Convex
Anal., 17 (2010), pp. 781–787.
[6] A. Beck and M. Teboulle, Fast gradient-based algorithms for constrained total variation
image denoising and deblurring, IEEE Trans. Image Process., 18 (2009), pp. 2419–2434.
[7] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse
problems, SIAM J. Imaging Sci., 2 (2009), pp. 183–202.
[8] S. Becker, J. Bobin, and E. Candès, NESTA: A fast and accurate first-order method for
sparse recovery, SIAM J. Imaging Sci., 4 (2011), pp. 1–39.
[9] S. Bonettini and V. Ruggiero, On the convergence of primal-dual hybrid gradient algorithms
for total variation image restoration, J. Math. Imaging Vision, 44 (2012), pp. 1–18.
[10] K. Bredies, A forward-backward splitting algorithm for the minimization of non-smooth convex
functionals in Banach space, Inverse Problems, 25 (2009), 015005.
[11] R. S. Burachik and B. F. Svaiter, A relative error tolerance for a family of generalized
proximal point methods, Math. Oper. Res., 26 (2001), pp. 816–831.
[12] A. Chambolle, An algorithm for total variation minimization and applications, J. Math.
Imaging Vision, 20 (2004), pp. 89–97.
[13] A. Chambolle and P.-L. Lions, Image recovery via total variation minimization and related
problems, Numer. Math., 76 (1997), pp. 167–188.
[14] A. Chambolle and T. Pock, A first-order primal-dual algorithm for convex problems with
applications to imaging, J. Math. Imaging Vision, 40 (2011), pp. 120–145.
[15] C. Chaux, J.-C. Pesquet, and N. Pustelnik, Nested iterative algorithms for convex con-
strained image recovery problems, SIAM J. Imaging Sci., 2 (2009), pp. 730–762.
[16] P. L. Combettes, D. Dũng, and B. C. Vũ, Dualization of signal recovery problems, Set-
Valued Var. Anal., 18 (2010), pp. 373–404.
[17] P. L. Combettes and J.-C. Pesquet, Proximal splitting methods in signal processing, in
Fixed-Point Algorithms for Inverse Problems in Science and Engineering, H. H. Bauschke,
R. Burachik, P. L. Combettes, V. Elser, D. R. Luke, and H. Wolkowicz, eds., Springer-
Verlag, New York, 2011, pp. 185–212.
[45] Y. Nesterov, Gradient Methods for Minimizing Composite Objective Function, Technical re-
port, CORE Discussion Papers from Université Catholique de Louvain, Center for Opera-
tions Research and Econometrics No 2007/076, 2009.
[46] G. Peyré and J. Fadili, Group sparsity with overlapping partition functions, in Proc. EU-
SIPCO 2011, Barcelona, 2011, pp. 303–307.
[47] R. T. Rockafellar, Augmented Lagrangians and applications of the proximal point algorithm
in convex programming, Math. Oper. Res., 1 (1976), pp. 97–116.
[48] R. T. Rockafellar, Monotone operators and the proximal point algorithm, SIAM J. Control
Optim., 14 (1976), pp. 877–898.
[49] L. Rosasco, S. Mosci, M. S. Santoro, A. Verri, and S. Villa, A regularization approach
to nonlinear variable selection, JMLR Workshop Conf. Proc., 9 (2010), pp. 653–660.
[50] L. I. Rudin, S. Osher, and E. Fatemi, Nonlinear total variation based noise removal algo-
rithms, Phys. D, 60 (1992), pp. 259–268.
[51] A. Sabharwal and L. C. Potter, Convexly constrained linear inverse problems: Iterative
least-squares and regularization, IEEE Trans. Signal Process., 46 (1998), pp. 2345–2352.
[52] S. Salzo and S. Villa, Inexact and accelerated proximal point algorithm, J. Convex Anal., 19
(2012).
[53] O. Scherzer, M. Grasmair, H. Grossauer, M. Haltmeier, and F. Lenzen, Variational
Methods in Imaging, Appl. Math. Sci. 167, Springer, New York, 2009.
[54] M. Schmidt, N. Le Roux, and F. Bach, Convergence rates of inexact proximal-gradient
methods for convex optimization, in Advances in Neural Information Processing Systems
24, 2011.
[55] M. V. Solodov and B. F. Svaiter, A hybrid approximate extragradient-proximal point algo-
rithm using the enlargement of a maximal monotone operator, Set-Valued Anal., 7 (1999),
pp. 323–345.
[56] M. V. Solodov and B. F. Svaiter, A comparison of rates of convergence of two inexact
proximal point algorithms, in Nonlinear Optimization and Related Topics (Erice, 1998),
Appl. Optim. 36, Kluwer Academic, Dordrecht, 2000, pp. 415–427.
[57] M. V. Solodov and B. F. Svaiter, Error bounds for proximal point subproblems and associ-
ated inexact proximal point algorithms, Math. Program., 88 (2000), pp. 371–389.
[58] M. V. Solodov and B. F. Svaiter, An inexact hybrid generalized proximal point algorithm
and some new results on the theory of Bregman functions, Math. Oper. Res., 25 (2000),
pp. 214–230.
[59] M. V. Solodov and B. F. Svaiter, A unified framework for some inexact proximal point
algorithms, Numer. Funct. Anal. Optim., 22 (2001), pp. 1013–1035.
[60] A. Subramanian et al., Gene set enrichment analysis: A knowledge-based approach for inter-
preting genome-wide expression profiles, Proc. Natl. Acad. Sci. USA, 102 (2005), p. 15545.
[61] P. Tseng, Approximation accuracy, gradient methods, and error bound for structured convex
optimization, Math. Program., 125 (2010), pp. 263–295.
[62] M. J. Van De Vijver et al., A gene-expression signature as a predictor of survival in breast
cancer, New England J. Med., 347 (2002), pp. 1999–2009.
[63] Y. Yao and N. Shahzad, Strong convergence of a proximal point algorithm with general errors,
Optim. Lett., 6 (2012), pp. 621–628.
[64] M. Yuan and Y. Lin, Model selection and estimation in regression with grouped variables, J.
R. Stat. Soc. Ser. B Stat. Method, 68 (2006), pp. 49–67.
[65] C. Zălinescu, Convex Analysis in General Vector Spaces, World Scientific Publishing, River
Edge, NJ, 2002.
[66] A. J. Zaslavski, Convergence of a proximal point method in the presence of computational
errors in Hilbert spaces, SIAM J. Optim., 20 (2010), pp. 2413–2421.
[67] P. Zhao, G. Rocha, and B. Yu, The composite absolute penalties family for grouped and
hierarchical variable selection, Ann. Statist., 37 (2009), pp. 3468–3497.