The Stochastic Multi-Gradient Algorithm For Multi-Objective Optimization and Its Application To Supervised Machine Learning

S. Liu∗ and L. N. Vicente†

February 8, 2021
Abstract
Optimization of conflicting functions is of paramount importance in decision making, and real-world applications frequently involve data that is uncertain or unknown, resulting in multi-objective optimization (MOO) problems of stochastic type. We study the stochastic multi-gradient (SMG) method, seen as an extension of the classical stochastic gradient method for single-objective optimization.
At each iteration of the SMG method, a stochastic multi-gradient direction is calculated
by solving a quadratic subproblem, and it is shown that this direction is biased even when
all individual gradient estimators are unbiased. We establish rates to compute a point in the
Pareto front, of order similar to what is known for stochastic gradient in both convex and
strongly convex cases. The analysis handles the bias in the multi-gradient and the unknown
a priori weights of the limiting Pareto point.
The SMG method is framed into a Pareto-front type algorithm for calculating an approximation of the entire Pareto front. The Pareto-front SMG algorithm is capable of robustly determining Pareto fronts for a number of synthetic test problems. One can apply it to any stochastic MOO problem arising from supervised machine learning, and we report results for logistic binary classification where multiple objectives correspond to data groups coming from distinct sources.
1 Introduction
In multi-objective optimization (MOO) one attempts to simultaneously optimize several, potentially conflicting functions. MOO has wide applications in all industry sectors where decision making is involved due to the natural appearance of conflicting objectives or criteria. Applications span across applied engineering, operations management, finance, economics and social
∗ Department of Industrial and Systems Engineering, Lehigh University, 200 West Packer Avenue, Bethlehem, PA 18015-1582, USA ([email protected]).
† Department of Industrial and Systems Engineering, Lehigh University, 200 West Packer Avenue, Bethlehem, PA 18015-1582, USA, and Centre for Mathematics of the University of Coimbra (CMUC) ([email protected]). Support for this author was partially provided by FCT/Portugal under grants UID/MAT/00324/2019 and P2020 SAICTPAC/0011/2015.
sciences, agriculture, green logistics, and health systems. When the individual objectives are
conflicting, no single solution exists that optimizes all of them simultaneously. In such cases, the
goal of MOO is then to find Pareto optimal solutions (also known as efficient points), roughly speaking, points for which no other combination of variables leads to a simultaneous improvement in all objectives. The determination of the set of Pareto optimal solutions helps decision makers define the best trade-offs among the several competing criteria.
We start by introducing an MOO problem [22, 37] consisting of the simultaneous minimization of m individual functions
$$\min \; H(x) = (h_1(x), \ldots, h_m(x))^\top \quad \text{s.t.} \quad x \in \mathcal{X}, \tag{1}$$
where h_i : R^n → R are real-valued functions and X ⊆ R^n represents a feasible region. We say
that the MOO problem is smooth if all objective functions hi are continuously differentiable.
Assuming that no point may exist that simultaneously minimizes all objectives, the notion of
Pareto dominance is introduced to compare any two given feasible points x, y ∈ X . One says
that x dominates y if H(x) < H(y) componentwise. A point x ∈ X is a Pareto minimizer if
it is not dominated by any other point in X . The set of all Pareto minimizers includes the
possible multiple minimizers of each individual function. If we want to exclude such points,
one can consider the set of strict Pareto minimizers P, by rather considering a weaker form of dominance (meaning that x weakly dominates y if H(x) ≤ H(y) componentwise and H(x) ≠ H(y)).
In this paper we can broadly speak of Pareto minimizers as the first-order optimality condition
considered will be necessary for both Pareto optimal sets. An important notion in MOO is the Pareto front H(P), formed by mapping all elements of P into the objective space R^m, H(P) = {H(x) : x ∈ P}.
A priori scalarization methods, such as the weighted-sum method [28] and the ε-constraint method [32], exhibit well-known drawbacks: 1) the choice of the scalarization parameters is not straightforward, especially when the objectives have different magnitudes, and can be problematic, e.g., producing infeasibility in the ε-constraint method; 2) in the weighted-sum method, it is frequently observed (even for convex problems) that an evenly distributed set of weights in a simplex fails to produce an even distribution of Pareto minimizers in the front; 3) it might be impossible to find the entire Pareto front if some of the objectives are nonconvex, as is the case for the weighted-sum method. There are scalarization methods with an a posteriori flavor, like the so-called normal boundary intersection method [14], which are able to produce a more evenly distributed set of points on the Pareto front given an evenly distributed set of weights (however, solutions of the method's subproblems may be dominated points in the nonconvex case [27]).
Nonscalarizing a posteriori methods attempt to optimize the individual objectives simultaneously in some sense. The methodologies typically consist of iteratively updating a list of nondominated points, with the goal of approximating the Pareto front. To update such iterate lists, some of these a posteriori methods borrow ideas from population-based heuristic optimization, including Simulated Annealing, Evolutionary Optimization, and Particle Swarm Optimization. NSGA-II [15] and AMOSA [4] are two well-studied population-based heuristic algorithms designed for MOO. However, no theoretical convergence properties can be derived under reasonable assumptions for these methods, and they are slow in practice due to the lack of first-order principles. Other a posteriori methods update the iterate lists by applying steps of rigorous MOO algorithms designed for the computation of a single point in the Pareto front. Such
rigorous MOO algorithms have resulted from generalizing classical algorithms of single-objective
optimization to MOO.
As mentioned above, a number of rigorous algorithms have been developed for MOO by
extending single-objective optimization counterparts. A common feature of these MOO methods
is the attempt to move along a direction that simultaneously decreases all objective functions.
In most instances it is possible to prove convergence to a first-order stationary Pareto point.
Gradient descent is a first example of such a single-objective optimization technique that led to
the multi-gradient (or multiple gradient) method for MOO [24] (see also [21, 19, 17, 18]). As
analyzed in [25], it turns out that the multi-gradient method proposed by [24] shares the same
convergence rates as in the single objective case, for the various cases of nonconvex, convex, and
strongly convex assumptions. Other first-order derivative-based methods that were extended to
MOO include proximal methods [7], nonlinear conjugate gradient methods [40], and trust-region
methods [42, 49]. Newton’s method for multi-objective optimization, further using second-order
information, was first presented in [23] and later studied in [20]. For a complete survey on
multiple gradient-based methods see [27]. Even when derivatives of the objective functions are
not available for use, rigorous techniques were extended along the same lines from one to several objectives, an example being the so-called direct multi-search algorithm [13].
In stochastic optimization, the objective function f(x, w) depends on the decision variable x and on the random variable/parameter w. The goal of stochastic optimization is to seek a solution that optimizes the expectation of f taken with respect to the random variable,
$$\min \; f(x) = \mathbb{E}[f(x, w)], \tag{2}$$
where w is a random variable defined in a probability space (with probability measure independent of x), for which we assume that i.i.d. samples can be observed or generated. An example of interest to us is classification in supervised machine learning, where one wants to build a predictor (defined by x) that maps features into labels (the features and labels can be seen as realizations of w) by minimizing some form of misclassification. The objective function f(x) in (2) is then called the expected risk (of misclassification), for which there is no explicit form since pairs of features and labels are drawn according to an unknown distribution.
There are two widely-used approaches for solving problem (2), the sample average approximation (SAA) method and the stochastic approximation (SA) method. Given N i.i.d. samples {w_j}_{j=1}^N, one optimizes in SAA (see [35, 47]) an empirical approximation of the expected risk
$$\min \; f^N(x) = \frac{1}{N} \sum_{j=1}^{N} f(x, w_j). \tag{3}$$
The SA method becomes an attractive approach in practice when the explicit form of the gradient ∇f(x) for (2) is not accessible or the gradient ∇f^N(x) for (3) is too expensive to compute when N is large. The earliest prototypical SA algorithm, also known as the stochastic gradient (SG) algorithm, dates back to the paper [44], and the classical convergence analysis goes back to the works [12, 45]. In the context of solving (2), the SG algorithm is defined by x_{k+1} = x_k − α_k ∇f(x_k, w_k), where w_k is a copy of w, ∇f(x_k, w_k) is a stochastic gradient generated according to w_k, and α_k is a positive stepsize. When solving problem (3), a realization of w_k may just be a random sample uniformly taken from {w_1, ..., w_N}. Computing the stochastic gradient −∇f(x_k, w_k) based on a single sample makes each iterate of the SG algorithm very cheap. However, only the expectation of −∇f(x_k, w_k) is a descent direction for f at x_k, and therefore the performance of the SG algorithm is quite sensitive to the variance of the stochastic gradient. A well-known idea to improve its performance is the use of a batch gradient at each iterate, namely updating
$$x_{k+1} = x_k - \frac{\alpha_k}{|S_k|} \sum_{j \in S_k} \nabla f(x_k, w_j),$$
where S_k is a minibatch sample from {1, ..., N} of size |S_k|. More advanced variance reduction techniques can be found in [16, 34, 41, 46] (see the review [8]).
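To make the update above concrete, the following minimal Python sketch implements one minibatch step. The per-sample gradient callable grad_f and the samples array are hypothetical placeholders, not part of any specific library.

```python
import numpy as np

def minibatch_sg_step(x, alpha_k, grad_f, samples, batch_size, rng):
    # Draw a minibatch S_k uniformly without replacement from {1, ..., N}.
    S_k = rng.choice(len(samples), size=batch_size, replace=False)
    # Averaged stochastic gradient: (1/|S_k|) * sum_{j in S_k} grad f(x_k, w_j).
    g = np.mean([grad_f(x, samples[j]) for j in S_k], axis=0)
    return x - alpha_k * g  # x_{k+1} = x_k - (alpha_k/|S_k|) * sum of gradients
```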
An example of a stochastic MOO problem is the stochastic vehicle routing problem [29, 39], for which one may want to minimize total transportation costs and maximize customer satisfaction simultaneously, given stochastic information about demand, weather, and traffic conditions. It is also possible that different objectives depend on independent random information. In either case, we denote all the random information/scenarios by the random variable w.
In the finite sum case (3), one has that E[f_i(x, w)] is equal to or can be approximated by
$$f_i^N(x) = \frac{1}{N} \sum_{j=1}^{N} f_i(x, w_j). \tag{5}$$
Our work assumes that X does not involve uncertainty (see the survey [2] for problems with both stochastic objectives and constraints).
The main approaches for solving the SMOO problems are classified into two categories [1, 2,
9]: the multi-objective methods and the stochastic methods. The multi-objective methods first
reduce the SMOO problem into a deterministic MOO problem, and then solve it by techniques
for deterministic MOO (see Subsection 1.1). The stochastic methods first aggregate the SMOO
problem into a single objective stochastic problem and then apply single objective stochastic
optimization methods (see Subsection 1.2.1). Both approaches have disadvantages [9]. Note
that the stochastic objective functions fi , i = 1, . . . , m, may be correlated to each other as they
possibly involve the same set of random information given by w. Without taking this possibility
into consideration, the multi-objective methods might simplify the problem by converting each
stochastic objective to a deterministic counterpart independently of the others. As for the stochastic methods, they obviously inherit the drawbacks of a priori scalarization methods for deterministic MOO. We will nevertheless restrict our attention to multi-objective methods by assuming that the random variables in the individual objectives are independent of each other.
Although the individual stochastic gradients may be unbiased estimates of the true gradients, combining them through the solution of a quadratic subproblem introduces a bias in the overall estimation. A practical implementation and a theoretical analysis of the method necessarily have to take into account the biasedness of the stochastic multi-gradient.
In this paper we first study the bias of the stochastic multi-gradient direction and derive a
condition for the amount of biasedness that is tolerated to achieve convergence at the appropriate
rates. Such a condition will depend on the stepsize but can be enforced by increasing the batch
size used to estimate the individual gradients. Another aspect that introduces more complexity
in the MOO case is not knowing the limiting behavior of the approximate weights generated
by the algorithm when using sampled gradients, or even of the true weights if the subproblem
would be solved using the true gradients. In other words, this amounts to saying that we do not know which point in the Pareto front the algorithm is targeting. We thus develop a convergence analysis measuring the expected gap between S(x_k, λ_k) and S(x_*, a_k), for various possible selections of a_k as approximations for λ^*, where x_k is the current iterate, λ_k are the true weights, and x_* is a Pareto optimal solution (with corresponding weights λ^*). The choice a_k = λ^* requires, however, a stronger assumption, essentially saying that λ_k identifies well the optimal role of λ^*. Our convergence analysis shows that the stochastic multi-gradient algorithm exhibits convergence rates similar to those of the stochastic gradient method in the single-objective case, i.e., O(1/k) for strong convexity and O(1/√k) for convexity.
The practical solution of many MOO problems requires, however, the calculation of the entire Pareto front. Having such a goal in mind also for the stochastic MOO case, we propose a Pareto-front stochastic multi-gradient (PF-SMG) method that iteratively updates a list of nondominated points by applying a certain number of steps of the stochastic multi-gradient method at each point of the list. Such a process generates a number of points which are then added to the list. The main iteration ends by removing possible dominated points from the list. We tested our Pareto-front stochastic multi-gradient method using synthetic MOO problems [13] to which noise was artificially added, and then measured the quality of the approximated Pareto fronts in terms of the so-called Purity and Spread metrics. The new algorithm shows satisfactory performance when compared with a corresponding deterministic counterpart.
We have applied the Pareto-front SMG algorithm to stochastic MOO problems arising from supervised machine learning, in the setting of logistic binary classification where multiple objectives correspond to different sources of data within a set. The determination of the Pareto front can help identify classifiers that trade off such sources or contexts, thus improving the fairness of the classification process.
2 Pareto stationarity and common descent direction in the deterministic multi-objective case
The simplest descent method for solving smooth unconstrained MOO problems, i.e., problem (1) with X = R^n, is the multi-gradient method proposed originally in [24] and further developed in [17, 21]. Each iteration takes a step of the form x_{k+1} = x_k + α_k d_k, where α_k is a positive stepsize and d_k is a common descent direction at the current iterate x_k.
A necessary condition for a point x_k to be a (strict or nonstrict) Pareto minimizer of (1) is that there does not exist any direction that is first-order descent for all the individual objectives, i.e.,
$$\operatorname{range}(J_H(x_k)) \cap (-\mathbb{R}^m_{++}) = \emptyset, \tag{6}$$
where R^m_{++} is the positive orthant cone and J_H(x_k) denotes the Jacobian matrix of H at x_k. Condition (6) characterizes first-order Pareto stationarity. In fact, at a nonstationary point x_k, there must exist a descent direction d ∈ R^n such that ∇h_i(x_k)^⊤ d < 0, i = 1, ..., m, and one could decrease all functions along d.
When m = 1 we simply take d_k = −∇h_1(x_k), the steepest descent or negative gradient direction, which amounts to minimizing ∇h_1(x_k)^⊤ d + (1/2)‖d‖² in d. In MOO (m > 1), the steepest common descent direction [24] is defined by minimizing the amount of first-order Pareto stationarity, also in a regularized Euclidean sense,
$$(d_k, \beta_k) \in \operatorname*{argmin}_{d \in \mathbb{R}^n,\, \beta \in \mathbb{R}} \; \beta + \frac{1}{2}\|d\|^2 \quad \text{s.t.} \quad \nabla h_i(x_k)^\top d - \beta \le 0, \; i = 1, \ldots, m. \tag{7}$$
If x_k is first-order Pareto stationary, then (d_k, β_k) = (0, 0) ∈ R^{n+1}; if not, ∇h_i(x_k)^⊤ d_k ≤ β_k < 0 for all i = 1, ..., m (see [24]). The direction d_k also minimizes max_{1≤i≤m}{∇h_i(x_k)^⊤ d} + (1/2)‖d‖².
It turns out that the dual of (7) is the following subproblem:
$$\lambda_k \in \operatorname*{argmin}_{\lambda \in \Delta^m} \; \Big\| \sum_{i=1}^{m} \lambda_i \nabla h_i(x_k) \Big\|^2, \tag{8}$$
where Δ^m = {λ : Σ_{i=1}^m λ_i = 1, λ_i ≥ 0, i = 1, ..., m} denotes the simplex set. Subproblem (8) reflects the fact that the common descent direction points opposite to the minimum-norm vector in the convex hull of the gradients ∇h_i(x_k), i = 1, ..., m. Hence, the common descent direction, called in this paper a negative multi-gradient, is written as d_k = −Σ_{i=1}^m (λ_k)_i ∇h_i(x_k). In the single-objective case (m = 1), one recovers d_k = −∇h_1(x_k). If x_k is first-order Pareto stationary, then the convex hull of the individual gradients contains the origin, i.e.,
$$\exists\, \lambda \in \Delta^m \;\text{ such that }\; \sum_{i=1}^{m} \lambda_i \nabla h_i(x_k) = 0. \tag{9}$$
When all the objective functions are convex, we have x_k ∈ P if and only if x_k is first-order Pareto stationary [30, 37].
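As an illustration, here is a minimal sketch of how subproblem (8) can be solved numerically. The use of SciPy's SLSQP solver is an implementation choice for this sketch, not the method prescribed in [24].

```python
import numpy as np
from scipy.optimize import minimize

def multi_gradient(grads):
    """Solve (8): minimize ||sum_i lambda_i * grads[i]||^2 over the simplex,
    returning the weights lambda_k and the common descent direction d_k."""
    G = np.asarray(grads)            # m x n matrix whose rows are nabla h_i(x_k)
    m = G.shape[0]
    Q = G @ G.T                      # Gram matrix: objective is lambda^T Q lambda
    res = minimize(lambda lam: lam @ Q @ lam, np.full(m, 1.0 / m),
                   jac=lambda lam: 2.0 * Q @ lam, method="SLSQP",
                   bounds=[(0.0, 1.0)] * m,
                   constraints=[{"type": "eq", "fun": lambda lam: lam.sum() - 1.0}])
    lam = res.x
    return lam, -G.T @ lam           # d_k = -sum_i (lambda_k)_i nabla h_i(x_k)
```

For m = 2 the solution is also available in closed form (a scalar projected onto [0, 1]), which avoids the quadratic programming solver altogether.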
The multi-gradient algorithm [24] consists of taking x_{k+1} = x_k + α_k d_k, where d_k results from the solution of any of the above subproblems and α_k is a positive stepsize. The norm of d_k is a natural stopping criterion. Selecting α_k either by backtracking until an appropriate sufficient decrease condition is satisfied, or by taking a fixed stepsize inversely proportional to the maximum of the Lipschitz constants of the gradients of the individual functions, leads to the classical sublinear rates of 1/√k and 1/k in the nonconvex and convex cases, respectively, and to a linear rate in the strongly convex case [25].
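A minimal sketch of this deterministic loop, reusing multi_gradient (and the numpy import) from the sketch above, with a fixed stepsize and the norm of d_k as stopping criterion; the tolerance and iteration cap are arbitrary illustrative choices.

```python
def multi_gradient_descent(x0, grad_fns, alpha, tol=1e-8, max_iter=1000):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        _, d = multi_gradient([g(x) for g in grad_fns])  # solve (8) at x_k
        if np.linalg.norm(d) <= tol:                     # ||d_k|| as stopping test
            break
        x = x + alpha * d                                # x_{k+1} = x_k + alpha d_k
    return x
```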
3 The stochastic multi-gradient (SMG) algorithm
In the stochastic setting, let g_i(x_k, w_k) denote the stochastic gradient of the i-th objective at x_k. Replacing the true gradients in (8) by these estimates leads to the subproblem
$$\lambda^g(x_k, w_k) \in \operatorname*{argmin}_{\lambda \in \Delta^m} \; \Big\| \sum_{i=1}^{m} \lambda_i \, g_i(x_k, w_k) \Big\|^2, \tag{10}$$
where the convex combination coefficients λ_k^g = λ^g(x_k, w_k) depend on x_k and on the random variable w_k. Let us denote the stochastic multi-gradient by
$$g(x_k, w_k) = \sum_{i=1}^{m} \lambda_i^g(x_k, w_k) \, g_i(x_k, w_k). \tag{11}$$
Analogously to the unconstrained deterministic case, each iterative update of the SMG
algorithm takes the form xk+1 = xk −αk g(xk , wk ), where αk is a positive step size and g(xk , wk ) is
the stochastic multi-gradient. More generally, when considering a closed and convex constrained
set X different from Rn , we need to first orthogonally project xk − αk g(xk , wk ) onto X (such
projection is well defined and results from the solution of a convex optimization problem). The
SMG algorithm is described as follows.
As in the stochastic gradient method, there is also no good stopping criterion for the SMG
algorithm, and one may have just to impose a maximum number of iterations.
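A minimal sketch of one SMG iteration follows, reusing the multi_gradient solver from Section 2 and assuming, for illustration only, that X is a box [lb, ub] so that the projection P_X is a componentwise clip (the experiments later in the paper indeed use box constraints).

```python
import numpy as np

def smg_step(x, alpha_k, stoch_grads, lb, ub):
    # Solve subproblem (10) on the sampled gradients g_i(x_k, w_k) to form
    # the stochastic multi-gradient g(x_k, w_k) = -d.
    lam_g, d = multi_gradient(stoch_grads)
    # x_{k+1} = P_X(x_k - alpha_k g(x_k, w_k)); clip projects onto the box.
    return np.clip(x + alpha_k * d, lb, ub)
```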
4 Biasedness of the stochastic multi-gradient
Figure 1 provides intuition for Subproblems (8) and (10) and their solutions when n = m = 2. In this section, for simplicity, we omit the index k. Let g_1^1 and g_2^1 be two unbiased estimates of the true gradient ∇f_1(x) for the first objective function, and g_1^2 and g_2^2 be two unbiased estimates of the true gradient ∇f_2(x) for the second objective function. Then g_1 and g_2, the stochastic multi-gradients from solving (10), are estimates of the true multi-gradient g obtained from solving (8).
[Figure 1: the true multi-gradient g obtained from ∇f_1(x) and ∇f_2(x), and the stochastic multi-gradients g_1 and g_2 obtained from the pairs (g_1^1, g_1^2) and (g_2^1, g_2^2).]
[Figure 2: norm of the expected error of the stochastic multi-gradient as a function of the batch size: (a) using the coefficients λ^g; (b) using the true coefficients λ.]
Biasedness is also present when we look at the norm of the expected error ‖E_w[g(x, w)] − ∇_x S(x, λ)‖, using the true coefficients λ, where S(x, λ) = Σ_{i=1}^m λ_i f_i(x) denotes the weighted function. In the same setting of the previous experiment, Figure 2 (b) shows that biasedness still exists, although in a smaller quantity than when using λ^g.
We will use E_{w_k}[·] to denote the expected value taken with respect to w_k. Notice that x_{k+1} is a random variable depending on w_k whereas x_k is not.
Now we propose our assumptions on the amount of biasedness and variance of the stochastic multi-gradient g(x_k, w_k). As commonly seen in the literature on the standard stochastic gradient method [8, 38], we assume that the individual stochastic gradients g_i(x_k, w_k), i = 1, ..., m, are unbiased estimates of the corresponding true gradients and that their variance is bounded by the size of these gradients (Assumptions (a) and (c) below). However, an assumption is also needed to bound the amount of biasedness of the stochastic multi-gradient in terms of the true gradient norm and the stepsize α_k (Assumption (b) below).
Assumption 5.2 For all objective functions f_i, i = 1, ..., m, and iterates k ∈ N, the individual stochastic gradients g_i(x_k, w_k) satisfy the following:
(a) (Unbiasedness) E_{w_k}[g_i(x_k, w_k)] = ∇f_i(x_k).
(b) (Bound on the first moment) There exist positive scalars C_i > 0 and Ĉ_i > 0 such that
$$\mathbb{E}_{w_k}\big[\| g_i(x_k, w_k) - \nabla f_i(x_k) \|\big] \le \alpha_k \big( C_i + \hat{C}_i \| \nabla f_i(x_k) \| \big). \tag{13}$$
(c) (Bound on the second moment) There exist positive scalars G_i > 0 and Ĝ_i > 0 such that
$$\mathbb{V}_{w_k}[g_i(x_k, w_k)] \le G_i^2 + \hat{G}_i^2 \| \nabla f_i(x_k) \|^2.$$
In fact, based on inequality (13), one can derive an upper bound for the biasedness of the stochastic multi-gradient:
$$\begin{aligned}
\big\| \mathbb{E}_{w_k}[g(x_k, w_k) - \nabla_x S(x_k, \lambda_k^g)] \big\| &\le \mathbb{E}_{w_k}\big[ \| g(x_k, w_k) - \nabla_x S(x_k, \lambda_k^g) \| \big] \\
&= \mathbb{E}_{w_k}\Big[ \Big\| \sum_{i=1}^{m} (\lambda_k^g)_i \big( g_i(x_k, w_k) - \nabla f_i(x_k) \big) \Big\| \Big] \\
&\le \sum_{i=1}^{m} \mathbb{E}_{w_k}\big[ \| g_i(x_k, w_k) - \nabla f_i(x_k) \| \big] \\
&\le \alpha_k \Big( \sum_{i=1}^{m} C_i + \sum_{i=1}^{m} \hat{C}_i \| \nabla f_i(x_k) \| \Big),
\end{aligned}$$
where the first inequality results from Jensen's inequality in the context of probability theory.
As a consequence, we have
$$\big\| \mathbb{E}_{w_k}[g(x_k, w_k) - \nabla_x S(x_k, \lambda_k^g)] \big\| \le \alpha_k \Big( M_1 + M_F \sum_{i=1}^{m} \| \nabla f_i(x_k) \| \Big) \tag{14}$$
with M_1 = Σ_{i=1}^m C_i and M_F = max_{1≤i≤m} Ĉ_i. Note that we could have imposed directly the assumption
$$\big\| \mathbb{E}_{w_k}[g(x_k, w_k) - \nabla_x S(x_k, \lambda_k^g)] \big\| \le \alpha_k \big( M_1 + M_F \, \mathbb{E}_{w_k}[\| \nabla_x S(x_k, \lambda_k^g) \|] \big),$$
from which (14) would have easily followed. However, we will see later that we also need the more general version stated in Assumption 5.2 (b).
Using Assumptions 5.2 (a) and (c), we can generalize the bound on the variance of the individual stochastic gradients g_i(x_k, w_k) to the stochastic multi-gradient g(x_k, w_k). In fact, we first note that
$$\mathbb{E}_{w_k}[\| g_i(x_k, w_k) \|^2] = \mathbb{V}_{w_k}[g_i(x_k, w_k)] + \big\| \mathbb{E}_{w_k}[g_i(x_k, w_k)] \big\|^2 \le G_i^2 + (\hat{G}_i^2 + 1) \| \nabla f_i(x_k) \|^2,$$
from which we then obtain
$$\begin{aligned}
\mathbb{E}_{w_k}[\| g(x_k, w_k) \|^2] &= \mathbb{E}_{w_k}\Big[ \Big\| \sum_{i=1}^{m} (\lambda_k^g)_i \, g_i(x_k, w_k) \Big\|^2 \Big] \\
&\le \mathbb{E}_{w_k}\Big[ m \sum_{i=1}^{m} \| g_i(x_k, w_k) \|^2 \Big] \\
&\le m \sum_{i=1}^{m} \big( G_i^2 + (\hat{G}_i^2 + 1) \| \nabla f_i(x_k) \|^2 \big) \\
&= G^2 + G_V^2 \sum_{i=1}^{m} \| \nabla f_i(x_k) \|^2,
\end{aligned}$$
with G² = m Σ_{i=1}^m G_i² and G_V² = m max_{1≤i≤m}(Ĝ_i² + 1). Note that the obtained inequality is consistent with imposing directly a bound of the form
$$\mathbb{E}_{w_k}[\| g(x_k, w_k) \|^2] \le G^2 + G_V^2 \, \mathbb{E}_{w_k}[\| \nabla_x S(x_k, \lambda_k^g) \|]^2.$$
Assumption 5.3 The feasible region X is a convex and compact set.
The above assumption implies the existence of an upper bound on the diameter of the feasible region, i.e., there exists a positive constant Θ such that
$$\max_{x, y \in \mathcal{X}} \| x - y \| \le \Theta. \tag{15}$$
Note that from Assumption 5.1 and (15), the norm of the true gradient of each objective function is bounded, i.e., ‖∇f_i(x)‖ ≤ M_∇ + LΘ, for i = 1, ..., m and any x ∈ X, where M_∇ denotes the largest of the norms of the ∇f_i at an arbitrary point of X. For conciseness, denote L_{∇S} = M_1 + mM_F(M_∇ + LΘ) and L_g² = G² + mG_V²(M_∇ + LΘ)². Hence, we have
$$\big\| \mathbb{E}_{w_k}[g(x_k, w_k) - \nabla_x S(x_k, \lambda_k^g)] \big\| \le \alpha_k L_{\nabla S} \tag{16}$$
and
$$\mathbb{E}_{w_k}[\| g(x_k, w_k) \|^2] \le L_g^2. \tag{17}$$
Lastly, we need to bound the sensitivity of the solution of Subproblem (8), a result that follows locally from classical sensitivity theory but that we assume globally too.
Assumption 5.4 (Subproblem Lipschitz continuity) The optimal solution of Subproblem (8) is a Lipschitz continuous function of the parameters {∇f_i(x), 1 ≤ i ≤ m}, i.e., there exists a scalar β > 0 such that
$$\| \lambda_k - \lambda_s \| \le \beta \, \big\| \big[ (\nabla f_1(x_k) - \nabla f_1(x_s))^\top, \ldots, (\nabla f_m(x_k) - \nabla f_m(x_s))^\top \big] \big\|.$$
The above assumption states that the function λ = λ(v_1, ..., v_m), where v_i ∈ R^n, i = 1, ..., m, is Lipschitz continuous in (v_1, ..., v_m). Recall that the only change from Subproblem (8) to (10) is the replacement of the true gradients by the corresponding stochastic gradients. As a consequence, the optimal solutions of Subproblems (8) and (10) satisfy
$$\begin{aligned}
\mathbb{E}_{w_k}[\| \lambda_k^g - \lambda_k \|] &\le \beta \, \mathbb{E}_{w_k}\big[ \big\| \big[ (g_1(x_k, w_k) - \nabla f_1(x_k))^\top, \ldots, (g_m(x_k, w_k) - \nabla f_m(x_k))^\top \big] \big\| \big] \\
&\le \beta \sum_{i=1}^{m} \mathbb{E}_{w_k}\big[ \| g_i(x_k, w_k) - \nabla f_i(x_k) \| \big] \\
&\le \alpha_k (\beta L_{\nabla S}),
\end{aligned} \tag{18}$$
where L_{∇S} is the constant defined in (16). Since ∇_x S(x, λ) is a linear function of λ,
$$\| \nabla_x S(x_k, \lambda_k^g) - \nabla_x S(x_k, \lambda_k) \| \le M_S \| \lambda_k^g - \lambda_k \|,$$
with M_S = √(mn)(M_∇ + LΘ). Taking expectation over w_k and using (18), one obtains
$$\mathbb{E}_{w_k}\big[ \| \nabla_x S(x_k, \lambda_k^g) - \nabla_x S(x_k, \lambda_k) \| \big] \le \alpha_k (\beta L_{\nabla S} M_S). \tag{19}$$
Assumption 5.5 (Strong convexity) For any λ ∈ Δ^m, the weighted function S(·, λ) is strongly convex with constant c > 0, i.e.,
$$S(\bar{x}, \lambda) \ge S(x, \lambda) + \nabla_x S(x, \lambda)^\top (\bar{x} - x) + \frac{c}{2} \| \bar{x} - x \|^2, \quad \forall (x, \bar{x}) \in \mathbb{R}^n \times \mathbb{R}^n. \tag{20}$$
Theorem 5.1 (sublinear convergence rate under strong convexity) Let Assumptions 5.1–5.5 hold and let x_* be any point in X. Consider the diminishing stepsize sequence α_k = 2/(c(k+1)). The sequence of iterates generated by Algorithm 1 satisfies
$$\min_{s=1,\ldots,k} \mathbb{E}[S(x_s, \lambda_s)] - \mathbb{E}[S(x_*, \bar{\lambda}_k)] \le \frac{2L_g^2 + 4\Theta(L_{\nabla S} + \beta L_{\nabla S} M_S)}{c(k+1)},$$
where λ̄_k = Σ_{s=1}^k (s / Σ_{s'=1}^k s') λ_s ∈ Δ^m.
Proof. For any k ∈ N, considering that the projection operation is non-expansive, one can write
$$\begin{aligned}
\mathbb{E}_{w_k}[\| x_{k+1} - x_* \|^2] &= \mathbb{E}_{w_k}[\| P_{\mathcal{X}}(x_k - \alpha_k g(x_k, w_k)) - x_* \|^2] \\
&\le \mathbb{E}_{w_k}[\| x_k - \alpha_k g(x_k, w_k) - x_* \|^2] \\
&= \| x_k - x_* \|^2 + \alpha_k^2 \, \mathbb{E}_{w_k}[\| g(x_k, w_k) \|^2] - 2\alpha_k \, \mathbb{E}_{w_k}[g(x_k, w_k)]^\top (x_k - x_*).
\end{aligned} \tag{21}$$
Adding the null term 2α_k(E_{w_k}[∇_x S(x_k, λ_k^g)] − E_{w_k}[∇_x S(x_k, λ_k^g)] + ∇_x S(x_k, λ_k) − ∇_x S(x_k, λ_k)) to the right-hand side yields
$$\begin{aligned}
\mathbb{E}_{w_k}[\| x_{k+1} - x_* \|^2] \le{}& \| x_k - x_* \|^2 + \alpha_k^2 \, \mathbb{E}_{w_k}[\| g(x_k, w_k) \|^2] - 2\alpha_k \nabla_x S(x_k, \lambda_k)^\top (x_k - x_*) \\
&+ 2\alpha_k \big\| \mathbb{E}_{w_k}[g(x_k, w_k) - \nabla_x S(x_k, \lambda_k^g)] \big\| \, \| x_k - x_* \| \\
&+ 2\alpha_k \big\| \mathbb{E}_{w_k}[\nabla_x S(x_k, \lambda_k^g) - \nabla_x S(x_k, \lambda_k)] \big\| \, \| x_k - x_* \|.
\end{aligned} \tag{22}$$
Choosing λ = λ_k, x = x_k, and x̄ = x_* in inequality (20), one has
$$\nabla_x S(x_k, \lambda_k)^\top (x_k - x_*) \ge S(x_k, \lambda_k) - S(x_*, \lambda_k) + \frac{c}{2} \| x_k - x_* \|^2. \tag{23}$$
Then, plugging inequalities (16), (17), (19), and (23) into (22), we obtain
$$\mathbb{E}_{w_k}[\| x_{k+1} - x_* \|^2] \le (1 - \alpha_k c) \| x_k - x_* \|^2 + \alpha_k^2 \big( L_g^2 + 2\Theta(L_{\nabla S} + \beta L_{\nabla S} M_S) \big) - 2\alpha_k \, \mathbb{E}_{w_k}[S(x_k, \lambda_k) - S(x_*, \lambda_k)].$$
For simplicity denote M = L_g² + 2Θ(L_{∇S} + βL_{∇S}M_S). Using α_k = 2/(c(k+1)) and rearranging the last inequality,
$$\begin{aligned}
\mathbb{E}_{w_k}[S(x_k, \lambda_k) - S(x_*, \lambda_k)] &\le \frac{(1 - \alpha_k c)\| x_k - x_* \|^2 + \alpha_k^2 M - \mathbb{E}_{w_k}[\| x_{k+1} - x_* \|^2]}{2\alpha_k} \\
&\le \frac{c(k-1)}{4} \| x_k - x_* \|^2 - \frac{c(k+1)}{4} \mathbb{E}_{w_k}[\| x_{k+1} - x_* \|^2] + \frac{M}{c(k+1)}.
\end{aligned}$$
Now we replace k by s in the above inequality. Taking the total expectation, multiplying both sides by s, and summing over s = 1, ..., k yields
$$\begin{aligned}
\sum_{s=1}^{k} s \big( \mathbb{E}[S(x_s, \lambda_s)] - \mathbb{E}[S(x_*, \lambda_s)] \big) &\le \sum_{s=1}^{k} \Big( \frac{c s(s-1)}{4} \mathbb{E}[\| x_s - x_* \|^2] - \frac{c s(s+1)}{4} \mathbb{E}[\| x_{s+1} - x_* \|^2] \Big) + \sum_{s=1}^{k} \frac{s}{c(s+1)} M \\
&\le -\frac{c}{4} k(k+1) \, \mathbb{E}[\| x_{k+1} - x_* \|^2] + \sum_{s=1}^{k} \frac{s}{c(s+1)} M \\
&\le \frac{k}{c} M.
\end{aligned}$$
Dividing both sides of the last inequality by Σ_{s=1}^k s gives us
$$\frac{\sum_{s=1}^{k} s \, \mathbb{E}[S(x_s, \lambda_s)] - \sum_{s=1}^{k} s \, \mathbb{E}[S(x_*, \lambda_s)]}{\sum_{s=1}^{k} s} \le \frac{kM}{c \sum_{s=1}^{k} s} \le \frac{2M}{c(k+1)}. \tag{24}$$
The left-hand side is handled as follows:
$$\min_{s=1,\ldots,k} \mathbb{E}[S(x_s, \lambda_s)] - \mathbb{E}[S(x_*, \bar{\lambda}_k)] \le \sum_{s=1}^{k} \frac{s}{\sum_{s'=1}^{k} s'} \mathbb{E}[S(x_s, \lambda_s)] - \sum_{s=1}^{k} \frac{s}{\sum_{s'=1}^{k} s'} \mathbb{E}[S(x_*, \lambda_s)], \tag{25}$$
where λ̄_k = Σ_{s=1}^k (s / Σ_{s'=1}^k s') λ_s. The proof is finally completed by combining (24) and (25).
Since the sequence {λ_k}_{k∈N} generated by Algorithm 1 is bounded, it has a limit point λ^*. Assume that the whole sequence {λ_k}_{k∈N} converges to λ^*. Let x_* be the unique minimizer of S(x, λ^*). Then x_* is a Pareto minimizer associated with λ^*. Since λ̄_k also converges to λ^*, E[S(x_*, λ̄_k)] converges to E[S(x_*, λ^*)]. Hence, Theorem 5.1 states that min_{1≤s≤k} E[S(x_s, λ_s)] converges to E[S(x_*, λ^*)], and it indicates that the approximate rate of such convergence is 1/k. Rigorously speaking, since we do not have λ^* on the left-hand side but rather λ̄_k, the left-hand side is not even guaranteed to be positive. The difficulty comes from the fact that λ^* is only defined at convergence, and the multi-gradient method cannot anticipate which optimal weights are being approached, or, in equivalent words, which weighted function is being minimized at the end. Such a difficulty is resolved if we assume that λ_k approximates well the role of λ^* at the Pareto front.
Assumption 5.6 Let x_* be the Pareto minimizer defined above. For any x_k, one has
$$\nabla_x S(x_*, \lambda_k)^\top (x_k - x_*) \ge 0.$$
In fact, notice that ∇_x S(x_*, λ^*) = 0 holds according to the Pareto stationarity condition (9), and thus this assumption would hold with λ_k replaced by λ^*.
A well-known equivalent condition to (20) is
$$(\nabla_x S(x, \lambda) - \nabla_x S(\bar{x}, \lambda))^\top (x - \bar{x}) \ge c \| x - \bar{x} \|^2, \quad \forall (x, \bar{x}) \in \mathbb{R}^n \times \mathbb{R}^n.$$
Choosing x = x_k, x̄ = x_*, and λ = λ_k in the above inequality and using Assumption 5.6 leads to
$$\nabla_x S(x_k, \lambda_k)^\top (x_k - x_*) \ge c \| x_k - x_* \|^2, \tag{26}$$
based on which one can derive a stronger convergence result².
Theorem 5.2 Let Assumptions 5.1–5.6 hold and let x_* be the Pareto minimizer corresponding to the limit point λ^* of the sequence {λ_k}. Consider the diminishing stepsize sequence α_k = γ/k, where γ > 1/(2c) is a positive constant. The sequence of iterates generated by Algorithm 1 satisfies
$$\mathbb{E}[\| x_k - x_* \|^2] \le \frac{\max\{ 2\gamma^2 \bar{M} (2c\gamma - 1)^{-1}, \| x_0 - x_* \|^2 \}}{k}$$
and
$$\mathbb{E}[S(x_k, \lambda^*)] - \mathbb{E}[S(x_*, \lambda^*)] \le \frac{(L/2) \max\{ 2\gamma^2 \bar{M} (2c\gamma - 1)^{-1}, \| x_0 - x_* \|^2 \}}{k},$$
where M̄ = L_g² + 2Θ(L_{∇S} + βL_{∇S}M_S).
² Let us see how Assumption 5.6 relates to Assumption H5 used in [43]. These authors have made the strong assumption that the noisy values satisfy a.s. f_i(x, w) − f_i(x^⊥, w) ≥ C_i‖x − x^⊥‖² for all x, where x^⊥ is the point in P closest to x (and C_i a positive constant). From here they easily deduce from the convexity of the individual functions f_i that E_{w_k}[g(x_k, w_k)]^⊤(x_k − x_k^⊥) ≥ 0, which then leads to establishing that E[‖x_k − x_k^⊥‖²] = O(1/k). Notice that E_{w_k}[g(x_k, w_k)]^⊤(x_k − x_k^⊥) ≥ 0 would also result from (26) (with x_* replaced by x_k^⊥) if g(x_k, w_k) were an unbiased estimator of ∇_x S(x_k, λ_k).
Proof. Similarly to the proof of Theorem 5.1, from (21) to (22), but using (26) instead of (23), one has
$$\mathbb{E}_{w_k}[\| x_{k+1} - x_* \|^2] \le (1 - 2\alpha_k c) \| x_k - x_* \|^2 + \alpha_k^2 \bar{M}.$$
Using α_k = γ/k with γ > 1/(2c) and an induction argument (see [38, Eq. (2.9) and (2.10)]) leads us to
$$\mathbb{E}[\| x_k - x_* \|^2] \le \frac{\max\{ 2\gamma^2 \bar{M} (2c\gamma - 1)^{-1}, \| x_0 - x_* \|^2 \}}{k}.$$
Finally, from an expansion using the Lipschitz continuity of ∇_x S(·, λ^*) (see (12)), one can also derive a sublinear rate in terms of the optimality gap of the weighted function value:
$$\begin{aligned}
\mathbb{E}[S(x_k, \lambda^*)] - \mathbb{E}[S(x_*, \lambda^*)] &\le \mathbb{E}[\nabla_x S(x_*, \lambda^*)]^\top (x_k - x_*) + \frac{L}{2} \mathbb{E}[\| x_k - x_* \|^2] \\
&\le \frac{(L/2) \max\{ 2\gamma^2 \bar{M} (2c\gamma - 1)^{-1}, \| x_0 - x_* \|^2 \}}{k}.
\end{aligned}$$
Assumption 5.7 All the objective functions fi : Rn → R are convex, i = 1, . . . , m. The convex
function S(·, λ) attains a minimizer for any λ ∈ ∆m .
Theorem 5.3 (sublinear convergence rate under convexity) Let Assumptions 5.1–5.4 and 5.7 hold and let x_* be any point in X. Consider the diminishing stepsize sequence α_k = ᾱ/√k, where ᾱ is any positive constant. The sequence of iterates generated by Algorithm 1 satisfies
$$\min_{s=1,\ldots,k} \mathbb{E}[S(x_s, \lambda_s)] - \mathbb{E}[S(x_*, \bar{\lambda}_k)] \le \frac{\frac{\Theta^2}{2\bar{\alpha}} + \bar{\alpha} \big( L_g^2 + 2\Theta(L_{\nabla S} + \beta L_{\nabla S} M_S) \big)}{\sqrt{k}},$$
where λ̄_k = (1/k) Σ_{s=1}^k λ_s ∈ Δ^m.
5.3 Imposing a bound on the biasedness of the multi-gradient
Recall that from
$$\big\| \mathbb{E}_w[g(x, w) - \nabla_x S(x, \lambda^g)] \big\| \le \sum_{i=1}^{m} \mathbb{E}_w\big[ \| g_i(x, w) - \nabla f_i(x) \| \big], \tag{27}$$
where g_i(x, w) is the stochastic gradient at x for the i-th objective function, and from Assumption 5.2 (b), we derived the more general bound (14) for the biasedness of the stochastic multi-gradient, whose right-hand side involves the stepsize α_k. For simplicity, we again omit the index k in the subsequent analysis.
We will see that the right-hand side of (27) can always be (approximately) bounded by a dynamic sampling strategy when calculating the stochastic gradients for each objective function. The idea is similar to mini-batch stochastic gradient, in the sense that increasing the batch size reduces the noise and thus yields more accurate gradient estimates.
Assumption 5.2 (a) states that g_i(x, w), i = 1, ..., m, are unbiased estimates of the corresponding true gradients. Let us assume that g_i(x, w) is normally distributed with mean ∇f_i(x) and variance σ_i², i.e., g_i(x, w) ∼ N(∇f_i(x), σ_i² I_n), where n is the dimension of x. For each objective function, one can obtain a more accurate stochastic gradient estimate by increasing the batch size. Let b_i be the batch size for the i-th objective function and
$$\bar{g}_i(x, w) = \frac{1}{b_i} \sum_{r=1}^{b_i} g_i(x, w_r)$$
be the corresponding batch stochastic gradient, where {w_r}_{1≤r≤b_i} are realizations of w. Then G_i = g_i(x, w) − ∇f_i(x) and Ḡ_i = ḡ_i(x, w) − ∇f_i(x), i = 1, ..., m, are all random variables with mean 0. The relationship between G_i and Ḡ_i is captured by (see [26])
$$\mathbb{V}_w[\bar{G}_i] \le \frac{\mathbb{V}_w[G_i]}{b_i} \le \frac{\sigma_i^2}{b_i}.$$
By the definition of variance V_w[G_i] = E_w[‖G_i‖²] − ‖E_w[G_i]‖², the inequality ‖E_w[Ḡ_i]‖ ≤ E_w[‖Ḡ_i‖], and E_w[Ḡ_i] = 0, one has V_w[‖Ḡ_i‖] ≤ V_w[Ḡ_i] ≤ σ_i²/b_i. Then, replacing g_i(x, w) in (27) by ḡ_i(x, w), we have
$$\big\| \mathbb{E}_w[\bar{g}(x, w) - \nabla_x S(x, \lambda^g)] \big\| \le \sum_{i=1}^{m} \mathbb{E}_w[\| \bar{G}_i \|] \le \sum_{i=1}^{m} \frac{\sigma_i \sqrt{n}}{\sqrt{b_i}},$$
where the last inequality results from E[‖X‖] ≤ σ√n for a random variable X ∼ N(0, σ² I_n) [10]. Hence, one could enforce an inequality of the form Σ_{i=1}^m σ_i√n/√b_i ≤ α_k(M_1 + M_F Σ_{i=1}^m ‖∇f_i(x)‖) to guarantee that (14) holds (of course replacing the size of the true gradients by some positive constant). Furthermore, to guarantee that the stronger bound (13) holds, one can require
$$\mathbb{E}_w[\| \bar{G}_i \|] \le \frac{\sigma_i \sqrt{n}}{\sqrt{b_i}} \le \alpha \big( C_i + \hat{C}_i \| \nabla f_i(x) \| \big)$$
for each objective function. Intuitively, when smaller stepsizes are taken, the sample sizes {b_i}_{1≤i≤m} should be increased or, correspondingly, smaller sample variances {σ_i}_{1≤i≤m} are required.
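Under the Gaussian model above, the last requirement can be turned into an explicit batch-size rule. The following sketch simply inverts that inequality; in practice ‖∇f_i(x)‖ would itself be replaced by an estimate or a positive constant, as noted above.

```python
import math

def required_batch_size(sigma_i, n, alpha, C_i, C_hat_i, grad_norm_i):
    # Smallest b_i with sigma_i*sqrt(n)/sqrt(b_i) <= alpha*(C_i + C_hat_i*||grad f_i(x)||),
    # which enforces the per-objective bias bound (13).
    bound = alpha * (C_i + C_hat_i * grad_norm_i)
    return math.ceil((sigma_i * math.sqrt(n) / bound) ** 2)
```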
6 The Pareto-front stochastic multi-gradient (PF-SMG) algorithm
The idea of such an algorithm is to iteratively update a list of nondominated points which will render increasingly better approximations of the true Pareto front. The list is updated by essentially applying the
SMG algorithm at some or all of its current points. The PF-SMG algorithm begins with a list
of (possibly random) starting points L_0. At each iteration, before applying SMG and for the sake of better performance, we first add to the list a certain number of perturbed points around
each of the current ones. Then we apply a certain number of SMG steps multiple times at each
point in the current list, adding each resulting final point to the list. Note that by applying
SMG multiple times starting from the very same point, one obtains different final points due to
stochasticity. The iteration is finished by removing all dominated points from the list. Since the new list L_{k+1} is obtained by removing dominated points from L_k ∪ L_k^{new}, where L_k^{new} is the set of new points added to the current nondominated list L_k, we only need to compare each point in L_k^{new} to the other points in L_k ∪ L_k^{new} in order to remove any dominated points [13]. The PF-SMG algorithm is formally described as follows.
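As a minimal illustration of the list-update step just described (a sketch, not the full PF-SMG listing), with each list entry represented by its vector of objective values:

```python
def dominates(u, v):
    # Weak Pareto dominance (Section 1): u <= v componentwise and u != v.
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))

def update_list(L_k, L_new):
    # Keep the nondominated points of L_k union L_new (a full pairwise scan;
    # as noted above, it suffices to compare only the new points [13]).
    union = L_k + L_new
    return [p for p in union if not any(dominates(q, p) for q in union)]
```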
In order to evaluate the performance of the PF-SMG algorithm and have a good benchmark
for comparison, we also introduce a Pareto-front version of the deterministic multi-gradient
algorithm (acronym PF-MG). The PF-MG algorithm is exactly the same as the PF-SMG one
except that one applies q steps of multi-gradient descent instead of stochastic multi-gradient to
each point in Line 9. Also, p is always equal to one in PF-MG.
7 Numerical experiments
7.1 Parameter settings and metrics for comparison
In our implementation, both the PF-SMG and PF-MG algorithms use the same 30 randomly generated starting points, i.e., |L_0| = 30. In both cases we set q = 2. The stepsize is initialized
differently according to the problem but always halved every 200 iterations. Both algorithms
are terminated when either the number of iterations exceeds 1000 or the number of points in
the iterate list reaches 1500.
To avoid the size of the list growing too fast, we only generate the r perturbed points for pairs of points corresponding to the m largest holes along the axes f_i, i = 1, ..., m. More specifically, given the current list of nondominated points, their function values in terms of f_i are first sorted in increasing order, for i = 1, ..., m. Let d_i^{j,j+1} be the distance between points j and j+1 along f_i. Then, the pair of points corresponding to the largest hole along the axis f_i is (j_i, j_i + 1), where j_i = argmax_j d_i^{j,j+1}.
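A sketch of this selection step, where F is assumed to hold the objective values of the current nondominated list (one row per point, one column per objective):

```python
import numpy as np

def largest_hole_pairs(F):
    pairs = []
    for i in range(F.shape[1]):
        order = np.argsort(F[:, i])      # sort points by their f_i values
        gaps = np.diff(F[order, i])      # d_i^{j, j+1} between consecutive points
        j = int(np.argmax(gaps))         # j_i = argmax_j d_i^{j, j+1}
        pairs.append((order[j], order[j + 1]))
    return pairs                         # one (j_i, j_i + 1) index pair per objective
```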
Given the fact that applying the SMG algorithm multiple times to the same point results in
different output points, whereas this is not the case for multi-gradient descent, we take p = 2 for
PF-SMG but let p = 1 for PF-MG. Then, we choose r = 5 for PF-SMG and r = 10 for PF-MG,
such that the number of new points added to the list is the same at each iteration of the two
algorithms.
To analyze the numerical results, we consider two types of widely-used metrics to measure and compare Pareto fronts obtained from different algorithms, Purity [3] and Spread [15], whose mathematical formulas are briefly recalled in Appendix B. In what concerns the Spread metric, we use two variants, the maximum size of the holes and the point spread, respectively denoted by Γ and ∆.
In our context, we evaluate the prediction loss using the smooth convex logistic function ℓ(a, y; x, b) = log(1 + exp(−y(x^⊤a + b))), which leads us to a well-studied convex objective, namely the logistic regression problem
$$\min_{x, b} \; \frac{1}{N} \sum_{j=1}^{N} \log\big( 1 + \exp(-y_j (x^\top a_j + b)) \big),$$
where N is the size of the training data. To avoid over-fitting, a regularization term (λ/2)‖x‖² is added to the objective function.
For the purpose of our study, we pick a feature of binary values and separate the given data set into two groups, with J_1 and J_2 as their index sets. An appropriate two-objective problem is formulated as min_{x,b} (f_1(x, b), f_2(x, b)), where
$$f_i(x, b) = \frac{1}{|J_i|} \sum_{j \in J_i} \log\big( 1 + e^{-y_j (x^\top a_j + b)} \big) + \frac{\lambda_i}{2} \| x \|^2. \tag{28}$$
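For illustration, a minimal sketch of a minibatch stochastic gradient of f_i in (28) with respect to (x, b); the array names A_i (feature rows a_j), y_i (labels) and the minibatch size are assumptions of the sketch.

```python
import numpy as np

def stochastic_grad_fi(x, b, A_i, y_i, lam_i, batch, rng):
    idx = rng.choice(len(y_i), size=batch, replace=False)  # minibatch from group J_i
    A, y = A_i[idx], y_i[idx]
    s = -y / (1.0 + np.exp(y * (A @ x + b)))  # d/dt log(1 + exp(-y t)) at t = x^T a + b
    grad_x = A.T @ s / batch + lam_i * x      # logistic part plus regularizer gradient
    grad_b = s.mean()                         # gradient with respect to the intercept b
    return grad_x, grad_b
```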
One can easily observe that there exist obvious differences in terms of training accuracy between the two groups for the data sets heart and svmguide3, whereas australian and german.numer have much smaller gaps. This means that classifying a new instance using the minimizer obtained from a single SG run on the whole group might lead to large bias and poor accuracy. We then constructed a two-objective problem (28) with λ_1 = λ_2 = 10⁻³ for the two groups of each data set. The PF-SMG algorithm yielded the four Pareto fronts displayed in Figure 3.
The wider Pareto fronts of data sets heart and svmguide3 coherently indicate a higher distinction between their two groups. Table 2 presents the number of iterations and the size of the Pareto front when the PF-SMG algorithm is terminated. To illustrate the trade-offs, five representative points are selected from the obtained Pareto front, and the corresponding training accuracy is evaluated for the two groups separately. The CPU times for computing these Pareto fronts using PF-SMG are also included in the table (all the experiments were run on a MacBook Pro 2019 with a 2.4 GHz Quad-Core Intel Core i5). For comparison, we ran the PF-MG algorithm using full batch gradients at each iteration. Since the computation of full batch gradients is more expensive, the CPU times of PF-MG to obtain approximately the same number of nondominated points are 0.94, 2.5, 1.3, and 2.5 times those of PF-SMG, respectively. For the data set heart, PF-MG takes slightly less time due to a smaller data size (using more accurate gradients results in higher convergence speed and more accurate Pareto fronts). A more systematic comparison between PF-MG and PF-SMG in terms of Spread and Purity is given in Section 7.3.
[Figure 3: The approximated Pareto fronts (f_1(x) versus f_2(x)) for the logistic regression problems: (a) heart; (b) australian; (c) svmguide3; (d) german.numer.]
It is observed for the groups of data sets heart and svmguide3 that the differences in training accuracy vary by more than 10% among Pareto minimizers. Two important implications of the results are: (1) given several groups of data instances for the same problem, one can evaluate their biases by observing the range of an approximated Pareto front; (2) the resulting well-approximated Pareto front provides the decision-maker with a complete trade-off of prediction accuracy across different groups. Recall that, by the definition of overall accuracy equality, the larger the accuracy disparity among different groups, the higher the bias or unfairness. To achieve a certain level of fairness in predicting new data instances, one may select a nondominated solution with a certain amount of accuracy disparity.
The way to construct a corresponding stochastic MOO problem from a deterministic MOO problem was the same as in [43]. For each of these MOO test problems, we added random noise to its variables to obtain a stochastic MOO problem, i.e.,
$$f_i(x, w) = h_i(x + w), \quad i = 1, \ldots, m,$$
where w is uniformly distributed with mean zero and interval length equal to 1/10 of the length of the simple bound interval (the latter was artificially chosen when not given in the problem description). Note that the stochastic gradients will not be unbiased estimates of the true gradients of each objective function, but rather gradients of randomly perturbed points in the neighborhood of the current point.
Figure 4 illustrates four different geometries of Pareto fronts, obtained by removing all dominated points from the union of the resulting Pareto fronts computed by the PF-SMG and PF-MG algorithms.
[Figure 4: Different geometries of Pareto fronts: (a) Convex (SP1); (b) Concave (FF1); (c) Mixed (neither convex nor concave); (d) Disconnected.]
In the next subsection, the quality of the approximated Pareto fronts obtained from the two algorithms is measured and compared in terms of the Purity and Spread metrics.
[Performance profiles comparing PF-SMG and PF-MG: (a) Purity; the remaining panels show the Spread metrics Γ and ∆.]
Overall, the PF-MG algorithm produces Pareto fronts of higher Purity than PF-SMG, which is reasonable since using accurate gradient information results in points closer to the true Pareto front. However, the Purity of the Pareto fronts resulting from PF-SMG is quite close to that of PF-MG for most of the test problems. Also, when we examine the quality of the fronts in terms of the Spread metrics (see Γ and ∆ in Table 4), their performances are comparable, which indicates that the proposed PF-SMG algorithm is able to produce well-spread Pareto fronts. For some problems, like IM1 and FF1, it is observed that PF-SMG generates nondominated points faster than PF-MG. This might be due to the fact that PF-SMG has two sources of stochasticity, both in generating the points and in applying the stochastic multi-gradient, whereas PF-MG is only stochastic in the generation of points.
On the other hand, perhaps due to the lower accuracy of stochastic multi-gradients, PF-SMG takes more iterations than PF-MG to achieve the same tolerance level. Nevertheless, suppose that the computational cost of computing the true gradients of each objective function is significantly higher than that of obtaining stochastic gradients. It is then easy to conceive scenarios in which the computational cost of PF-MG would be far higher than that of PF-SMG.
24
Problem Algorithm Purity Γ ∆ # Iter |Lk |
PF-MG 1.000 0.0332 1.4404 26 1575
ZDT1
PF-SMG 1.000 0.0666 1.6958 26 1789
PF-MG 1.000 0.9336 1.0407 48 1524
ZDT2
PF-SMG 1.000 0.0705 1.5637 32 1680
PF-MG 0.999 0.1716 1.5941 84 1524
ZDT3
PF-SMG 0.999 0.6539 1.3005 70 1544
PF-MG 1.000 0.1853 1.3520 24 1530
JOS2
PF-SMG 1.000 0.7358 1.5445 18 2271
PF-MG 0.996 0.0763 1.5419 24 1826
SP1
PF-SMG 0.880 0.2817 0.9742 102 1503
PF-MG 0.992 0.0936 0.8879 18 1581
IM1
PF-SMG 0.973 0.2591 1.0613 16 2161
PF-MG 0.982 0.0788 1.5637 46 1533
FF1
PF-SMG 0.630 0.0671 1.5701 20 1834
PF-MG 0.843 0.3800 1.5072 26 1741
Far1
PF-SMG 0.958 0.4192 1.5996 44 1602
PF-MG 1.000 24.6399 1.0053 68 1531
SK1
PF-SMG 0.999 24.6196 0.9195 48 1614
PF-MG 1.000 0.0329 0.9003 78 1505
MOP1
PF-SMG 1.000 0.1091 0.9462 14 2036
PF-MG 1.000 0.0614 1.8819 140 1527
MOP2
PF-SMG 0.841 0.0609 0.8057 124 1504
PF-MG 0.990 19.8772 1.7938 26 1530
MOP3
PF-SMG 0.863 19.8667 1.7664 50 1571
PF-MG 0.953 26.8489 1.8430 14 1813
DEB41
PF-SMG 0.920 18.8147 1.5101 18 1997
Table 4: Comparison between resulting Pareto fronts from the PF-MG and PF-SMG algorithms.
Two final notes are in order. First, the Pareto fronts of problems SK1 and MOP3 are disconnected, and hence their values of Γ are significantly larger than the others. Second, there exists a conflict between depth (Purity) and breadth (Spread) of a Pareto front: one can always tune some parameters, e.g., the number of starting points and the number of points generated per point at each iteration, to balance the Purity and Spread of the resulting Pareto fronts.
8 Conclusions
The stochastic multi-gradient (SMG) method is an extension of the stochastic gradient method
from single- to multi-objective optimization (MOO). However, even under the assumption of unbiasedness of the stochastic gradients of the individual functions, it has been observed in this paper that there exists a bias between the stochastic multi-gradient and the corresponding true multi-gradient, essentially due to the composition with the solution of a quadratic program
(see (10)). Imposing a condition on the amount of tolerated biasedness, we established sublinear convergence rates, O(1/k) for strongly convex and O(1/√k) for convex objective functions,
similar to what is known for single-objective optimization, except that the optimality gap was
measured in terms of a weighted sum of the individual functions. We realized that the main
difficulty in establishing these rates for the multi-gradient method came from the unknown
limiting behavior of the weights generated by the algorithm. Nonetheless, our theoretical results
contribute to a deeper understanding of the convergence rate theory of the classical stochastic
gradient method in the MOO setting.
To generate an approximation of the entire Pareto front in a single run, the SMG algorithm
was framed into a Pareto-front one, iteratively updating a list of nondominated points. The
resulting PF-SMG algorithm was shown to be a robust technique for smooth stochastic MOO
since it has produced well-spread and sufficiently accurate Pareto fronts, while being relatively
efficient in terms of the overall computational cost. Our numerical experiments on binary logistic
regression problems showed that solving a well-formulated MOO problem can be a novel tool
for identifying biases among potentially different sources of data and improving the prediction
fairness.
As is well known, noise reduction [16, 34, 41, 46] was studied intensively during the last decade to improve the performance of the stochastic gradient method. Hence, a relevant topic for
our future research is the study of noise reduction in the setting of the stochastic multi-gradient
method for MOO. More applications and variants of the algorithm can be further explored. For
example, we have not yet tried to solve stochastic MOO problems when the feasible region is
different from box constraints. We could also consider the incorporation of a proximal term
and in doing so we could handle nonsmooth regularizers. Other models arising in supervised machine learning, such as deep learning models, could also be framed in an MOO context. Given
that the neural networks used in deep learning give rise to nonconvex objective functions, we
would also be interested in developing the convergence rate theory for the SMG algorithm in
the nonconvex case.
A Proof of Theorem 5.3
For simplicity denote M̂ = L_g² + 2Θ(L_{∇S} + βL_{∇S}M_S). Dividing both sides by α_k and taking total expectations on both sides allows us to write
$$2\big( \mathbb{E}[S(x_k, \lambda_k)] - \mathbb{E}[S(x_*, \lambda_k)] \big) \le \frac{\mathbb{E}[\| x_k - x_* \|^2] - \mathbb{E}[\| x_{k+1} - x_* \|^2]}{\alpha_k} + \alpha_k \hat{M}.$$
Replacing k by s in the above inequality and summing over s = 1, ..., k leads to
$$\begin{aligned}
2 \sum_{s=1}^{k} \big( \mathbb{E}[S(x_s, \lambda_s)] - \mathbb{E}[S(x_*, \lambda_s)] \big) &\le \frac{1}{\alpha_1} \mathbb{E}[\| x_1 - x_* \|^2] + \sum_{s=1}^{k} \alpha_s \hat{M} + \sum_{s=2}^{k} \Big( \frac{1}{\alpha_s} - \frac{1}{\alpha_{s-1}} \Big) \mathbb{E}[\| x_s - x_* \|^2] \\
&\le \frac{\Theta^2}{\alpha_1} + \sum_{s=2}^{k} \Big( \frac{1}{\alpha_s} - \frac{1}{\alpha_{s-1}} \Big) \Theta^2 + \sum_{s=1}^{k} \alpha_s \hat{M} \\
&\le \frac{\Theta^2}{\alpha_k} + \sum_{s=1}^{k} \alpha_s \hat{M}.
\end{aligned}$$
Then, using α_s = ᾱ/√s and dividing both sides by 2k in the last inequality gives us
$$\frac{1}{k} \sum_{s=1}^{k} \big( \mathbb{E}[S(x_s, \lambda_s)] - \mathbb{E}[S(x_*, \lambda_s)] \big) \le \frac{\Theta^2}{2\bar{\alpha}\sqrt{k}} + \frac{\bar{\alpha}\hat{M}}{\sqrt{k}}, \tag{31}$$
using the fact that Σ_{s=1}^k ᾱ/√s ≤ 2ᾱ√k. For the left-hand side, one can use the inequality
$$\min_{s=1,\ldots,k} \mathbb{E}[S(x_s, \lambda_s)] - \mathbb{E}[S(x_*, \bar{\lambda}_k)] \le \frac{1}{k} \sum_{s=1}^{k} \big( \mathbb{E}[S(x_s, \lambda_s)] - \mathbb{E}[S(x_*, \lambda_s)] \big), \tag{32}$$
where λ̄_k = (1/k) Σ_{s=1}^k λ_s. The final result follows from combining (31) and (32).
B Metrics for comparing Pareto fronts
The first Spread formula calculates the maximum size of the holes of a Pareto front. Assume that algorithm a generates an approximated Pareto front with M points, indexed by 1, ..., M, to which the extreme points H(x_{k_min}) and H(x_{k_max}), indexed by 0 and M+1, are added. Denoting the maximum size of the holes by Γ, we have
$$\Gamma = \Gamma_{a,t} = \max_{i \in \{1, \ldots, m\}} \; \max_{j \in \{1, \ldots, M\}} \{ \delta_{i,j} \},$$
where δ_{i,j} = h_{i,j+1} − h_{i,j}, and we assume that the values h_{i,j} of each objective function are sorted in increasing order.
The second formula was proposed in [15] for the case m = 2 (and further extended to the case m ≥ 2 in [13]) and indicates how well the points are distributed in a Pareto front. Denoting the point spread by ∆, it is computed by the following formula:
$$\Delta = \Delta_{a,t} = \max_{i \in \{1, \ldots, m\}} \left( \frac{\delta_{i,0} + \delta_{i,M} + \sum_{j=1}^{M-1} | \delta_{i,j} - \bar{\delta}_i |}{\delta_{i,0} + \delta_{i,M} + (M-1)\bar{\delta}_i} \right),$$
where δ̄_i, i = 1, ..., m, is the average of δ_{i,j} over j = 1, ..., M−1. Note that the lower Γ and ∆ are, the better distributed the Pareto front is.
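A sketch computing both metrics from a front stored as an array F of shape (M+2, m), whose rows include the two added extreme points and whose columns are assumed sorted in increasing order (the handling of boundary gaps here is an illustrative simplification):

```python
import numpy as np

def spread_metrics(F):
    deltas = np.diff(F, axis=0)            # delta_{i,j} = h_{i,j+1} - h_{i,j}
    Gamma = deltas.max()                   # maximum hole size over all consecutive gaps
    inner = deltas[1:-1]                   # delta_{i,j} for j = 1, ..., M-1
    dbar = inner.mean(axis=0)              # bar(delta)_i per objective
    num = deltas[0] + deltas[-1] + np.abs(inner - dbar).sum(axis=0)
    den = deltas[0] + deltas[-1] + inner.shape[0] * dbar
    return Gamma, (num / den).max()        # Gamma and Delta
```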
References
[1] F. B. Abdelaziz. L’efficacité en Programmation Multi-Objectifs Stochastique. PhD thesis, Université
de Laval, Québec, 1992.
[2] F. B. Abdelaziz. Solution approaches for the multiobjective stochastic programming. European J.
Oper. Res., 216:1–16, 2012.
[3] S. Bandyopadhyay, S. K. Pal, and B. Aruna. Multiobjective GAs, quantitative indices, and pattern classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 34:2088–2099, 2004.
[4] S. Bandyopadhyay, S. Saha, U. Maulik, and K. Deb. A simulated annealing-based multiobjective
optimization algorithm: AMOSA. IEEE Transactions on Evolutionary Computation, 12:269–283,
2008.
[5] S. Barocas, M. Hardt, and A. Narayanan. Fairness in machine learning. NIPS Tutorial, 1, 2017.
[6] R. Berk, H. Heidari, S. Jabbari, M. Kearns, and A. Roth. Fairness in criminal justice risk assessments:
The state of the art. Sociological Methods & Research, pages 1–42, 2018.
[7] H. Bonnel, A. N. Iusem, and B. F. Svaiter. Proximal methods in vector optimization. SIAM J.
Optim., 15:953–970, 2005.
[8] L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning.
SIAM Rev., 60:223–311, 2018.
[9] R. Caballero, E. Cerdá, M. Munoz, and L. Rey. Stochastic approach versus multiobjective approach
for obtaining efficient solutions in stochastic multiobjective programming problems. European J.
Oper. Res., 158:633–648, 2004.
[10] V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky. The convex geometry of linear inverse
problems. Found. Comput. Math., 12:805–849, 2012.
[11] C. C. Chang and C. J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on
Intelligent Systems and Technology (TIST), 2:27, 2011.
[12] K. L. Chung. On a stochastic approximation method. Ann. Math. Statist., 25:463–483, 1954.
[13] A. L. Custódio, J. A. Madeira, A. I. F. Vaz, and L. N. Vicente. Direct multisearch for multiobjective
optimization. SIAM J. Optim., 21:1109–1140, 2011.
[14] I. Das and J. E. Dennis. Normal-boundary intersection: A new method for generating the Pareto
surface in nonlinear multicriteria optimization problems. SIAM J. Optim., 8:631–657, 1998.
[15] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan. A fast and elitist multiobjective genetic algorithm:
NSGA-II. IEEE Transactions on Evolutionary Computation, 6:182–197, 2002.
[16] A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support
for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems,
pages 1646–1654, 2014.
[17] J. A. Désidéri. Multiple-gradient descent algorithm (MGDA) for multiobjective optimization. C. R.
Math. Acad. Sci. Paris, 350:313–318, 2012.
[18] J. A. Désidéri. Multiple-gradient descent algorithm for Pareto-front identification. In Modeling,
Simulation and Optimization for Science and Technology, pages 41–58. Springer, Dordrecht, 2014.
[19] L. G. Drummond and A. N. Iusem. A projected gradient method for vector optimization problems.
Comput. Optim. Appl., 28:5–29, 2004.
[20] L. G. Drummond, F. M. P. Raupp, and B. F. Svaiter. A quadratically convergent Newton method
for vector optimization. Optimization, 63:661–677, 2014.
[21] L. G. Drummond and B. F. Svaiter. A steepest descent method for vector optimization. J. Comput.
Appl. Math., 175:395–414, 2005.
[22] M. Ehrgott. Multicriteria Optimization, volume 491. Springer Science & Business Media, Berlin,
2005.
[23] J. Fliege, L. G. Drummond, and B. F. Svaiter. Newton’s method for multiobjective optimization.
SIAM J. Optim., 20:602–626, 2009.
[24] J. Fliege and B. F. Svaiter. Steepest descent methods for multicriteria optimization. Math. Methods
Oper. Res., 51:479–494, 2000.
[25] J. Fliege, A. I. F. Vaz, and L. N. Vicente. Complexity of gradient descent for multiobjective opti-
mization. to appear in Optim. Methods Softw., 2018.
[26] J. E. Freund. Mathematical Statistics. Prentice-Hall, Englewood Cliffs, N.J., 1962.
[27] E. H. Fukuda and L. M. G. Drummond. A survey on multiobjective descent methods. Pesquisa
Operacional, 34:585–620, 2014.
[28] S. Gass and T. Saaty. The computational algorithm for the parametric objective function. Nav.
Res. Logist. Q., 2:39–45, 1955.
[29] M. Gendreau, O. Jabali, and W. Rei. Chapter 8: Stochastic vehicle routing problems. In Vehicle
Routing: Problems, Methods, and Applications, Second Edition, pages 213–239. SIAM, 2014.
[30] A. M. Geoffrion. Proper efficiency and the theory of vector maximization. J. Math. Anal. Appl.,
22:618–630, 1968.
[31] W. J. Gutjahr and A. Pichler. Stochastic multi-objective optimization: a survey on non-scalarizing
methods. Annals of Operations Research, 236:475–499, 2016.
[32] Y. V. Haimes. On a bicriterion formulation of the problems of integrated system identification and
system optimization. IEEE Transactions on Systems, Man, and Cybernetics, 1:296–297, 1971.
[33] M. Hardt, E. Price, and N. Srebro. Equality of opportunity in supervised learning. In Advances in
neural information processing systems, pages 3315–3323, 2016.
[34] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduc-
tion. In NIPS, pages 315–323, 2013.
[35] A. J. Kleywegt, A. Shapiro, and T. Homem de Mello. The sample average approximation method
for stochastic discrete optimization. SIAM J. Optim., 12:479–502, 2002.
[36] S. Liu and L. N. Vicente. Accuracy and fairness trade-offs in machine learning: A stochastic multi-
objective approach. ISE Technical Report 20T-016, Lehigh University, 2020.
[37] K. Miettinen. Nonlinear Multiobjective Optimization, volume 12. Springer Science & Business Media,
New York, 2012.
[38] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to
stochastic programming. SIAM J. Optim., 19:1574–1609, 2009.
[39] J. Oyola, H. Arntzen, and D. L. Woodruff. The stochastic vehicle routing problem, a literature
review, part i: models. EURO Journal on Transportation and Logistics, 7:193–221, 2018.
[40] L. R. Lucambio Pérez and L. F. Prudente. Nonlinear conjugate gradient methods for vector opti-
mization. SIAM J. Optim., 28:2690–2720, 2018.
[41] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM J.
Control Optim., 30:838–855, 1992.
[42] S. Qu, M. Goh, and B. Liang. Trust region methods for solving multiobjective optimisation. Optim.
Methods Softw., 28:796–811, 2013.
[43] Q. Mercier, F. Poirion, and J. A. Désidéri. A stochastic multiple gradient descent algorithm. European J. Oper. Res., 271:808–817, 2018.
[44] H. Robbins and S. Monro. A stochastic approximation method. Ann. Math. Statist., 22:400–407,
1951.
[45] J. Sacks. Asymptotic distribution of stochastic approximation procedures. Ann. Math. Statist.,
29:373–405, 1958.
[46] S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter. Pegasos: Primal estimated sub-gradient
solver for svm. Math. Program., 127:3–30, 2011.
[47] A. Shapiro. Monte Carlo sampling methods. Handbooks in Operations Research and Management
Science, 10:353–425, 2003.
[48] S. Verma and J. Rubin. Fairness definitions explained. In 2018 IEEE/ACM International Workshop
on Software Fairness (FairWare), pages 1–7. IEEE, 2018.
[49] K. D. Villacorta, P. R. Oliveira, and A. Soubeyran. A trust-region method for unconstrained multi-
objective problems with applications in satisficing processes. J. Optim. Theory Appl., 160:865–889,
2014.
[50] B. Woodworth, S. Gunasekar, M. I. Ohannessian, and N. Srebro. Learning non-discriminatory
predictors. In Conference on Learning Theory, pages 1920–1953, 2017.
[51] M. B. Zafar, I. Valera, M. Gomez Rodriguez, and K. P. Gummadi. Fairness constraints: Mechanisms
for fair classification. In Artificial Intelligence and Statistics, pages 962–970, 2017.
[52] R. Zemel, Y. Wu, K. Swersky, T. Pitassi, and C. Dwork. Learning fair representations. In Interna-
tional Conference on Machine Learning, pages 325–333, 2013.