Data Driven EJOR
Data Driven EJOR
Decision Support
a r t i c l e i n f o a b s t r a c t
Article history: Robust optimization (RO) has been broadly utilized for decision-making under uncertainty; however, as
Received 22 May 2020 a key issue in RO the design of the uncertainty set could exert significant influence on both the conser-
Accepted 16 November 2020
vatism of solutions and tractability of induced problems. In this paper, we propose a novel multiple kernel
Available online 23 November 2020
learning (MKL)-aided RO framework for data-driven decision-making, by developing an efficient approach
Keywords: for uncertainty set construction from data based on one-class support vector machine. The learnt poly-
Uncertainty modelling hedral uncertainty set not only achieves a compact encircling of empirical data, which alleviates the pes-
Robust optimization simism and reduces the gap between the model and real-world performance, but also ensures structural
Uncertainty set sparsity and computational tractability. The data-driven RO framework enables a handy adjustment of the
Multiple kernel learning conservatism and complexity by simply manipulating two hyper-parameters, thereby being user-friendly
Data-driven decision-making in practice. In addition, the proposed framework applies to adjustable RO (ARO) with the extended affine
decision rule adopted, which helps improving the optimization performance without too much additional
effort. Numerical and application case studies demonstrate the effectiveness of the proposed data-driven
RO framework.
© 2020 Elsevier B.V. All rights reserved.
1. Introduction tive risk evaluation and control, precise knowledge about distri-
bution of uncertainty is necessitated. Unfortunately, in most re-
Decision-making by solving a disciplined optimization problem alistic situations, even knowing the probability distribution is an
with some certain criterion optimized is a common demand aris- unattainable luxury. To alleviate such a concern, robust optimiza-
ing from diverse application areas in science and engineering. In a tion (RO) has been widely used as an effective non-probabilistic al-
real-world environment, parameters in optimization problems are ternative that only requires minimal distributional information. A
almost always influenced by uncertain factors, which make deter- comprehensive summary of developments and applications of RO
ministic models unreliable to some extent. It has been recognized can be found in the monograph (Ben-Tal et al. (2009)) and sev-
that even a small perturbation in nominal parameter may lead to eral review papers (Bertsimas, Brown, & Caramanis, 2011; Gabrel,
a strategy that is completely infeasible (Ben-Tal, El Ghaoui, & Ne- Murat, & Thiele, 2014). The general formulation of RO problems is
mirovski, 2009; Bertsimas & Thiele, 2006). Thus, modeling uncer- expressed as:
tainty has been of common interest across distinct fields. From
min f (x )
the celebrated definition by Camerer and Weber (1992), uncer- x∈R d (1)
tainty can be specified into risk that refers to probabilistic uncer- s.t. g( x ; u ) ∈ A , ∀u ∈ U ,
tainty with a known distribution, and ambiguity that refers to un-
certainty with an unknown distribution. The earliest optimization where some restrictions or constraints are denoted by g(x; u ) :
methods under uncertainty can be traced back to the pioneering Rd × Rn → belonging to A ⊂ for some space , u ∈ Rn is the
work of Dantzig (1955) on stochastic programming and Charnes uncertainty, and U ⊂ Rn is termed as the uncertainty set. Being im-
and Cooper (1959) on chance-constrained programming. For effec- mune against all probable realizations of u in D, the robust formu-
lation (1) has been widely adopted to ensure satisfaction of con-
straints on critical factors such timespan, capacity and safety under
∗
Corresponding author. all plausible scenarios of uncertainty (Balcik & Yanıkoğlu, 2020; Ca-
E-mail addresses: [email protected] (B. Han), c-shang@tsinghua. ballero, Lunday, & Uber, 2021; Dai et al., 2019; Jakubovskis, 2017;
edu.cn (C. Shang), [email protected] (D. Huang). Moret, Babonneau, Bierlaire, & Maréchal, 2020).
https://doi.org/10.1016/j.ejor.2020.11.027
0377-2217/© 2020 Elsevier B.V. All rights reserved.
B. Han, C. Shang and D. Huang European Journal of Operational Research 292 (2021) 1004–1018
However, (1) involves an infinite number of constraints, which (2016) proposed a correlated polyhedral uncertainty set on the
poses challenges in solving the problem. The design of the uncer- basis of estimated correlation matrix of uncertain coefficients.
tainty set U is a critical issue in RO that shall be guided by the In real-life applications, however, classical uncertainty sets still
following principles. First, an elaborate parameterization is neces- show prominent limitations in uncertainty description. Clearly,
sary to ensure the computational tractability, i.e. the possibility of their geometric structures are fixed a priori, resulting in a lim-
converting (1) into a tractable deterministic problem. On the other ited representation capability, while the underlying realistic dis-
hand, U shall cover all possible realizations of u accurately without tribution is diverse and may be extremely complicated, especially
unnecessary coverage, such that over-pessimistic solutions can be asymmetric. When faced with unknown complicated uncertain-
avoided. Concurrently considering both aspects has been a major ties, there is no effective and systematic guideline for the user to
streamline in RO community studied for decades. One of the ear- choose a suitable type of uncertainty set and determine parameters
liest attempts can be traced back to the box uncertainty set pro- thereof. Although the budget parameter can be specified based
posed by Soyster (1973), which assumes each uncertain parameter on partial distributional information to establish probabilistic guar-
to have independent perturbations within an interval: antees (Guzman, Matthews, & Floudas, 2016; Li, Tang, & Floudas,
2012), a poor specification of “geometry-related” parameters, such
ˆ ◦ ξ
U∞ = u = ū + u ξ ∞ , (2) as T in (5), can still lead to unsatisfactory performance.
where ū ∈ Rn represents the nominal value of uncertainty, u ˆ ∈ Rn Nowadays, we are witnessing a big data era where the data
denotes the “magnitude”, and ξ ∈ Rn is the normalized uncertainty. availability explodes and massive amounts of data are routinely
The element-wise product between two equally sized vectors is collected in many fields. This has spawned the paradigm of data-
denoted by ◦. is the so-called “budget” responsible for control- driven RO that straightforwardly injects data information into a ro-
ling the size of uncertainty set. Although U∞ secures tractability for bust decision-making schema (Bertsimas, Gupta, & Kallus, 2018;
a large class of RO problems, it has been recognized as being too Bertsimas & Thiele, 2006; Shang & You, 2019a). Data-driven RO
conservative for practical usage. Later on, ellipsoidal uncertainty seeks to build a data-driven uncertainty set U from historical
set has been developed independently by Ben-Tal and Nemirovski data D, such that distributional information of the uncertainty can
(1998, 1999), El-Ghaoui and Lebret (1997) and El-Ghaoui, Oustry, be seamlessly encompassed in U and over-conservatism can be
and Lebret (1998): reduced without tedious manual parameter tuning. A variety of
learning-based methods for uncertainty set construction have been
ˆ ◦ ξ
U2 = u = ū + u ξ 2 , (3) proposed, among which the simplest form is the hyper-rectangular
set constructed in a data-driven manner to cover all realizations
based on which the robust counterpart (RC) of a robust linear pro-
of uncertainty with high probability (Margellos, Goulart, & Lygeros,
gram (LP) can be translated into a second-order conic (SOC) pro-
2014). In Campbell and How (2015), a Bayesian nonparametric
gram. In comparison with the box set, the ellipsoidal uncertainty
method is presented under the assumption that the unknown dis-
set has better representation capability at the price of moderately
tribution belongs to a family of Dirichlet process Gaussian mix-
increased computational complexity. Alternatively, the polyhedral
tures, which was later developed and applied to data-driven adap-
uncertainty set is proposed by Bertsimas, Pachamanova, and Sim
tive robust optimization (Ning & You, 2017a; Ning & You, 2017b).
(2004) and Bertsimas and Sim (2003, 2004):
Other related works also develop alternative construction strate-
ˆ ◦ ξ
U1 = u = ū + u ξ 1 . (4) gies of data-driven uncertainty set, e.g. Zhang, Grossmann, Sun-
daramoorthy, and Pinto (2016a), Crespo, Colbert, Kenny, and Giesy
Uncertainty sets U1 , U2 , and U∞ altogether act as basic modeling (2019), etc. Based on these basic modeling tools, Hong, Huang, and
tools in RO, based on which intersections and unions thereof have Lam (2017) considered a class of uncertainty sets constructed as
been developed to promote the flexibility in uncertainty descrip- combinations of basic geometric shapes and investigated the prob-
tion, such as “interval + ellipsoidal” model, “polyhedral + ellip- ability guarantee in a data-driven scheme. An inclusive review on
soidal” model, etc (Ben-Tal & Nemirovski, 20 0 0; Bertsimas et al., the latest developments of data-driven RO can be found in Ning
2004; Li, Ding, & Floudas, 2011). and You (2019).
All aforesaid uncertainty sets are norm-based. Hence, an under- Amongst different formulations of U, polyhedral uncertainty
lying assumption is that uncertainty tends to spread symmetrically sets feature a desirable balance between the flexibility in leverag-
and radially from the center. Meanwhile, they also fail to account ing data information and the computational tractability of RC prob-
for correlations among different variables. To address this issue, lems. Zhang, Jin, Feng, and Rong (2018) proposed a heuristic to at-
Bertsimas and Sim (2004) developed an extended expression tain a data-based polytope by progressively adding cutting planes.
where correlated uncertainties are disentangled: Shang, Huang, and You (2017) proposed a systematic approach to
polyhedral set construction using support vector data description
u(ξ ) = ū + Tξ , (5)
(SVDD) with a tailored weighted generalized intersection kernel
where ξ ∈ Rmstands for the underlying independent uncertainties, (WGIK), which has found further applications in robust model pre-
and T ∈ Rn×m is the transformation matrix mapping m underlying dictive control (Shang & You, 2019b), energy system operations
uncertainty sources to n parameters. This approach was adopted (Shen, Zhao, Du, Zhong, & Qian, 2020), supply chain management
by Ferreira, Barroso, and Carvalho (2012) in a demand response (Mohseni & Pishvaee, 2020), and irrigation systems (Shang, Chen,
model, where {ū, T} are determined based on principal component Stroock, & You, 2020). In Ning and You (2018), a kernel density es-
analysis (PCA) and minimum power decomposition (MPD). Yuan, timation (KDE) approach was put forward. Notwithstanding their
Li, and Huang (2016) generalized classical symmetric uncertainty improved capabilities in dealing with data correlation and asym-
sets under (5) and established explicit formulations of associated metry, the performance of these non-parametric approaches criti-
RCs. A variety of alternatives have also been proposed to capture cally relies on projection directions, which are specified with PCA
the correlation and the asymmetry of uncertainties. For exam- as a common heuristic. However, whether such a heuristic yields
ple, Chen, Sim, and Sun (2007) proposed to construct U using the best performance is questionable.
forward and backward deviations. Natarajan, Pachamanova, and This work aims to address these issues by developing a novel
Sim (2008) developed an asymmetry-robust value-at-risk (VaR) non-parametric polytope learning approach that is capable of op-
measure to take into consideration asymmetries in the distribu- timally selecting projection directions while learning a compact
tion of portfolio returns. Jalilvand-Nejad, Shafaei, and Shahriari polyhedral set. In this way, it alleviates the need for tedious pa-
1005
B. Han, C. Shang and D. Huang European Journal of Operational Research 292 (2021) 1004–1018
rameter tuning and further assists decision-making by solving a space from the origin with a hyperplane by solving the following
data-driven RO problem. As an important ingredient in statistical problem1 :
learning theory, the multiple kernel learning (MKL) framework is
1
N
1
adopted in this work for support estimation, which allows for con- min w 2
−ρ + ξi
currently learning both an optimal combination of “candidate” ker- w,ρ ,ξ 2 Nν
i=1 (6)
nels and the induced uncertainty set. From a unified viewpoint s.t. wT φ (ui ) ρ − ξi , i = 1, . . . , N
of MKL, the aforesaid approaches based on SVDD (Shang et al., ξi 0 , i = 1 , . . . , N ,
2017) and KDE (Ning & You, 2018) can be regarded as utilizing all
where φ (· ) represents the feature map from input space X to
candidate kernels as a fixed combination, in the sense that they
the high dimensional feature space H, and the associated inner
are essentially single-kernel methods and have limited representa-
product can be calculated by evaluating its associated kernel func-
tion capabilities. Henceforth, the flexibility of MKL in learning op-
tion K (u, v ) = φ (u ), φ (v ) . Here, slack variables {ξi } allow some
timal kernels typically gives rise to better modeling power than
data points to be misclassified by the hyperplane (Huang, Shi, &
traditional single-kernel methods (Lanckriet, Cristianini, Bartlett,
Suykens, 2013). The regularization parameter ν ∈ (0, 1] controls
Ghaoui, & Jordan, 2004; Rakotomamonjy, Bach, Canu, & Grandvalet,
the tolerance of data points being “misclassified”, and its impli-
2008), and thus has the potential of yielding improved estima-
cations in uncertainty set constructions will be discussed in the
tion of the high-density region as a data-driven uncertainty set.
sequel. Upon solving (6), the decision function is explicitly written
In this paper, we propose a novel data-driven RO framework aided
as:
by MKL-based one-class SVM (OC-SVM), which has the following
advantages. y ( u ) = wT φ ( u ) − ρ f ( u ) − ρ , (7)
• By leveraging data information, a compactly enclosing polyhe- and the induced decision region as the data-driven uncertainty set
dral uncertainty set with optimal kernel functions, viz. the MKL can be expressed as:
uncertainty set, can be automatically learnt, which adapts to U ( D ) = {u|y ( u ) 0} = {u| f ( u ) − ρ 0}. (8)
data distribution with asymmetry without resorting to tedious
parameter tuning. Hence, the over-conservatism and the gap Problem (6) is essentially based on a single kernel function
between the model and the realistic performance can be re- K (·, · ), which shall be specified by the user prior to solving (6). Re-
duced. Besides, it yields a tractable low-complexity RC problem cent experiences in machine learning have shown that using mul-
since optimally selected kernel functions are sparse. tiple kernel functions can enhance the representation capability of
• The RO framework enables a convenient adjustment of both the model and the prediction performance (Lanckriet et al., 2004;
the conservatism and complexity, thereby being user-friendly Rakotomamonjy et al., 2008; Xu, Tsang, & Xu, 2013). Typically, the
in practice. Specifically, it bears clear statistical interpretations kernel K (u, v ) can be expressed as a convex combination of multi-
since the proportion of outliers can be manipulated, which not ple basis kernels {Km (·, · )}:
only helps rejecting extremal samples but also allows to adjust
M
1006
B. Han, C. Shang and D. Huang European Journal of Operational Research 292 (2021) 1004–1018
optimal kernel functions can be automatically selected, thereby Clearly, the dual formulation (17) of the proposed MKL-based
eventually leading to a compact representation of U. In this pa- OC-SVM has a neatly symmetrical structure with its primal formu-
per, we adopt the unified MKL framework proposed in Xu et al. lation (12). From the perspective of machine learning, the objective
(2013) to implement OC-SVM for uncertainty set construction. By of (17) is to maximize the “margin” γ while considering some “er-
replacing w and φ in (6) with (11) and considering the constraints rors” from M basis kernels, which are denoted by slack variables
on {πm }, we formulate the MKL-based OC-SVM as follows: {ζm } (also the Lagrangian multipliers). The regularization effect of
M N μ can also be evidenced from this dual formulation, because it
min 1
2
1
πm wm
2
− ρ + N1ν ξi gives a balance between two conflicting terms in the objective (Xu
{wm },π,ρ ,ξ m=1 i=1
M et al., 2013). Much as μ has a regularization effect on the spar-
s.t. wTm φm (ui ) ρ − ξi , i = 1, . . . , N sity of {πm }, ν is responsible for controlling the sparsity of {αi },
(12)
m=1 which can be observed from constraints on {αi } in (17). Note that
ξi 0 , i = 1 , . . . , N (17) is a quadratically constrained quadratic program (QCQP). If all
M
πm = 1, 0 πm 1
, m = 1, . . . , M. kernel functions {Km } are positive semi-definite, the dual problem
Mμ
m=1 is also convex, which can be handled by general-purpose convex
The regularization parameter μ on {πm } can be interpreted as fol- optimization softwares.
lows. To ensure the feasibility of the constraints on πm , there is Suppose we have attained both the primal
and dual optimal so-
an inherent requirement that μ ∈ (0, 1]. If μ > 1, feasible weights lutions, denoted as {wm }, π , ρ , ξ and α , γ , ζ , respectively.
{πm } no longer exist. If μ < 1/M, it follows that 1/Mμ > 1, render- Then the MKL uncertainty set can be expressed in terms of α , π
ing πm 1/Mμ redundant and thus the regularization effect on πm and kernel functions {Km } by substitution into (11) and (13):
vanishes. In this case, (12) reduces to the traditional formulation of
N
M
MKL-based OC-SVM (Han, Shang, Yang, & Huang, 2019; Rakotoma-
monjy et al., 2008), which tends to select the fewest kernels and Uν,μ (D ) = {u|y(u ) 0} = u α
π Km (u, ui ) ρ ,
i=1 i m=1 m
lead to over-fitting to some degree (Xu et al., 2013). If μ > 1/M,
sparsity within {πm } will be discouraged. This can be interpreted (18)
as that, when some {πm } are exactly zero, the remaining ones
tend to have large values to fulfill the equality m πm = 1, while where subscripts ν and μ highlight the dependence of U on hyper-
πm 1/Mμ penalizes “over-weighting” of decisive kernel functions parameters {ν, μ}. In the sequel, we will use U (D ) for brevity
and thus avoids undue sparsity. As an extreme case, when μ = 1, when no confusion is made.
all kernel weights are enforced to be 1/M, which amounts to tak-
ing the average of all kernels at hand. 2.2. The MKL-based piecewise linear uncertainty set
Note that (12) is a convex program since wm 2 /πm in the ob-
jective is a convex function, and it can be easily verified that strong Following the convention of classic SVM theory, the data sam-
duality is admitted (Boyd & Vandenberghe, 2004), so we can solve ples with nonzero dual variables {αi } are termed as support vec-
it from the dual. The Lagrangian of (12) can be written as: tors (SVs) because the others with {αi = 0} do not contribute to
1 1 1 the final expression of U (D ). The index set of SVs is denoted as
L= wm 2
−ρ + ξi
2 m πm Nν (Schölkopf et al., 2002):
i
SV = {i|αi > 0, ∀i }. (19)
− αi wTm φm (ui ) − ρ + ξi − βi ξi
i m i
Among the SVs, those data samples with nonzero dual vari-
ables {βi } (equivalently, {0 < αi < 1/Nν}) have m wm φm (ui ) =
T
1
−γ πm − 1 − η m πm + ζm πm − . ρ , ∀i because of the complementary slackness conditions αi
Mμ
m m m wmT φm (ui ) − ρ + ξi = 0, ∀i and βi ξi = 0, ∀i, and thus lie
m
By setting it to zero one obtains: right on the boundary of the uncertainty set U (D ), termed as
boundary support vectors (BSVs) (Schölkopf et al., 2002):
∂ L/∂ wm = 0 ⇒ wm = αi πm φm (ui ), ∀m (13)
i BSV = {i ∈ SV |βi > 0, ∀i } = {i|0 < αi < 1/Nν, ∀i }. (20)
∂ L/∂ ρ = 0 ⇒ αi = 1 (14) There is also a similar interpretation of complementary slack-
i ness for kernel functions. The complementary slackness condition
ηm π = 0 indicates that, for every basis kernel, either η = 0 or
m m
1 πm = 0 holds. Analogous to data samples with {αi = 0}, kernel
∂ L/∂ ξi = 0 ⇒ αi + βi = , ∀i (15)
Nν functions with {πm = 0} do not appear in U (D ), thereby giving
1007
B. Han, C. Shang and D. Huang European Journal of Operational Research 292 (2021) 1004–1018
obtained by Proof. The proof of Proposition 1 in Shang et al. (2017) applies mu-
(i )
= qTm ui /cm ·
tatis mutandis to the present setup. We denote by zm
1 κ , ∀ui ∈ D, m = 1, . . . , M the scaled data projections in shorthand,
SK = m− α α Km (ui , u j ) = γ − ζm , ∀m .
2 i, j i j and define the residual
1 T
Upon training the model, a significant proportion of data sam- m = 1 − max qm ( ui − u j )
ui ,u j ∈D cm · κ
ples and basis kernels can be discarded, with only SVs and SKs re-
sm /cm
tained necessarily. This property is pivotal to ensure the practical > 1− 0, ∀m
applicability of MKL-based OC-SVM, since the MKL uncertainty set maxm =1,··· ,M sm /cm
(18) simplifies to According to (23) and (24). Then we could construct valid upper
(i )
, ∀i as zm = cm1·κ maxui ∈D qTm ui + m /2 and
and lower bounds for zm
Uν,μ (D ) = u α
π Km (u, ui ) ρ .
(21) zm = 1
c m ·κ minui ∈D qTm ui − m /2. It is easy to see that Km = K+
m+
i∈SV i m∈SK m (i ) ( j)
m , where Km has entries Km (ui , u j ) = min{zm − zm , zm − zm } and
K− + +
(i ) ( j)
As a matter of fact, some off-the-shelf optimization toolboxes, K−
m has entries −
Km ( ui , u j ) = min{zm − zm , zm − zm }. Denoting by
such as cvx (Grant, Boyd, & Ye, 2008), can simultaneously solve (i ) (i )
wm = zm − zm > 0, it turns out that K+
m is a kernel matrix of con-
the primal and dual problems, and yield values of {αi } and {πm }.
ventional intersection kernel in one dimension and hence satisfies
In this case, we only need to calculate ρ by using the following
K+
m 0 (Odone, Barla, & Verri, 2005). The positive-definiteness of
property of all the BSVs
K−
m can be established in a similar fashion. Therefore, it holds that
Km = K+m + Km 0, ∀m.
−
αi πm Km (ui , u j ) − ρ = 0, ∀ j ∈ BSV
i∈SV m∈SK
Following the spirit of MKL, we intend to choose as many di-
to compute rections {qm }M as possible in the data space to yield sufficient
m=1
1 representative basis kernels for further selection. As each basis ker-
ρ = αi πm Km (ui , u j ).
|BSV | j∈BSV i∈SV m∈SK
nel (22) is designed to capture the information of D along a spe-
cific direction qm , we enforce MKL to “observe” the dataset in a
There are a variety of popular kernel functions in the MKL set- comprehensive manner in the hope that the most useful directions
ting, such as the Gaussian kernel, polynomial kernel, etc., which in- can be automatically selected to yield a desirable combined ker-
duce fairly good decision boundaries. Nonetheless, they cannot be nel. Based on such a motivation, there are several ways to choose
used to construct the uncertainty set for RO owing to their lack of M representative directions {qm }. A trivial idea is to sample {qm }
computational tractability. To this end, we propose a novel piece- randomly from the n-dimensional input space and then normalize
wise linear concave kernel structure for MKL to serve as the can- them to unit length. An alternative deterministic way is to let them
didate basis kernels in (9), i.e., be equispaced on the n-dimensional sphere, based on the follow-
qTm (u − v ) ing polar coordinate expression in n dimensions:
Km (u, v ) = 1 − . (22) ⎧
cm · κ ⎪qm,1 = cos(ψm,n−1 ) · · · cos(ψm,2 ) cos(ψm,1 )
⎪
⎪
⎪
Each basis kernel (22) is parameterized by qm , cm and κ . Intuitively ⎨qm,2 = cos(ψm,n−1 ) · · · cos(ψm,2 ) sin(ψm,1 )
.. ,
speaking, Km (u, v ) evaluates the negative “distance” between two .
⎪
⎪
data points along a particular projection direction qm , which is a ⎪
⎪qm,n−1 = cos(ψm,n−1 ) sin(ψm,n−2 )
⎩
unit direction vector of the mth basis kernel. The normalization qm,n = sin(ψm,n−1 )
factor cm describes the “dispersion” degree of D along qm , and is
where arguments {ψm, j } are evenly distributed in [0, π ) for a
thus used to “normalize” data along qm , which can be taken as the
given j ∈ {1, . . . , n − 1}. If we choose P directions along each di-
span, i.e.,
mension j, there will be M = P n−1 basis kernels in total2 . This de-
cm := max qTm (ui − u j ). (23) terministic method is adopted in this work. Note that, although our
ui ,u j ∈D
approach also requires user-defined projection directions, it has
However, the maximum operator is known to be sensitive to out- better flexibility in learning a good sparse combination of projec-
liers. Instead, a robust estimation of the span can be obtained tion directions from sufficiently representative candidates.
based on quantiles: We point out that the combination of our proposed basis kernel
cm := maxi {qTm ui } − max1i − {qTm ui }, (22) generalizes the weighted generalized intersection kernel (WGIK)
proposed by Shang et al. (2017), which can be expressed as:
where maxi {·} stands for the -upper quantile of the set {·}, and
can be set as a small number, for example, 0.01. Note that the
n
K ( u, v ) = κ − W (u − v )
positive-definiteness of Km (u, v ) remains a concern, and a scaling 1
m=1
factor κ > 0 is introduced such that the positive-definiteness of all
basis kernels can be ensured provided that κ is sufficiently large.
n qT ( u − v )
= κ 1 − m , (25)
Note that, to ensure positive-definiteness of all basis kernels, the cm · κ
m=1
scaling factor κ can be set as
where W is a weighting matrix that can be constructed as the
sm
κ > max , sm = max qTm (ui − u j ), (24) inverse of covariance matrix , {qm } are defined by eigenvector
m=1,...,M cm ui ,u j ∈D
of , and cm is the normalization coefficient along qm . Obviously,
according to the following proposition. WGIK (25) is identical to the average of n basis kernels in the form
of (22), and hence the single WGIK-based uncertainty set construc-
Proposition 1 (Positive-Definiteness of Piecewise Linear Ker- tion method in Shang et al. (2017) can be considered as a spe-
nels). Assume the scaling factor κ in (22) is set according to cial case of our proposed MKL-based formulation (12) with μ = 1.
(24), then all the basis kernel matrices {Km } with elements
Km (ui , u j ), ∀ui , u j ∈ D, m = 1, . . . , M satisfy 2
To avoid duplicate directions, it is recommended to let P be an odd number in
Km 0, ∀m = 1, . . . , M. high-dimensional situations.
1008
B. Han, C. Shang and D. Huang European Journal of Operational Research 292 (2021) 1004–1018
Fig. 1. Uncertainty sets learnt by single WGIK-based SVDD and MKL-based OC-SVM (ν = 0.01, μ = 0.05). Data points marked as pentagrams are the SVs. Green line segments
in the right diagram of (a) represent the eigen-directions of the data covariance matrix, and their lengths represent standard deviations of the data projection on the
corresponding directions. Red line segments in the right diagram of (b) represent the projection directions of SKs selected by MKL, and their lengths stand for kernel
weights. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
As a matter of fact, WGIK can be viewed as “observing” an n- Obviously, U (D ) is a polytope, which warrants the computa-
dimensional uncertainty dataset from at most n( M) orthogonal tional tractability of a broad class of RO problems, as to be dis-
directions defined by n basis kernels that are fixed a priori. This cussed in detail in the next section. The hyper-parameter ν inherits
leads to considerable restrictiveness in learning the uncertainty set the desirable interpretation of single-kernel OC-SVM, as described
from data, especially when the uncertainty distribution is strongly by the following proposition.
correlated and asymmetrical (As to be shown in Fig. 1(a), it may
produce much unnecessary coverage). In contrast, the MKL scheme Proposition 2 (Relationship between ν and the percentage of out-
for uncertainty set construction possesses significantly enhanced liers). Assuming the solution to (12) with 0 < ν 1 exists. Then ν is
flexibility. an upper bound on the percentage of outliers.
With a huge number of candidate piecewise linear kernels used,
only a fraction of them will be retained to build the combined ker- Proof. See Schölkopf, Platt, Shawe-Taylor, Smola, and Williamson
nel, with the majority discarded. In other words, most coefficients (2001) and Schölkopf et al. (2002).
{πm } of basis kernels will be exactly zero, and it turns out that
the MKL uncertainty set U (D ) can be represented by only SVs and Proposition 2 indicates that, the MKL uncertainty set Uν,μ (D )
SKs, which admits a concise representation and brings computa- encloses at least 100 × (1 − ν )% of N training samples. Henceforth,
tional convenience. Substituting (22) into (21) yields the following ν can be interpreted as the empirical confidence level of Uν,μ (D ).
RO-compatible MKL uncertainty set: The “volume” of Uν,μ (D ) can be controlled by ν with clear sta-
tistical implications. This not only endows Uν,μ (D ) with desirable
robustness against extremal outliers, but also renders the number
U ( D ) = u α
π Km (u, ui ) ρ
of outward points explicitly adjustable, which is intimately related
i∈SV i m∈SK m to the conservatism of RO. By contrast, the sizes of classical uncer-
tainty sets U1 , U2 and U∞ are controlled through the budget pa-
1 rameter , which has no explicit connection with the fraction of
= u α π qT ( u − ui ) 1 − ρ . (26)
i∈SV i m∈SK m cm · κ m data coverage.
1009
B. Han, C. Shang and D. Huang European Journal of Operational Research 292 (2021) 1004–1018
Remark 1. In fact, in the expression (26) of the uncertainty Combining (29) and (30), we arrive at the following linear system
set U (D ), ρ can also be regarded as a set parameter. Since
K(BSV,BSV ) 1 ∂ αBSV ∂ρ (BSV,SV )
−Km αSV ,
the LHS of the final expression of U (D ), i.e., i∈SV αi
m∈SK ,− T
= ∀m.
πm cm1·κ qTm (u − ui ), is a convex function in u, by adjusting ρ 1T 0 ∂πm ∂πm 0
one obtains different (1 − ρ )-sublevel sets. These sublevel sets are =:
convex and form a nested structure. That is, for all ρ1 and ρ2
satisfying ρ ρ1 ρ2 < 1, it holds that U (D; ρ1 ) ⊆ U (D; ρ2 ) ⊆ Defining as the sub-matrix of −1 with the last row and last
U (D; ρ ). This provides a practical method of constructing ambigu- column removed, we obtain:
ity sets satisfying the nesting condition in Wiesemann et al. (2014), ∂ αBSV (BSV,SV )
thereby being useful in convex DRO problems in literature. = −Km αSV , ∀m.
∂πm
2.3. An efficient learning algorithm Then the entries of the Hessian H are given by:
:= ∂ π∂m ∂J πn = −(α )T Km ∂π
∂α
2
need to set in advance a plenty of basis kernels for learning. = (α )T Km Kn(BSV,SV ) αSV (31)
Specifically, massive candidate basis kernels are required under
high-dimensional uncertainty. Meanwhile, a sufficient amount of
= (α )T K(SV,BSV ) K(BSV,SV ) α
SV m n SV , ∀m, n.
historical data are necessary for good description performance in Hence the Newton step s can be found by solving the following
order to accurately characterize the uncertainty distribution. These quadratic program (QP):
issues altogether lend the RO problem to a large-scale QCQP, which
1 T
cannot be efficiently solved by general-purpose solvers including min s Hs + gT s
cvx. Therefore, an efficient learning algorithm is indispensable. s 2
M
Many algorithms have been proposed for accelerating MKL- (32)
s.t. sm = 0,
based SVM (Aiolli & Donini, 2015; Jain, Vishwanathan, & Varma, m=1
2012; Lanckriet et al., 2004; Sonnenburg, Rätsch, Schäfer, & 0 πm + s m 1
Mμ
, ∀m
Schölkopf, 2006; Suzuki & Tomioka, 2011), but none of them ap-
where g is the gradient of J (π ):
plies to the present formulation (12) or (17). We develop herein
a new efficient HessianMKL algorithm, which is motivated by ∂J 1
gm := = − (α )T Km α , ∀m. (33)
Chapelle and Rakotomamonjy (2008). The idea is to alternately op- ∂πm 2
timize the following constrained optimization problem:
The whole procedure of the efficient learning algorithm for
min J (π ) MKL-based OC-SVM is summarized in Algorithm 1.
π
M
1
s.t. πm = 1 , 0 π m , ∀m
Mμ Algorithm 1: HessianMKL for MKL-Based OC-SVM (12).
m=1
using Newton’s method and solve the following standard single Input: M Kernel matrices {Km }, regularization parameters ν
kernel (kernel K = m πm Km ) OC-SVM problem J (π ) (or its dual) and μ.
through off-the-shelf SVM solvers such as libsvm (Chang & Lin, Output: Lagrangian multipliers α, kernel weights π .
2011): Set αi = 1/N, dm = 1/M;
⎧ while stopping criteriona not met do
⎪ 1 1 1
⎪
⎪ min wm 2 − ρ + ξi Solve (27) using existing solver to obtain the dual variable
⎨ {wm },ρ ,ξ 2 m πm Nν α;
i
J (π ) = s.t. wTm φm (ui ) ρ − ξi , ∀i (27) Compute the gradient g according to (33);
⎪
⎪
⎪
⎩ m Compute the Hessian H according to (31);
ξi 0, ∀i. Solve (32) using QP solver to obtain the Newton step s;
Choose step size τ via the exact or backtracking line
The crux in implementing this algorithm is to calculate the Hes-
search;
sian matrix and the Newton step s. According to complementary
Update π : π = π + τ · s.
slackness conditions, ∀ j ∈ BSV, we have
end
y (u j ) = αi πm Km (u j , ui ) − ρ = 0, ∀ j ∈ BSV,
i m a
The stopping criterion adopted in this paper is J 10−5 J.
⇐⇒ K(BSV,·) α − ρ 1 = 0,
where K(BSV,· ) denotes the submatrix of K where only the rows cor- 3. Computational tractability
responding to the BSVs have been preserved. Differentiating with
respect to πm , we have 3.1. The case of static RO
(BSV,· ) ∂α
∂ρ
Km α + K(BSV,·) − = 0, ∀m. (28) To illustrate the tractability of the MKL uncertainty set-induced
∂πm ∂πm
RO, we consider the following static robust linear programming
We have already known that α j = 0, ∀ j ∈ / SV and α j = 1/Nν, ∀ j ∈ (LP) problem:
SV \BSV, so ∂ α j /∂ πm = 0, ∀ j ∈
/ BSV, and hence (28) becomes
min cT x : A(u )x b, ∀u ∈ U ( D ) , (34)
(BSV,BSV ) ∂ α ∂ρ
x
(BSV,SV )
Km αSV + K BSV
− = 0, ∀m. (29) where c ∈ Rn1 , b ∈ Rm , and A(u ) ∈ Rm×n1 is the left-hand side
∂πm ∂πm
(LHS) coefficient matrix affected by uncertainty u ∈ Rn . It suffices
On the other hand, according to the equality i αi = 1, we have
to consider the following single constraint (Ben-Tal et al., 2009):
∂α
∂ αBSV
i
=0 ⇒ 1T = 0, ∀m. (30) a(u )T x b, ∀u ∈ U ( D ). (35)
i
∂πm ∂πm
1010
B. Han, C. Shang and D. Huang European Journal of Operational Research 292 (2021) 1004–1018
For the sake of exposition, we assume a(u ) to be affine in u: construction of a high-complexity uncertainty set with moderate
sizes of SVs and SKs. Meanwhile, (37) involves (2|SV ||SK | + 1 ) ad-
a(u ) = ā + Pu, (36) ditional variables in total. However, the overall complexity tends to
be benign thanks to the sparsity of SVs and SKs in MKL, even if a
where ā ∈ Rn1
is constant and P ∈ Rn1 ×n . Then the tractability of plenty of training data and candidate basis kernels are provided for
(35) can be established as follows. constructing U (D ). As a matter of fact, ν and μ are informative in
Theorem 1. (35) is equivalent to the following system of linear con- revealing proportions of SVs and SKs that are eventually selected.
straints Proposition 3 (Relationship between ν and the percentage of
⎧
⎪
T
(μim − λim ) qcmm·uκi + η (1 − ρ ) b − āT x SVs). Assume the solution of (12) with ν ∈ (0, 1] exists, then ν is an
⎪
⎪
⎪i∈SV
⎪ m ∈ SK lower bound on the percentage of SVs.
⎨ (μim − λim ) cqmm·κ = PT x
i∈SV m∈SK (37) Proof. See Schölkopf et al. (2002).
⎪
⎪λim + μim = ηαi πm , ∀i ∈ SV, m ∈ SK
⎪
⎪ Proposition 4 (Relationship between μ and the percentage of
⎩λim 0, μim 0, ∀i ∈ SV, m ∈ SK
⎪ SKs). Assume the solution of (12) with μ ∈ (0, 1] exists, then μ is
η 0.
a lower bound on the percentage of SKs.
Proof. (35) can be rewritten in a worst-case sense:
Proof. Similar to the proof of Proposition 3, SKs can contribute
max uT PT x b − āT x. (38) at most 1/Mμ to constraint m πm = 1 of (12) due to constraint
u∈U (D )
0 πm 1/Mμ, hence there must be at least Mμ SKs, which es-
To eliminate 1 -norms in U (D ), auxiliary variables = tablishes μ as a lower bound of the fraction.
{θim }i∈SV,m∈SK are introduced together with the primitive un-
The above results imply that, in principle, the decision-maker
certainty u, giving rise to the following extended uncertainty set:
becomes aware of the “minimal” complexity of the RC problem
based on ν and μ. In addition, it is known that the number of
⎧ ⎫
⎨ −θim qTm (u−ui )
θim , ∀i ∈ SV, m ∈ SK ⎬ SVs (SKs) is non-decreasing in ν and μ. These guidelines can assist
c m ·κ
U˜ν,μ (D ) = (u, ) , the user to flexibly adjust the complexity of the induced RC prob-
⎩ αi
πm θim 1 − ρ ⎭ lem with ν and μ, thereby rendering the data-driven RO scheme
i∈SV m∈SK easy-to-use in practice.
(39)
Remark 2. Based on the tractability result, a further connection
which is cast as a series of linear inequalities. Obviously, U (D ) is can be established with the well-known sample average approxi-
the projection of U˜ (D ) into the space of primitive uncertainty u, mation (SAA) for the chance constraint (Ben-Tal et al., 2009):
and thus (35) becomes: P { g( x ; u ) ∈ A } 1 − ν , (42)
max uT PT x b − āT x, (40) where the uncertainty u is approximated by its empirical distri-
(u,)∈U˜ν,μ (D )
bution Pˆ with discrete and finite support, i.e., Pˆ {u = ui } = 1/N, i =
whose LHS can be expressed as the following LP: 1, . . . , N. A natural corollary can be established that a solution x
feasible for the robust constraint in (1) induced by U (D ) must
max uT PT x
u, be also feasible for the SAA-based chance constraint (42) under
qT ( u − ui ) risk level ν . A conventional way to handle SAA-based chance con-
s.t. −θim m θim , ∀i ∈ SV, m ∈ SK (41)
cm · κ
strained linear programs is to resort to MIP formulations using the
αi
πm θim 1 − ρ .
“big-M” technique (Luedtke, Ahmed, & Nemhauser, 2010). With the
i∈SV m∈SK
MKL uncertainty set-aided RO adopted as a high-fidelity safe ap-
proximation to (42), the usage of auxiliary integer variables be-
Because the feasible region of (41) is obviously bounded and comes no longer necessary, thereby avoiding heavy computational
nonempty, (41) must have a bounded optimal value. According cost.
to the strong duality of LP, the dual problem is also feasible and
bounded, and hence optimal values of the primal and dual coin- To sum up, the MKL uncertainty set lends itself to a system-
cide. Then, by deriving the dual problem, the infinite-dimensional atic approach to uncertainty set construction as well as a power-
constraint (35) can be translated to (37), where {λim }, {μim } and η ful data-driven alternative to classical ones U1 , U2 , and U∞ . With
are dual variables of (41). conservatism and complexity conveniently adjusted, the MKL un-
certainty set acts as a user-friendly modeling tool for uncertainty.
The above LP reformulation of the robust constraint sheds light Beyond its capability of characterizing unimodal uncertainty dis-
on the desirable computational tractability secured by the pro- tributions, the MKL uncertainty set is also useful for dealing with
posed MKL-based uncertainty set in solving a class of static ro- multi-modal distributions and non-convex support. In principle,
bust optimization problems. For example, when the original deter- one can first carry out clustering and then construct uncertainty
ministic problem is an MILP, the RC problem remains an MILP as set for each data cluster individually, or partition a non-convex set
well, which can be conveniently handled by off-the-shelf solvers. into several convex regions (Zhang et al., 2016a). The overall uncer-
In a nutshell, the RC problem is of the same type as the deter- tainty set can be formed by their union, with the computational
ministic problem without robustification, provided that the uncer- tractability of the induced RO still preserved. For example, it can
tainty penetrates into constraints multiplicatively. Moreover, based serve as a basic modeling strategy in some learning procedures
on the primal-dual saddle dynamics approach that has recently where various types of uncertainty sets are integrated, e.g. Hong
emerged (Ebrahimi, Vaidya, & Elia, 2019), the proposed MKL uncer- et al. (2017), Zhang et al. (2018), and Alexeenko and Bitar (2020).
tainty set can be used to tackle the general robustified constraint
g(x, u ) 0, ∀u ∈ U where g is convex in x and strictly concave in 3.2. The case of adaptive RO
u.
Notice that, (39) indicates that the uncertainty set has Next we discuss the particular usage of the proposed MKL
(2|SV ||SK | + 1 ) facets approximately. Hence, it enables an efficient uncertainty set in multi-stage decision-making under sequentially
1011
B. Han, C. Shang and D. Huang European Journal of Operational Research 292 (2021) 1004–1018
emerging uncertainties. We first consider the role of the MKL un-
min cT x + max min dT1 y1 + · · · + max min dTT yT ,
certainty set in the following two-stage ARO problem (Ben-Tal, x∈X u1 ∈U1 y1 ∈1 (x,u1 ) uT ∈UT yT ∈T (x,y1:T −1 ,u1:T )
Goryashko, Guslitzer, & Nemirovski, 2004; Yanıkoğlu, Gorissen, &
(48)
den Hertog, 2019):
where u1 , . . . , uT are uncertainties revealed over T stages, and
min cT x + max min dT y , (43) y1 , . . . , yT are recourse decisions made from stage 1 to stage T .
x∈X u∈U (D ) y∈(x,u )
The uncertainty set U for the concatenation of stage-wise uncer-
where x ∈ Rn1 is a vector of here-and-now decisions that are made tainties u = [u1 ; u2 ; . . . ; uT ] is the Cartesian product of all stage-
prior to realization of uncertainty, and y ∈ Rn2 denotes wait-and- wise sets, i.e., U = U1 × U2 × · · · × UT . The feasible region t :=
see decisions that can be made adaptively after seeing u ∈ Rn . The t (x, y1:t−1 , u1:t ) in stage t is dependent on here-and-now deci-
set X represents the feasible region of x, while the feasible re- sions x and all preceding recourse decisions and realizations of un-
gion (x, u ) of y is reliant on both here-and-now decisions and certainty:
past uncertainty, which can be typically described using linear con- t (x, y1:t−1 , u1:t ) = {yt |At (u )x + Bt y ht (u ) }.
straints:
In process systems engineering, multi-stage ARO has found
(x, u ) = {y|A(u )x + By h(u ) },
widespread applications in process scheduling and planning
where A(u ) and h(u ) are uncertain coefficients that are affine in (Lappas & Gounaris, 2016; Ning & You, 2017b; Zhang, Morari,
u. Coefficient matrix B is assumed to be constant representing Grossmann, Sundaramoorthy, & Pinto, 2016b). In the multi-stage
fixed recourse, which is a standard assumption in literature (Chen setting, there are two intuitively plausible ways of using the MKL
& Zhang, 2009). In general, the two-stage ARO problem (43) is uncertainty set. The first is to model each stage-wise uncertainty
intractable because one has to optimize over the policy y(u ) in set Ut as an MKL-based one, which may help capturing from data
a functional space. To circumvent this conundrum, Ben-Tal et al. the individual distributional geometry of stage-wise uncertainty.
(2004) proposed a simple affine decision rule (ADR) assuming that Similar to the two-stage case, EADR can still be applied with stage-
wait-and-see variables have simple linear dependence on u, i.e., wise auxiliary variables t utilized:
1012
B. Han, C. Shang and D. Huang European Journal of Operational Research 292 (2021) 1004–1018
0.6 1
0.4
Percentage
Percentage
0.6
0.3
0.4
0.2
0.2
0.1
0 0
0 0.1 0.2 0.3 0.4 0.5 0 0.2 0.4 0.6 0.8 1
which eventually yields a tractable RC of the multi-stage ARO prob- Dataset cvx HessianMKL
lem (48). As to be illustrated in Section 5, such a strategy is helpful 1 23.7438 ± 0.0454 0.3046 ± 0.0012
for alleviating the solution’s conservatism at the price of slightly 2 26.3713 ± 0.0771 0.3058 ± 0.0031
increased computational cost. 3 42.0645 ± 0.0649 0.3479 ± 0.0009
4 25.6820 ± 0.0694 0.3068 ± 0.0020
5 105.3523 ± 1.7732 0.8140 ± 0.0045
4. Computational studies 6 24.5734 ± 0.0514 0.6510 ± 0.0018
7 29.1698 ± 0.0498 0.3042 ± 0.0017
4.1. Uncertainty set constructions 8 27.8394 ± 0.0430 0.5381 ± 0.0022
9 27.9536 ± 0.0559 0.4208 ± 0.0017
1013
B. Han, C. Shang and D. Huang European Journal of Operational Research 292 (2021) 1004–1018
5 4 4.5
4
4
4 3
3.5
3
3 2 3
2
2.5
2 1
2
1
1 0 1.5
1 2 3 4 1 2 3 4 5 1 2 3 4 5 2 3 4
5 4 5
5
3 4 4
4
3
2 3
3 2
1 2
1
2
0 0 1
-1
1 -1 0
0 1 2 3 4 0 1 2 3 4 5 0 2 4 0 1 2 3 4 5
Fig. 3. Uncertainty sets constructed on different datasets (ν = 0.01, μ = 0.05). Polytopes learnt by MKL-based OC-SVM and single-kernel SVDD are marked in red and green,
respectively. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
1014
B. Han, C. Shang and D. Huang European Journal of Operational Research 292 (2021) 1004–1018
3.6 180
Predicted demand
Demand with error
160
3.5
Optimal Value
Demand
140
3.4
120
3.3
100
3.2 80
0 0.2 0.4 0.6 0.8 1 1 2 3 4 5 6
Planning horizon
Fig. 5. RO performance under varying μ. Fig. 6. Predicted demands and prediction error data.
Table 2
Parameter setup in production-inventory problem.
basis kernels has to be used. This clearly demonstrates the effect
of kernel selection in reducing the conservatism of kernel-based Parameter Value
uncertainty sets. In addition, the performance of the MKL uncer- P 200
tainty set can be further tuned by adjusting μ. Recall that when Q 960
0 μ 1/M, the proposed formulation reduces to traditional MKL Vmin 100
Vmax 300
OC-SVM (Han et al., 2019; Rakotomamonjy et al., 2008), which se-
c [4.11, 2.69, 1.57, 1.89, 3.31, 4.43]T
lects the fewest kernels and leads to the risk of over-fitting (Xu
et al., 2013). It is hence recommended to set the value of μ slightly
higher than 1/M to attain a desirable performance and a moderate Table 3
Optimal values and solution times of the different models.
computational complexity.
Model Optimal value Solution time (seconds)
5. Multi-stage production-inventory management Box + ADR 1944.27 0.27
SVDD + ADR 1954.85 0.31
In this section, we investigate the utilization of our proposed MKL + ADR 1859.31 0.48
MKL + EADR 1843.92 0.86
MKL uncertainty set in a production-inventory problem, which can
be cast in a multi-stage robust optimization setting (Ben-Tal et al.,
2004). Consider a factory with a warehouse producing a single
product. The goal is to determine its production plan for a planning help capturing correlation and asymmetry of distribution of uncer-
horizon of T periods, by minimizing the production cost while sat- tainty
over horizon, and we denote by U (D ) =
the entire planning
isfying the market demand and inventory level constraints in each δδ = δ + w, ∀w ∈ W (D ) the uncertainty set of δ. In a multi-
period. Mathematically, the deterministic problem can be expressed stage decision-making procedure, the decision maker is allowed
to adjust pt after knowing market demands δ1:t−1 = [δ1 , . . . , δt−1 ]
T
as the following LP:
observed prior to period t. To minimize the worst-case production
T
min ct pt cost, a multi-stage ARO formulation is given by:
{ pt },{vt } t=1
s.t. 0 pt P, t = 1, . . . , T max min c1 p1 + max min
T δ1 ∈U1 p1 ∈1 δ2 ∈U2 p2 ∈2 (v1 )
pt Q
t=1
vt = vt−1 + pt − δt , t = 1, . . . , T c2 p2 + · · · + max min
δT ∈UT pT ∈T (vT −1 )
cT pT , (54)
Vmin vt Vmax , t = 1, . . . , T ,
where t (vr−1 ) is the feasible region of pt defined by the inven-
where ct and pt are the production cost and production quantity
tory level in the preceding period. Ut is the projection of U (D ) onto
in period t. P and Q are maximal production limits in each period
the space of δt . The above problem instantiates the general multi-
and over the entire planning horizon. δt is the market demand in
stage ARO formulation (48), and thus can be approximated as an
period t, and vt is the inventory level in period t, with the minimal
LP problem by means of ADR and EADR.
allowed level Vmin and the maximal storage capacity Vmax .
In this case study, we consider a planning horizon with T =
Under demand uncertainty, it is assumed that a prediction
6 periods in total. A dataset of prediction error {wi }500 i=1
has al-
model is available such that the uncertain demand decomposes
ready been collected from past experience. The predicted demands
into:
and those added with empirical prediction errors are shown in
δt = δt + wt , t = 1, . . . , T , (53) Fig. 6, where the variance of prediction error increases over stages.
T Other parameters in the optimization problem are summarized in
where the nominal demand δ = δ1 , . . . , δT can be predicted by Table 2. The initial inventory level is set as v0 = 150.
the model at hand and the prediction error w = [w1 , . . . , wT ]T is We adopt three different uncertainty sets to learn from data
assumed to reside in a time-invariant uncertainty set W. Histor- and then formulate the multi-stage ARO problem, including the
ical data of prediction errors can be accumulated as a dataset classical box uncertainty set U∞ , the single-kernel SVDD set (Shang
D, based on which the proposed MKL uncertainty set W (D ) can et al., 2017), and the proposed MKL uncertainty set. We use ν =
be established. The use of a data-driven uncertainty set may 0.01 for the last two kernel-based sets, which ensures that both
1015
B. Han, C. Shang and D. Huang European Journal of Operational Research 292 (2021) 1004–1018
Table 4
Computational complexity with increasing T of the “MKL + EADR” model.
T #BK Com. mem. Lea. time (seconds) #Con. #Var. Sol. time (seconds)
sets contain at least 99% of historical data. For a fair comparison, method is suitable for uncertainty with dimensions lower than 8.
the box set is calibrated to include 99% of historical data. To con- Meanwhile, the sizes of constraints and variables in the robust
struct the MKL uncertainty set, kernel functions are specified with counterpart problem grow with n. In spite of growing complexity,
truncated uncertainty samples based on (50) to be compatible with it is quite computationally thrifty to solve the robust counterpart
the non-anticipativity requirement such that the usage of EADR is problem to obtain the decision, which is due to an LP formulation
plausible. In contrast, the box set and the single-kernel SVDD set of the robust counterpart problem.
only enable the use of ADR. All induced RC problems can be cast Despite such challenges, in face of high-dimensional uncer-
as LPs. The optimal values and solution times of different models tainty one can resort to some useful statistical methods together
are summarized in Table 3. with the proposed approach to circumvent the curse of dimen-
It can be observed that, the model with box uncertainty set and sionality. For example, if all uncertain parameters are closely cor-
ADR yields an optimal value of 1944.27 as well as the lowest solu- related, one could perform dimension reduction (e.g. based on
tion time. The single-kernel SVDD model with ADR yields a slightly PCA) first, and then construct an uncertainty set in the reduced-
higher optimal value. This may be due to the curse of dimension- dimensional subspace (Shang & You, 2018). In this way, the curse
ality that traditional single-kernel methods typically suffer from, of dimensionality can be much alleviated. Conversely, if all uncer-
because in this case δ has six dimensions. By contrast, the use of tain parameters are not closely correlated, then one can split them
the proposed MKL uncertainty set with ADR leads to a significantly into several independent groups of smaller sizes, and then utilize
reduction of production cost. This sheds light on the effectiveness the proposed MKL uncertainty set to handle each group individ-
of learning with multiple kernel selected in alleviating the curse of ually. Henceforth, these two basic strategies can also be jointly
dimensionality and the conservatism of learning-based RO. Beyond adopted to handle general cases. In this way, high-dimensional un-
that, without paying too much additional effort, one can attain an certainty can still be tackled with our MKL approach serving as
improved performance using the MKL set along with EADR, which a basic modeling tool. All above-mentioned facts draw a compre-
owes to both the modeling power of MKL and the expressiveness hensive picture about the effectiveness and limitations of the pro-
of EADR. Note that computational burdens of using MKL sets are posed MKL-based uncertainty set, thereby providing insights into
only slightly higher than those associated with box set and SVDD its practical usage.
set, mainly because of the use of multiple basis kernels in the un-
certainty set. Nevertheless, the resultant LP problems can still be 6. Concluding remarks
tackled efficiently with off-the-shelf solvers. Therefore, integrating
the MKL uncertainty set with EADR can be an appealing choice to In this work, we propose a novel data-driven RO framework –
approximately solve multi-stage ARO problems. MKL-aided RO – to cope with static RO as well as ARO. With our
MKL-based OC-SVM learning approach, a compactly enclosing un-
5.1. Scalability of the proposed approach and discussions certainty set can be obtained, especially when the distribution of
the uncertainty has evident asymmetry, which alleviates the over-
In order to investigate the practical scalability of the ap- conservatism of the induced RO and narrows the gap between the
proach proposed in this paper, we still take the above multi-stage optimization model and the real-world situation. Thanks to the
production-inventory management problem as an example. Using inherent sparsity of the MKL-based OC-SVM, the uncertainty set
the “MKL + EADR” model, we investigate the number of candi- turns out to be a polytope with succinct expression, which ensures
date basis kernels (#BK), the minimal computer memory (Com. the computational tractability of the induced RO. The MKL-aided
Mem.) and the learning time (Lea. time) required for the uncer- RO framework is user-friendly because two hyper-parameters have
tainty set learning, the number of constraints (#Con.), variables explicit statistical meaning, which allow the user to conveniently
(#Var.) and the solution time (Sol. time) of the induced tractable balance between the conservativeness and the computational cost.
RC with progressively increasing T (see Table 4). The uncertain This data-driven RO framework is also compatible with EADR in
data D = {wi }N in this experiment follow a mixture Gaussian dis- the multi-stage ARO problems, which further improves optimiza-
i=1
tribution of varying dimensions, and the data size is N = 500. The tion performance. Finally, numerical and application case studies
demonstrate the capability of the proposed framework in further
predicted demands δ are set as random, and the other settings are
the same as the previous case. improving the RO performance without bringing excessive compu-
Table 4 shows that the proposed approach can model uncer- tational burden.
tainties of dimension n 7 at a computational cost less than 5 min
on a personal computer, with P = 4 directions specified along each Acknowledgments
dimension. When one proceeds with n = 7, P = 5 or n = 8, P = 4,
about two hours are needed for modeling, which exhibit the curse This work is supported in part by National Science and Technol-
of dimensionality obviously. When n continuous to grow, the com- ogy Innovation 2030 Major Project of the Ministry of Science and
putational cost becomes prohibitively unaffordable since both the Technology of China under Grant 2018AAA0101604, and National
solution time and memory needed by HessianMKL algorithm go Natural Science Foundation of China (Nos. 61673236, 61433001,
beyond limits that are practically acceptable. Hence, the proposed and 61873142).
1016
B. Han, C. Shang and D. Huang European Journal of Operational Research 292 (2021) 1004–1018
References Han, B., Shang, C., Yang, F., & Huang, D. (2019). Multiple kernel learning-based un-
certainty set construction for robust optimization. In Proceedings of the 15th
Aiolli, F., & Donini, M. (2015). EasyMKL: A scalable multiple kernel learning algo- IEEE international conference on control and automation (ICCA) (pp. 1417–1422).
rithm. Neurocomputing, 169, 215–224. IEEE.
Alexeenko, P., & Bitar, E. (2020). Nonparametric estimation of uncertainty sets for Hong, L. J., Huang, Z., & Lam, H. (2017). Learning-based robust optimization: Proce-
robust optimization. arXiv preprint arXiv:2004.03069. dures and statistical guarantees. arXiv preprint arXiv:1704.04342.
Balcik, B., & Yanıkoğlu, İ. (2020). A robust optimization approach for humanitarian Huang, X., Shi, L., & Suykens, J. A. K. (2013). Support vector machine classifier with
needs assessment planning under travel time uncertainty. European Journal of pinball loss. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(5),
Operational Research, 282(1), 40–57. 984–997.
Ben-Tal, A., El Ghaoui, L., & Nemirovski, A. (2009). Robust optimization. Princeton Jain, A., Vishwanathan, S. V., & Varma, M. (2012). SPG-GMKL: generalized multiple
University Press. kernel learning with a million kernels. In Proceedings of the 18th ACM SIGKDD
Ben-Tal, A., Goryashko, A., Guslitzer, E., & Nemirovski, A. (2004). Adjustable ro- international conference on knowledge discovery and data mining (pp. 750–758).
bust solutions of uncertain linear programs. Mathematical Programming, 99(2), ACM.
351–376. Jakubovskis, A. (2017). Strategic facility location, capacity acquisition, and tech-
Ben-Tal, A., & Nemirovski, A. (1998). Robust convex optimization. Mathematics of nology choice decisions under demand uncertainty: Robust vs. non-robust
Operations Research, 23(4), 769–805. optimization approaches. European Journal of Operational Research, 260(3),
Ben-Tal, A., & Nemirovski, A. (1999). Robust solutions of uncertain linear programs. 1095–1104.
Operations Research Letters, 25(1), 1–13. Jalilvand-Nejad, A., Shafaei, R., & Shahriari, H. (2016). Robust optimization under
Ben-Tal, A., & Nemirovski, A. (20 0 0). Robust solutions of linear programming correlated polyhedral uncertainty set. Computers & Industrial Engineering, 92,
problems contaminated with uncertain data. Mathematical Programming, 88(3), 82–94.
411–424. Lanckriet, G. R., Cristianini, N., Bartlett, P., Ghaoui, L. E., & Jordan, M. I. (2004). Learn-
Bertsimas, D., Brown, D. B., & Caramanis, C. (2011). Theory and applications of ro- ing the kernel matrix with semidefinite programming. Journal of Machine learn-
bust optimization. SIAM Review, 53(3), 464–501. ing research, 5(Jan), 27–72.
Bertsimas, D., Gupta, V., & Kallus, N. (2018). Data-driven robust optimization. Math- Lappas, N. H., & Gounaris, C. E. (2016). Multi-stage adjustable robust optimization
ematical Programming, 167(2), 235–292. for process scheduling under uncertainty. AIChE Journal, 62(5), 1646–1667.
Bertsimas, D., Pachamanova, D., & Sim, M. (2004). Robust linear optimization under Li, Z., Ding, R., & Floudas, C. A. (2011). A comparative theoretical and computational
general norms. Operations Research Letters, 32(6), 510–516. study on robust counterpart optimization: I. Robust linear optimization and ro-
Bertsimas, D., & Sim, M. (2003). Robust discrete optimization and network flows. bust mixed integer linear optimization. Industrial & Engineering Chemistry Re-
Mathematical Programming, 98(1–3), 49–71. search, 50(18), 10567.
Bertsimas, D., & Sim, M. (2004). The price of robustness. Operations Research, 52(1), Li, Z., Tang, Q., & Floudas, C. A. (2012). A comparative theoretical and compu-
35–53. tational study on robust counterpart optimization: II. Probabilistic guarantees
Bertsimas, D., & Thiele, A. (2006). Robust and data-driven optimization: modern de- on constraint satisfaction. Industrial & Engineering Chemistry Research, 51(19),
cision making under uncertainty. In Models, methods, and applications for inno- 6769–6788.
vative decision making (pp. 95–122). INFORMS. Luedtke, J., Ahmed, S., & Nemhauser, G. L. (2010). An integer programming approach
Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge University for linear programs with probabilistic constraints. Mathematical Programming,
Press. 122(2), 247–272.
Caballero, W. N., Lunday, B. J., & Uber, R. P. (2021). Identifying behaviorally robust Margellos, K., Goulart, P., & Lygeros, J. (2014). On the road between robust optimiza-
strategies for normal form games under varying forms of uncertainty. European tion and the scenario approach for chance constrained optimization problems.
Journal of Operational Research, 288(3), 971–982. IEEE Transactions on Automatic Control, 59(8), 2258–2263.
Camerer, C., & Weber, M. (1992). Recent developments in modeling preferences: Un- Mohseni, S., & Pishvaee, M. S. (2020). Data-driven robust optimization for wastewa-
certainty and ambiguity. Journal of Risk and Uncertainty, 5(4), 325–370. ter sludge-to-biodiesel supply chain design. Computers & Industrial Engineering,
Campbell, T., & How, J. P. (2015). Bayesian nonparametric set construction for ro- 139, 105944.
bust optimization. In Proceedings of the American control conference (ACC), 2015 Moret, S., Babonneau, F., Bierlaire, M., & Maréchal, F. (2020). Decision support for
(pp. 4216–4221). IEEE. strategic energy planning: A robust optimization framework. European Journal
Chang, C.-C., & Lin, C.-J. (2011). LIBSVM: A library for support vector machines. ACM of Operational Research, 280(2), 539–554.
Transactions on Intelligent Systems and Technology, 2(3), 27. Natarajan, K., Pachamanova, D., & Sim, M. (2008). Incorporating asymmetric distri-
Chapelle, O., & Rakotomamonjy, A. (2008). Second order optimization of kernel pa- butional information in robust value-at-risk optimization. Management Science,
rameters. In Proceedings of the NIPS workshop on kernel learning: Automatic se- 54(3), 573–585.
lection of optimal kernels: 19 (p. 87). Ning, C., & You, F. (2017a). Data-driven adaptive nested robust optimization: General
Charnes, A., & Cooper, W. W. (1959). Chance-constrained programming. Management modeling framework and efficient computational algorithm for decision making
Science, 6(1), 73–79. under uncertainty. AIChE Journal, 63(9), 3790–3817.
Chen, X., Sim, M., & Sun, P. (2007). A robust optimization perspective on stochastic Ning, C., & You, F. (2017b). A data-driven multistage adaptive robust optimization
programming. Operations Research, 55(6), 1058–1071. framework for planning and scheduling under uncertainty. AIChE Journal, 63(10),
Chen, X., Sim, M., Sun, P., & Zhang, J. (2008). A linear decision-based approximation 4343–4369.
approach to stochastic programming. Operations Research, 56(2), 344–357. Ning, C., & You, F. (2018). Data-driven decision making under uncertainty integrating
Chen, X., & Zhang, Y. (2009). Uncertain linear programs: Extended affinely ad- robust optimization with principal component analysis and kernel smoothing
justable robust counterparts. Operations Research, 57(6), 1469–1482. methods. Computers & Chemical Engineering, 112, 190–210.
Crespo, L. G., Colbert, B. K., Kenny, S. P., & Giesy, D. P. (2019). On the quantification Ning, C., & You, F. (2019). Optimization under uncertainty in the era of big data
of aleatory and epistemic uncertainty using sliced-normal distributions. Systems and deep learning: When machine learning meets mathematical programming.
& Control Letters, 134, 104560. Computers & Chemical Engineering, 125, 434–448.
Dai, X., Wang, X., He, R., Du, W., Zhong, W., Zhao, L., & Qian, F. (2019). Data– Odone, F., Barla, A., & Verri, A. (2005). Building kernels from binary strings for im-
driven robust optimization for crude oil blending under uncertainty. Computers age matching. IEEE Transactions on Image Processing, 14(2), 169–180.
& Chemical Engineering, 106595. Rakotomamonjy, A., Bach, F. R., Canu, S., & Grandvalet, Y. (2008). SimpleMKL. Journal
Dantzig, G. B. (1955). Linear programming under uncertainty. Management Science, of Machine Learning Research, 9(Nov), 2491–2521.
1(3–4), 197–206. Schölkopf, B., Platt, J. C., Shawe-Taylor, J., Smola, A. J., & Williamson, R. C. (2001).
Ebrahimi, K., Vaidya, U., & Elia, N. (2019). Robust optimization via discrete-time sad- Estimating the support of a high-dimensional distribution. Neural Computation,
dle point algorithm. In Proceedings of the 58th IEEE conference on decision and 13(7), 1443–1471.
control (CDC) (pp. 2473–2478). IEEE. Schölkopf, B., Smola, A. J., Bach, F., et al. (2002). Learning with kernels: Support vector
El-Ghaoui, L., & Lebret, H. (1997). Robust solutions to least-square problems to machines, regularization, optimization, and beyond. MIT Press.
uncertain data matrices. SIAM Journal on Matrix Analysis and Applications, 18, Shang, C., Chen, W. H., Stroock, A. D., & You, F. (2020). Robust model predictive
1035–1064. control of irrigation systems with active uncertainty learning and data analytics.
El-Ghaoui, L., Oustry, F., & Lebret, H. (1998). Robust solutions to uncertain semidef- IEEE Transactions on Control Systems Technology, 28, 1493–1504.
inite programs. SIAM Journal on Optimization, 9(1), 33–52. Shang, C., Huang, X., & You, F. (2017). Data-driven robust optimization based on
Ferreira, R. d. S., Barroso, L., & Carvalho, M. M. (2012). Demand response models kernel learning. Computers & Chemical Engineering, 106(2), 464–479.
with correlated price data: A robust optimization approach. Applied Energy, 96, Shang, C., & You, F. (2018). Robust optimization in high-dimensional data space with
133–149. support vector clustering. IFAC-PapersOnLine, 51(18), 19–24.
Gabrel, V., Murat, C., & Thiele, A. (2014). Recent advances in robust optimization: Shang, C., & You, F. (2019a). Data analytics and machine learning for smart process
An overview. European Journal of Operational Research, 235(3), 471–483. manufacturing: Recent advances and perspectives in the big data era. Engineer-
Goh, J., & Sim, M. (2010). Distributionally robust optimization and its tractable ap- ing, 5(6), 1010–1016.
proximations. Operations Research, 58(4-part-1), 902–917. Shang, C., & You, F. (2019b). A data-driven robust optimization approach to sce-
Grant, M., Boyd, S., & Ye, Y. (2008). CVX: Matlab software for disciplined convex nario-based stochastic model predictive control. Journal of Process Control, 75,
programming. 24–39.
Guzman, Y. A., Matthews, L. R., & Floudas, C. A. (2016). New a priori and a Shen, F., Zhao, L., Du, W., Zhong, W., & Qian, F. (2020). Large-scale industrial energy
posteriori probabilistic bounds for robust counterpart optimization: I. Un- systems optimization under uncertainty: A data-driven robust optimization ap-
known probability distributions. Computers & Chemical Engineering, 84, 568– proach. Applied Energy, 259, 114199.
598.
1017
B. Han, C. Shang and D. Huang European Journal of Operational Research 292 (2021) 1004–1018
Sonnenburg, S., Rätsch, G., Schäfer, C., & Schölkopf, B. (2006). Large scale mul- Yuan, Y., Li, Z., & Huang, B. (2016). Robust optimization under correlated uncer-
tiple kernel learning. Journal of Machine Learning Research, 7(Jul), 1531– tainty: Formulations and computational study. Computers & Chemical Engineer-
1565. ing, 85, 58–71.
Soyster, A. L. (1973). Convex programming with set-inclusive constraints and appli- Zhang, Q., Grossmann, I. E., Sundaramoorthy, A., & Pinto, J. M. (2016a). Data-driven
cations to inexact linear programming. Operations Research, 21(5), 1154–1157. construction of convex region surrogate models. Optimization and Engineering,
Suzuki, T., & Tomioka, R. (2011). SpicyMKL: A fast algorithm for multiple kernel 17(2), 289–332.
learning with thousands of kernels. Machine Learning, 85(1–2), 77–108. Zhang, Q., Morari, M. F., Grossmann, I. E., Sundaramoorthy, A., & Pinto, J. M. (2016b).
Wiesemann, W., Kuhn, D., & Sim, M. (2014). Distributionally robust convex opti- An adjustable robust optimization approach to scheduling of continuous indus-
mization. Operations Research, 62(6), 1358–1376. trial processes providing interruptible load. Computers & Chemical Engineering,
Xu, X., Tsang, I. W., & Xu, D. (2013). Soft margin multiple kernel learning. IEEE Trans- 86, 106–119.
actions on Neural Networks and Learning Systems, 24(5), 749–761. Zhang, Y., Jin, X., Feng, Y., & Rong, G. (2018). Data-driven robust optimization under
Yanıkoğlu, İ., Gorissen, B. L., & den Hertog, D. (2019). A survey of adjustable robust correlated uncertainty: A case study of production scheduling in ethylene plant.
optimization. European Journal of Operational Research, 277(3), 799–813. Computers & Chemical Engineering, 109, 48–67.
1018