Theory of Global Random Search

Mathematics and Its Applications

Managing Editor:
M. HAZEWINKEL
Centre for Mathematics and Computer Science, Amsterdam, The Netherlands

Editorial Board:

Volume 65

Theory of Global Random Search

by
Anatoly A. Zhigljavsky
Leningrad University, U.S.S.R.

Edited by
J. Pinter

ISBN 978-94-010-5519-2
SERIES EDITOR'S PREFACE

'Et moi, ..., si j'avait su comment en revenir, je n'y serais point allé.' ('And I, ..., had I known how to come back, I would never have gone.')
Jules Verne

One service mathematics has rendered the human race. It has put common sense back where it belongs, on the topmost shelf next to the dusty canister labelled 'discarded nonsense'.
Eric T. Bell

The series is divergent; therefore we may be able to do something with it.
O. Heaviside
Mathematics is a tool for thought. A highly necessary tool in a world where both feedback and non-
linearities abound. Similarly, all kinds of parts of mathematics serve as tools for other parts and for
other sciences.
Applying a simple rewriting rule to the quote on the right above one finds such statements as:
'One service topology has rendered mathematical physics .. .'; 'One service logic has rendered com-
puter science ...'; 'One service category theory has rendered mathematics .. .'. All arguably true. And
all statements obtainable this way form part of the raison d'etre of this series.
This series, Mathematics and Its Applications, started in 1977. Now that over one hundred
volumes have appeared it seems opportune to reexamine its scope. At the time I wrote
"Growing specialization and diversification have brought a host of monographs and
textbooks on increasingly specialized topics. However, the 'tree' of knowledge of
mathematics and related fields does not grow only by putting forth new branches. It
also happens, quite often in fact, that branches which were thought to be completely
disparate are suddenly seen to be related. Further, the kind and level of sophistication
of mathematics applied in various sciences has changed drastically in recent years:
measure theory is used (non-trivially) in regional and theoretical economics; algebraic
geometry interacts with physics; the Minkowski lemma, coding theory and the structure
of water meet one another in packing and covering theory; quantum fields, crystal
defects and mathematical programming profit from homotopy theory; Lie algebras are
relevant to filtering; and prediction and electrical engineering can use Stein spaces. And
in addition to this there are such new emerging subdisciplines as 'experimental
mathematics', 'CFD', 'completely integrable systems', 'chaos, synergetics and large-scale
order', which are almost impossible to fit into the existing classification schemes. They
draw upon widely different sections of mathematics."
By and large, all this still applies today. It is still true that at first sight mathematics seems rather
fragmented and that to find, see, and exploit the deeper underlying interrelations more effort is
needed and so are books that can help mathematicians and scientists do so. Accordingly MIA will
continue to try to make such books available.
If anything, the description I gave in 1977 is now an understatement. To the examples of
interaction areas one should add string theory where Riemann surfaces, algebraic geometry, modu-
lar functions, knots, quantum field theory, Kac-Moody algebras, monstrous moonshine (and more)
all come together. And to the examples of things which can be usefully applied let me add the topic
'finite geometry'; a combination of words which sounds like it might not even exist, let alone be
applicable. And yet it is being applied: to statistics via designs, to radar/sonar detection arrays (via
finite projective planes), and to bus connections of VLSI chips (via difference sets). There seems to
be no part of (so-called pure) mathematics that is not in immediate danger of being applied. And,
accordingly, the applied mathematician needs to be aware of much more. Besides analysis and
numerics, the traditional workhorses, he may need all kinds of combinatorics, algebra, probability,
and so on.
In addition, the applied scientist needs to cope increasingly with the nonlinear world and the
extra mathematical sophistication that this requires. For that is where the rewards are. Linear
models are honest and a bit sad and depressing: proportional efforts and results. It is in the nonlinear world that infinitesimal inputs may result in macroscopic outputs (or vice versa). To appreciate what I am hinting at: if electronics were linear we would have no fun with transistors and computers; we would have no TV; in fact you would not be reading these lines.
There is also no safety in ignoring such outlandish things as nonstandard analysis, superspace
and anticommuting integration, p-adic and ultrametric space. All three have applications in both
electrical engineering and physics. Once, complex numbers were equally outlandish, but they fre-
quently proved the shortest path between 'real' results. Similarly, the first two topics named have
already provided a number of 'wormhole' paths. There is no telling where all this is leading -
fortunately.
Thus the original scope of the series, which for various (sound) reasons now comprises five sub-
series: white (Japan), yellow (China), red (USSR), blue (Eastern Europe), and green (everything
else), still applies. It has been enlarged a bit to include books treating of the tools from one subdis-
cipline which are used in others. Thus the series still aims at books dealing with:
- a central concept which plays an important role in several different mathematical and/or
scientific specialization areas;
- new applications of the results and ideas from one area of scientific endeavour into another;
- influences which the results, problems and concepts of one field of enquiry have, and have had,
on the development of another.
A very large part of mathematics has to do with optimization in one form or another. There are
good theoretical reasons for that because to understand a phenomenon it is usually a good idea to
start with extremal cases, but in addition - if not predominantly - all kinds of optimization prob-
lems come directly from practical situations: How to pack a maximal number of spare parts in a
box? How to operate a hydroelectric plant optimally? How to travel most economically from one
place to another? How to minimize fuel consumption of an aeroplane? How to assign departure
gates at an airport optimally? etc., etc. This is perhaps also the area of mathematics which is most
visibly applicable. And it is, in fact, astonishing how much can be earned (or saved) by a mathemat-
ical analysis in many cases.
In complicated situations - and many practical ones are very complicated - there tend to be
many local extrema. Finding one of these is a basically well-understood affair. It is a totally different
matter to find a global extremum. The first part of this book surveys and analyses the known
methods for doing this. The second and main part is concerned with the powerful technique of ran-
dom search methods for global extrema. This phrase describes a group of methods that have many
advantages - whence their popularity - such as simple implementation (also on parallel processor
machines) and stability (both with respect to perturbations and uncertainties) and some disadvan-
tages: principally relatively low speed and not nearly enough theoretical background results. In this
last direction the author has made fundamental and wide ranging contributions. Many of these
appear here for the first time in a larger integrated context.
The book addresses itself both to practitioners who want to use and implement random search methods (and it explains when it may be wise to consider these methods), and to specialists who need an up-to-date, authoritative survey of the field.
The shortest path between two truths in the real domain passes through the complex domain.
J. Hadamard

Never lend books, for no one ever returns them; the only books I have in my library are books that other folk have lent me.
Anatole France

La physique ne nous donne pas seulement l'occasion de résoudre des problèmes ... elle nous fait pressentir la solution. ('Physics does not merely give us the occasion to solve problems ... it makes us sense the solution in advance.')
H. Poincaré

The function of an expert is not to be more right than other people, but to be wrong for more sophisticated reasons.
David Butler
PREFACE ... xv
4.6.1. Random distributions neutral to the right and their properties ... 179
4.6.2. Bayesian testing about quantiles of random distributions ... 182
4.6.3. Application of distributions neutral to the right to construct global random search algorithms ... 182
7.1. Statistical inference when the tail index of the extreme value distribution is known ... 239
7.2. Statistical inference when the tail index is unknown ... 259
7.2.1. Statistical inference for M ... 259
7.2.2. Estimation of α ... 262
7.2.3. Construction of confidence intervals and statistical hypothesis tests for α ... 265
The investigation of any computational method in the form in which it has been realized on a particular computer is extremely complicated. As a rule, a simplified model of the method (neglecting, in particular, the effects of rounding and the errors of some approximations) is studied instead. Frequently, stochastic approaches based on a probabilistic analysis of computational processes are highly efficient. They are natural, for instance, for investigating high-dimensional problems, where deterministic solution techniques are often inefficient. Among others, the global optimization problem can be cited as an example of a problem where the probabilistic approach proves very fruitful.
The English version of the book by Professor Anatoly A. Zhigljavsky now offered to the reader is devoted primarily to the development and study of probabilistic global optimization algorithms, in which the evolution of the probability distributions corresponding to the computational process is studied. It seems to be the first time in the literature that rigorous results grounding and optimizing a wide range of global random search algorithms are treated in a unified manner. A thorough survey and analysis of the results of other authors, together with the clarity of the presentation, are great merits of the book.
A. Zhigljavsky is one of the representatives of the Leningrad school of theoretical probability studying Monte Carlo methods and their applications. Despite his young age, he has also participated in writing well-known monographs on experimental design theory and on the detection of abrupt changes in random processes.
Certainly, the book is going to be interesting and useful for a great number of mathematicians dealing with optimization theory, as well as for users employing optimization methods to solve various applied problems.
Many sections of the book can be read almost independently. A large part of its
exposition is intended not only for theoreticians, but also for users of the optimization
algorithms.
The Russian version of the book, published in 1985, was considerably revised. Thus, Chapters 1-3 and Section 4.4 were completely rewritten, and various recent results of other authors were taken into account. On the other hand, the exposition of a number of original results of minor importance for global random search theory was reduced.
I am indebted to Professor S.M. Ermakov for his valuable influence on my scientific interests. I am grateful to Professor A. Zilinskas for his help in reviewing the current state of global optimization theory, and to Dr. Pinter for a careful reading of the manuscript that led to its substantial improvement. I also wish to thank Professor G.A. Mikhailov and Professor F. Pukelsheim for many helpful discussions on the subject of Section 8.2, and E. P. Andreeva for help with the translation and for the careful typing of the manuscript.
Anatoly A. Zhigljavsky
Leningrad University
USSR
LIST OF BASIC NOTATIONS

min_{x∈X} f(x) denotes the minimum of f over X; the same concerns the operations inf, max, sup

∫ g(x)dx is a short-hand notation for ∫_X g(x)dx

z* = arg min_{z∈Z} g(z) denotes a point z* ∈ Z such that g(z*) = min_{z∈Z} g(z); the same concerns the operation max

x* = arg min f(x) or x* = arg max f(x) (depending on the context)

1_A(x) is the indicator of a set A, 1{x∈A} is the indicator of an event {x∈A}, i.e.
1_A(x) = 1{x∈A} = 1 if x ∈ A, and 0 if x ∉ A
The most common form of the global optimization problem is the following. Let X be a set called the feasible region and f: X → R¹ be a given function called the objective function; it is then required to approximate the value

f* = min_{x∈X} f(x).    (1.1.1)

The maximization problem, in which the value

M = sup_{x∈X} f(x)

is to be approximated, can be treated analogously and can obviously be derived from the minimization problem by substituting -f for f.
A procedure constructing a sequence (x_k) of points in X converging to a point at which the objective function value equals or approximates the value f* is called a global minimization method (or algorithm). The types of convergence may differ, e.g. convergence with respect to values of f or convergence with some probability. As the amount of computational effort is always restricted in practice, only a finite subsequence of (x_k) can be generated. In constructing it, one usually tries to reach a desirable (or optimal) accuracy while spending the smallest possible (or a bounded) computational effort.
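As a minimal illustration of such a finite search sequence, the sketch below implements pure random search, the simplest global random search scheme: it draws the points x_k independently and uniformly from a hyperrectangle and keeps the record value. The test function (the Rastrigin function) and all names here are illustrative choices, not taken from the text.

```python
import math
import random

def pure_random_search(f, bounds, n_evals=1000, seed=0):
    """Generate a finite search sequence (x_k) by independent uniform
    sampling over a hyperrectangle and keep the record (best) value."""
    rng = random.Random(seed)
    best_x, best_f = None, float("inf")
    for _ in range(n_evals):
        x = [rng.uniform(a, b) for a, b in bounds]
        fx = f(x)
        if fx < best_f:  # new record value of the sequence
            best_x, best_f = x, fx
    return best_x, best_f

# Illustrative multiextremal objective; its global minimum is f* = 0 at the origin.
rastrigin = lambda x: sum(xi * xi - 10.0 * math.cos(2.0 * math.pi * xi) + 10.0 for xi in x)
x_star, f_star = pure_random_search(rastrigin, [(-5.0, 5.0), (-5.0, 5.0)], n_evals=20000)
```

The record value converges to f* (with probability one) as the number of evaluations grows, but only a finite budget n_evals is ever spent.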
Depending on the statement of the optimization problem (1.1.1), prior information on f and X, as well as the values of f (and perhaps of its derivatives) at previous points of the sequence (and occasionally also at some auxiliary points), may be used for constructing the search point sequence (x_k). Sometimes it is supposed that f is a regression function: in this case, the evaluations of f may be subject to random noise.
2 Chapter 1
As a rule, the global optimization problem is stated in such a way that a point x* ∈ X exists at which the minimal value f* is attained. We shall call an arbitrary point x* ∈ X with the property f(x*) = f* a global minimizer and denote it by

x* = arg min_{x∈X} f(x)

or, more simply, by x* = arg min f. In general, this point is not necessarily unique.
As a rule, the initial minimization problem is stated as the problem of approximating a point x* and the value f*. Sometimes only the value f* is required, but a point x* is not: naturally, such problems are somewhat simpler.
Approximation of a point x* = arg min f and the value f* is usually interpreted as finding a point in the set

A(δ) ∪ B(x*, ε),  where  A(δ) = {x ∈ X: f(x) ≤ f* + δ},  B(x*, ε) = {x ∈ X: ρ(x, x*) ≤ ε},    (1.1.3)

where ρ is a given metric on X, and δ and ε determine the accuracy with respect to function (or, correspondingly, argument) values.
The complexity of the optimization problem is determined mainly by the properties of the feasible region and the objective function. There is a duality concerning the properties of X and f. If explicit forms of f and X are known and f is complicated, then the optimization problem can be reformulated in such a way that the objective function transforms into a simple one (for instance, a linear one) but the feasible region becomes complicated. The opposite type of reformulation is possible, too. Usually, global optimization problems with relatively simply structured sets X are considered (as they are, in general, easier to solve even if the objective function is complicated).
Unlike local optimization problems, a global one cannot be solved in general if X is not bounded. The boundaries of X correspond to the prior information concerning the location of x*. The wider the boundaries of X are, the larger is the uncertainty about the location of x* (i.e. the more complicated is the problem). Supposing the closedness of X and the continuity of f in a neighbourhood of a global minimizer, we ensure that x* ∈ X, i.e. the global minimum of f is attained in X.
Typically, X can be endowed with a metric ρ, but usually there are many different metrics and there is no natural way of choosing a representative amongst them. The ambiguity of the metric choice is connected, for instance, with the scaling of the variables. The properties of the selected metric influence the features of many optimization algorithms; therefore its selection must be performed carefully. In the case X ⊂ R^n it is supposed that ρ is the Euclidean metric, unless otherwise stated.
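A tiny sketch of why the metric choice matters: merely rescaling one coordinate (a change of units, say) changes Euclidean distances and can reverse which of two candidate points counts as "nearer" to a minimizer, so a ball B(x*, ε) contains different points before and after scaling. The points and scale factors below are arbitrary illustrations.

```python
import math

def euclid(u, v):
    """Euclidean metric on R^n."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

x_star = (0.0, 0.0)
p, q = (0.1, 0.0), (0.0, 0.3)  # p is closer to x* than q in the raw coordinates
assert euclid(p, x_star) < euclid(q, x_star)

# Shrink the second coordinate by a factor of 10 (a change of units):
# the ordering of distances to x* is reversed.
scale = (1.0, 0.1)
ps = tuple(s * c for s, c in zip(scale, p))
qs = tuple(s * c for s, c in zip(scale, q))
xs = tuple(s * c for s, c in zip(scale, x_star))
assert euclid(ps, xs) > euclid(qs, xs)
```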
Global Optimization Theory 3
Some of these algorithms may be reformulated for a wider class of sets. But even in the case of a hyperrectangle

X = {x ∈ R^n: a_i ≤ x_i ≤ b_i, i = 1,...,n}    (1.1.6)

such a reformulation may be ambiguous and may require a lot of care (as the metric on the cube differs from the metric induced by the corresponding metric on a hyperrectangle after transforming it into the cube). There are classes of algorithms (in particular, random search algorithms) which require only weak structural restrictions on the set X. Below, either the type of X is explicitly indicated or its structure is supposed to be simple enough. In many cases it suffices that X is compact, connected, and is the closure of its interior (the last property guarantees that μ_n(X) > 0 and μ_n(B(x,ε)) > 0 for all x ∈ X, ε > 0, where μ_n is the Lebesgue measure).
Sometimes the set 'X is defined by constraints and has a complicated structure. In such
cases the initial optimization problem is usually reduced to the problem on a set of simple
structure by means of standard techniques worked out in local optimization theory
(namely, penalty functions, projections, convex approximation and conditional directions
techniques). Special algorithms for such problems can also be created, see Sections 6.1
and 2.3.4.
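As a sketch of the penalty-function reduction just mentioned: a problem of minimizing f over a complicated set {x: g_i(x) ≤ 0} is replaced by an unconstrained surrogate f + μ·Σ max(0, g_i)², minimized over a simple enclosing region. The objective, the constraint and the value μ = 100 below are illustrative assumptions, not examples from the text.

```python
def penalized(f, constraints, mu):
    """Quadratic penalty: reduce minimization over {x: g_i(x) <= 0}
    to minimization of a surrogate over a simple enclosing region."""
    def fp(x):
        penalty = sum(max(0.0, g(x)) ** 2 for g in constraints)
        return f(x) + mu * penalty
    return fp

# Example: minimize f on the disc x1^2 + x2^2 <= 1 via an unconstrained surrogate.
f = lambda x: (x[0] - 2.0) ** 2 + x[1] ** 2
g = lambda x: x[0] ** 2 + x[1] ** 2 - 1.0  # g(x) <= 0 defines the disc
fp = penalized(f, [g], mu=100.0)

# Inside the disc the surrogate equals f; outside it is inflated.
assert fp([0.5, 0.0]) == f([0.5, 0.0])
assert fp([2.0, 0.0]) > f([2.0, 0.0])
```

As μ grows, minimizers of the surrogate approach minimizers of the constrained problem; this is the standard trade-off of the penalty approach.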
The degree of complexity of the main optimization problem is partially determined by the properties of the objective function. Typically, one should select a functional class F to which f belongs before selecting or constructing the optimization algorithm. In practice, F is determined by prior information concerning f. In theory, the setting of F corresponds to the choice of a model of f. The wider F is, the wider is the class of admissible practical problems, and the less efficient are the corresponding algorithms.
The widest reasonable functional class F is the class of all measurable functions. It is too wide, however, and thus unsuitable for modelling global optimization problems. In a sense, the same is true for the class F = C(X) of all continuous functions (and for the classes of continuously differentiable functions C¹(X), C²(X), etc. as well): this results from the existence of two continuous (or differentiable) functions whose values coincide at any fixed collection of points but whose minima differ by an arbitrary magnitude. On the other hand, the class of uniextremal functions is too narrow, because the corresponding extremal problems can be successfully treated within the frame of local optimization theory.
Unlike in local optimization, the efficiency of many global optimization algorithms is not much influenced by the possibility and computational demand of evaluating or estimating the gradient ∇f or the Hessian ∇²f of f, since the main aim of a global optimization strategy is to find out the global features of f (while smoothness characterizes the local features only).
Naturally, the computational demand of evaluating the derivatives of f influences the efficiency of a global optimization strategy only insofar as a local descent routine is a part of the strategy, see Section 2.1.
The computational demand of evaluating f is of great significance in constructing global optimization algorithms. If this demand is high, then it is worthwhile to apply various sophisticated procedures at each iteration for obtaining and using information about f. If this demand is small or moderate, then simplicity of programming and the absence of complicated auxiliary computations are characteristics of great importance.
Global optimization problems in the presence of random noise in the evaluations of f are even more complicated. Only a small portion of the known global optimization algorithms may be applied in this case. Such algorithms will be pointed out individually; elsewhere, it will be supposed that f is evaluated without noise.
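A minimal sketch of the difficulty: when evaluations of f are corrupted by zero-mean noise, a single evaluation is unreliable, and an algorithm must, for instance, average repeated evaluations at the same point to shrink the noise variance. The quadratic objective and the noise level below are illustrative assumptions.

```python
import random

def noisy_f(x, rng, sigma=0.5):
    """Hypothetical noisy oracle: the true objective x**2 plus zero-mean noise."""
    return x * x + rng.gauss(0.0, sigma)

def estimate_f(x, n, rng):
    """Average n repeated noisy evaluations; the variance shrinks like 1/n."""
    return sum(noisy_f(x, rng) for _ in range(n)) / n

rng = random.Random(1)
estimate = estimate_f(2.0, 10000, rng)
# With 10000 repetitions the estimate concentrates near the true value f(2.0) = 4.0.
```

The repeated evaluations multiply the computational cost, which is one reason why only a small portion of the algorithms tolerate noise.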
The selection of a global optimization algorithm must also be based on the desired accuracy of the solution. If the required accuracy is high, then a local descent routine has to be used in the last stage of the global optimization algorithm, see Section 2.1.1. Some methods exist (see Section 2.2) in which the accuracy is an algorithm parameter. It should be noted in this connection that, for a fixed structure of an algorithm, the selection of its parameters should be influenced by the desired accuracy. For example, Section 3.3 describes an algorithm whose order of convergence rate is doubled if the parameters of the algorithm are chosen depending on the accuracy.
Concluding this section, let us formulate a general statement of the optimization problem which will be assumed by default unless contrary or additional details are given. Let the compact set X ⊂ R^n (n ≥ 1) have a sufficiently simple structure; furthermore, let a function f: X → R¹, bounded from below and belonging to a certain functional class F, be given. It is then required to approximate (perhaps with a given accuracy) a point x* and the value f* by using a finite number of evaluations of f (without noise).
The role of prior information - concerning the objective function f and the feasible region X ⊂ R^n - in choosing a global optimization algorithm is hard to overestimate. With more information, more efficient algorithms can be constructed. Section 1.1 described the information that is considered as always available; but it does not suffice to construct an algorithm solving an extremal problem with a given accuracy in a finite time. Therefore it is worth taking into account more specific properties of the objective function while constructing global optimization algorithms.
There exist various types of prior information about f determining the functional class F. Usually F is determined by some conditions on f: a number of typical cases are listed below.
a) F ⊂ C(X).
a') F ⊂ C¹(X).
a'') F ⊂ C²(X).
e) f is a rational function.
g) f is concave.
(1.2.1)
(1.2.2)
is valid.
m) The measure of the domains of attraction of all local minimizers is not less than a fixed value γ > 0.
m') The measure of the domain of attraction of a global minimizer is not less than a fixed value γ > 0.
m'') μ_n{A(δ)}/μ_n{X} ≥ γ, where δ and γ are given positive values, μ_n is the Lebesgue measure, and A(δ) is defined by (1.1.3).
n') There exists δ₀ > 0 such that for any δ ∈ (0, δ₀) the set A(δ) is connected and μ_n{A(δ)} > 0.
p) There exist positive numbers ε, β, c₁, c₂ such that the inequalities
numerical characteristics of the algorithms are greatly influenced by the choice of ρ and L, see Section 2.2.
Assumptions a) - j) are often supposed to hold when constructing and investigating deterministic global optimization algorithms, i.e. those which do not contain random elements. These algorithms are briefly described in Sections 2.1, 2.2, 2.3.
No smaller variety of assumptions is encountered in constructing and investigating probabilistic algorithms, which are based on diverse statistical models of the objective function or of the search procedure. Probabilistic models of f ∈ F are the basis of many Bayesian, information-statistical and axiomatically constructed algorithms, described mainly in Section 2.4 and using the conditions b), f), k) - l) and some others. Probabilistic models of the search process correspond to global random search algorithms: their investigation may be closely related to conditions m) - q), k'), d), f), h), h') and some others.
Information concerning these assumptions and the cases in which they are used is contained in Table 1 of the next section and in the following chapters of the book.
This section contains a table classifying the principal approaches and methods of global optimization, as well as some additional explanation.
Certainly, it is impossible to enumerate all global optimization methods in an unambiguous fashion: Table 1 provides one possible classification. Not all of the names used are common: alternative terms for these and related methods may be found in the corresponding sections and references.
The first two columns of Table 1 give the names of approaches and methods (or groups of methods). The first five approaches mostly include deterministic methods. The family of probabilistic methods includes the methods of the last two approaches, together with methods based on smoothing the objective function and on screening of variables, random direction methods, many commonly used versions of multistart and of the candidate points method, as well as some versions of a number of other methods. Let us note again that the classification of approaches and methods in Table 1 is somewhat arbitrary, since some methods (in particular, random covering, random multistart, polygonal line methods and the method based on smoothing the objective function) may be placed in various groups.
The third column of Table 1 gives typical conditions (from among those described in Section 1.2.1) imposed on f for the construction and investigation of the methods. It should be noted that the condition collections for the majority of methods are neither complete nor exact, since different versions of some methods require slightly different assumptions, and the precise conditions ensuring convergence are not known for all methods.
According to Section 1.1, the feasible region X is mostly assumed to be a compact subset of R^n, n ≥ 1, having a nonzero Lebesgue measure and a relatively simple structure, unless additional details are given. The fourth column of Table 1 contains such details required for the realization of the corresponding method.
The fifth column provides the number of the section describing or studying the method; many corresponding references may be found in these sections. Of course, it is impossible to mention all works devoted to global optimization; as intended, those works are referred to that contain much information related to the corresponding subject.
The sixth and seventh columns reflect the state of the theoretical and numerical foundations of the methods. By the theoretical basis of a method we mean the existence of theoretical results on convergence, rate of convergence, optimality, and decision accuracy depending
Zang and Avriel (1975) state that every local minimum of f is a global minimum of f if and only if the point-to-set mapping L_f: R¹ → 2^X is lower semicontinuous at any point y ∈ G. The lower semicontinuity of L_f at a point y ∈ G means that if x ∈ L_f(y), {y_i} ⊂ G, y_i → y (i → ∞), then there exist a natural number K and a sequence {x_i} such that x_i ∈ L_f(y_i) for i ≥ K, and x_i → x (i → ∞).
The next result of Gabrielsen (1986) is closely connected with the preceding one.
Table 1 (cont.). Columns: 1 - approach; 2 - method; 3 - conditions on f; 4 - feasible region; 5 - section; 6 - theoretical basis; 7 - numerical results; 8 - comments.

Based on the use of local search techniques:
- multistart; conditions a) - a^v), h) - h'), m) - m''); Section 2.1.3; theoretical basis elementary; numerical results different for various versions; in the pure form it is used very seldom.
- based on solution of differential equations; conditions a''), a'''); Section 2.1.5; theoretical basis different for various versions; numerical results different for various versions; efficient versions exist for some classes of problems.
- based on smoothing the objective function; conditions o), i), probably q); Section 2.1.6; theoretical basis poor; numerical results few; no results showing high efficiency.

Branch and bound methods:
- based on use of convex minorants and concave majorants; condition e'); hyperrectangle; Section 2.3.4; theoretical basis relatively high; numerical results small; complicated for realization: each objective function requires a separate construction.
- interval methods; conditions e'), probably q); hyperrectangle; Section 2.3.4; theoretical basis relatively high; numerical results different for various versions; complicated realization in the multidimensional case.
- concave minimization on a convex set; conditions d''), g), g'), i''); convex polyhedron; Section 2.3.4; theoretical basis relatively high; numerical results different for various versions; the feasible region may have a high dimension.

Based on dimension reduction:
- coordinate-wise minimization; conditions a), b); Section 2.3.2; theoretical basis poor; numerical results small; analogous to the corresponding local minimization algorithm.
- random directions; conditions b), d'); Section 2.3.2; theoretical basis poor; numerical results sufficient; inefficient except in the case d') with p ≤ 4.
- multistep dimension reduction; conditions b), j), j'); Section 2.3.2; theoretical basis relatively high; numerical results different for various versions; efficient for some particular cases (for instance, case j)).
- based on use of Peano curve type mappings; condition b); X = [0,1]^n; Section 2.3.2; theoretical basis relatively high; numerical results different for various versions; no results showing high efficiency in the case n ≥ 3.
- based on screening of variables; conditions i'), a); hyperrectangle; Section 2.3.2; theoretical basis poor; numerical results small; there are examples of efficient solution of complicated problems.

Based on approximation and integral representations:
- based on approximation of the objective function; conditions f), a^v); Section 2.3.3; theoretical basis relatively high; numerical results sufficient; some versions are optimal, but numerical results indicate poor efficiency.
- based on integral representation; conditions m), n), o), p); Section 2.3.3; theoretical basis different for various versions; numerical results small; no results showing high efficiency.

Global random search:
- random multistart; conditions a) - a^v), h) - h'), m) - m''); Sections 2.1.3, 4.5; theoretical basis relatively high; numerical results sufficient; the main version of multistart.
- branch and probability bound; conditions o), p), p'), sometimes d), k), k'), n), m''), q); Section 4.3; theoretical basis relatively high; numerical results sufficient; there are a number of variants, some of them proved to be highly efficient.
- generation methods; conditions n'), o), q) and others; Chapter 5; theoretical basis different for various versions; numerical results sufficient; inefficient for simple problems, but promising for complicated multidimensional problems.
criterion for verifying the uniextremality of the sum of squared deviations (which determines the least squares estimators) and of other functions arising in the identification of nonlinear stochastic models. The last mentioned work also generalized the above criterion to the case where the set X = {x ∈ R^n: f(x) < c} is not connected, and investigated the property of local uniextremality, i.e. the uniextremality of f on the connected parts of X.
f(x) = − Σ_{i=1}^{m} ( ||x − a_i||² + c_i )⁻¹    (1.3.1)

for n = 4 and m = 5, where a_i and c_i (i = 1,...,m) are fixed vectors and numbers.
x_1 ∈ [−1, 1] \ {0},    (1.3.2)

which has a countable infinity of local minimizers. Work on standardizing classes of global optimization test functions is being carried out by other researchers, so we shall not pursue the discussion of this topic further.
The third reason why numerical comparisons are of doubtful value is the existence of various efficiency criteria used in global optimization. The most common efficiency criterion is the time (expressed in standard units) required to reach a given solution accuracy. Sometimes the number of objective function evaluations is used instead of time. Reliability is a second important criterion, which is very difficult to estimate in complicated setups. The simplicity of programming and the computer memory required are some other quality criteria for global optimization algorithms.
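The evaluation-count criterion is easy to instrument: wrapping the objective counts its calls, and a run can be charged the number of evaluations needed to reach a prescribed accuracy. Everything named below (the wrapper class, the target accuracy, the test objective) is an illustrative assumption.

```python
import random

class CountingObjective:
    """Wrap an objective to record the number of evaluations, the usual
    machine-independent cost measure for comparing algorithms."""
    def __init__(self, f):
        self.f, self.calls = f, 0
    def __call__(self, x):
        self.calls += 1
        return self.f(x)

def random_search_until(f, bounds, target, max_evals, seed=0):
    """Sample uniformly until the record value reaches the target accuracy;
    report the number of evaluations used and the record value."""
    rng = random.Random(seed)
    best = float("inf")
    for k in range(1, max_evals + 1):
        x = [rng.uniform(a, b) for a, b in bounds]
        best = min(best, f(x))
        if best <= target:
            return k, best
    return max_evals, best

obj = CountingObjective(lambda x: sum(xi * xi for xi in x))  # f* = 0 at the origin
n_used, best = random_search_until(obj, [(-1.0, 1.0)] * 2, target=0.01, max_evals=100000)
```

Averaging n_used over repeated runs with different seeds gives an empirical estimate of the expected cost, one of the quantities a numerical comparison would report.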
It follows from the above that numerical comparison studies per se are hardly able to measure the efficiency of global optimization algorithms appropriately, and cannot satisfy all requirements of the users. In many cases, a rigorous theoretical study gives more information on efficiency. Let us indicate below several efficiency criteria for the theoretical comparison of global optimization algorithms.
(ii) Speed of convergence. It is studied only for some groups of algorithms and serves mainly for comparison of algorithms within these groups, since the variety of types of convergence speed and of methods for their estimation is great (much greater than that of the convergence domains).
(1.3.3)

ε(d^N) = sup_{f∈F} ε(f, d^N).    (1.3.4)
If it is possible to determine a measure λ(df) on the functional class 𝓕, then one can use
the inaccuracy

ε̄(d_N) = ∫_𝓕 ε(f, d_N) λ(df) (1.3.5)

instead of (1.3.4). For the probabilistic algorithms of Section 2.4, the mean value E{ε(d_N)}
is usually used for measuring inaccuracy, replacing ε(d_N) defined by (1.3.4) or (1.3.5).
Inaccuracies of type (1.3.4) or (1.3.5) are important efficiency characteristics of global
optimization algorithms, but their range of utility is very restricted, since they can be evaluated
only for a small class of methods d_N and functional classes 𝓕.
(iv) Optimality. According to the ordinary definition of optimality (in the minimax sense),
an algorithm is called optimal if it minimizes the inaccuracy (1.3.4) for a fixed N, or the
number N for a fixed value of ε(d_N). For some functional classes 𝓕 (of the type
𝓕 = Lip(𝒳, L, ρ)) optimal methods exist and can be constructed. Various numerical results
and theoretical studies show, however, that the above optimality property gives almost
nothing from the practical point of view. This is connected with the fact that the minimax-
optimal method gives good results for the worst function from a given class 𝓕, but for an
ordinary one it may be much worse than some other algorithm.
The use of the inaccuracy measure (1.3.5) leads to the Bayesian concept of optimality.
Some other concepts of optimality (stepwise, Bayesian stepwise, asymptotic,
asymptotic by order, etc.) also exist, but the observation that theoretical optimality of a
particular method does not guarantee its practical efficiency is again valid.
Many works in global optimization theory are devoted to the problem of the optimal
choice of algorithm parameters or their components: in essence, these works consider the
optimality of parameters over narrow classes of algorithms.
All the above leads to the conclusion that even if a method may be theoretically
investigated, the comparison of its efficiency to the efficiency of a method belonging to
another approach is complicated and hardly formalizable, because of the great variety of
algorithm characteristics and optimization circumstances. Further problems arise when
heuristic methods are investigated; see Zanakis and Evans (1981).
In conclusion, while studying the efficiency of a global optimization method, it is
worthwhile to use a composite approach including the determination of a class of
optimization problems that is being solved by the method, the investigation of theoretical
properties of the method, and its numerical study. We shall mainly be concerned with the
theoretical study in this book.
It is hard to find branches of science or engineering that do not induce global optimization
problems. Many such problems arise e.g. in the following fields: optimal design,
construction, identification, location, pattern recognition, control, experimental design,
etc. Instead of detailed description of the corresponding classes of optimization problems
or particular ways of their solution we refer to a number of papers in the Journal of
Optimization Theory and Applications, and also to Dixon and Szegő, eds. (1975, 1978),
Batishev (1975), Mockus (1967), Zilinskas (1986), Zhigljavsky (1987) and make a few
additional comments.
Experimental design theory (see Ermakov et al. (1983), Ermakov and Zhigljavsky
(1987)) leads to a wide class of complicated multiextremal problems, some of which are
occasionally regarded as test problems. For instance, Hartley and Rund (1969), Ermakov and
Mitioglova (1977), and Zhigljavsky (1985) used the function
f(x) = det ‖ Σ_{i=1}^{6} … ‖ (1.3.6)
(1.3.7)
on the square (x, y) ∈ [1,6]×[1,6]. The function (1.3.6) is complicated to optimize, since it
has several thousand local maximizers and a great number of global maximizers,
and its relatively high dimension eliminates the possibility of using most of the
deterministic methods. Note that one can use the feasible region
deterministic methods. Note that one can use the feasible region
instead of (1.3.7); furthermore, test problems similar to (1.3.6)-(1.3.7) were treated
in Bates (1983), Bohachevsky et al. (1986), and Haines (1987).
The more powerful contemporary computers become, the more relevant becomes
optimization in simulation models. These are problems of optimizing functions subjected
to a random noise that can be controlled. Closely related are the optimization problems of
mathematical models in which evaluating the objective function requires solving a
differential or integral equation. Their main peculiarity is that the noise, which can
likewise be controlled, is nonrandom.
According to the classification of Section 1.2.2, this chapter describes some principal
approaches and methods of global optimization, except the global random search methods,
to which Part 2 is devoted. We shall consider the minimization of the objective function f
(belonging to a given functional set 𝓕 and evaluated without noise) on the feasible region
𝒳 (a compact subset of ℝⁿ having a sufficiently simple structure).
Section 2.1 deals with global optimization algorithms based on the use of local
methods. They are widespread in practice, but many of them are not theoretically
investigated to a considerable extent.
Section 2.2 is devoted to the covering methods including passive grid searches. The
situation here is opposite to the above: the covering methods are thoroughly studied
theoretically, but their practical importance is not great.
Section 2.3 considers one-dimensional optimization algorithms, reduction and
partition techniques. The most attractive of them are the branch and bound methods.
Under various (generally, substantial) prior information about f they combine the practical
efficiency with theoretical elegance.
Section 2.4 treats the approach based on stochastic and axiomatic models of the
objective function. Here the balance of the theoretical and practical significance of the
algorithms does not hold: most of them are cumbersome and time-consuming when
implemented.
Global Optimization Methods 21
x_{k+1} = x_k + γ_k s_k, k = 1, 2, …

where x₁ ∈ 𝒳 is an initial point, s_k is the search direction, and γ_k ≥ 0 is the step-length.
Local minimization algorithms differ in the way of constructing {γ_k} and {s_k}; this
usually results in descent (or relaxation) algorithms, for which the inequalities
f(x_{k+1}) ≤ f(x_k) hold for all k = 1, 2, …. To this end, it is necessary to find {s_k} such that
s_kᵀ∇f(x_k) < 0 for each k = 1, 2, … and to choose γ_k in a suitable way. Two such ways of
selecting γ_k are the most well known. The first one is to choose
γ_k = arg min_{γ≥0} f_k(γ), (2.1.1)

where f_k(γ) = f(x_k + γs_k) is the one-dimensional function determined by f, x_k and s_k. This
way is primarily of theoretical significance.
The second way is as follows. Let γ₁ > 0 and β > 1 be some real values, γ_{k−1} be the
preceding step-length, and s_k be the search direction at x_k. If the inequality
(2.1.2)
22 Chapter 2
where
The direct search algorithms usually apply the bisection method for the determination of step-
lengths. At the k-th iteration of many search algorithms, that one of the two directions +e or
−e is taken as s_k for which f decreases. In the coordinate-wise search algorithms, e belongs
to the collection {e₁,…,e_n} of the unit coordinate vectors. The cyclic coordinate-wise
descent selects e_i sequentially, while the random coordinate-wise search does it at random.
The random search algorithm with uniform trial draws e as a realization of a random
vector uniformly distributed on the unit sphere S = {x ∈ ℝⁿ: ‖x‖ = 1}. In random
searches with learning, e is a realization of a random vector whose distribution depends on
the previous evaluations of f.
the previous evaluations of f. The random m-gradient method (1 <m<n) selects
m
Sk = - i:1 q £f (x k + aq) - f (x k) ] / a
where a is a small positive value and ql, ... ,qm are orthonormalized vectors constructed
by means of the orthogonalization procedure from m independent realizations of random
vectors uniformly distributed on S.
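As an illustration (not from the original text), the random search with uniform trial described above can be sketched in Python; the objective function, step length, iteration count and seed are all illustrative choices, and a normalized Gaussian vector is used as the standard way to sample a direction uniformly on the unit sphere.

```python
import numpy as np

def random_search_uniform_trial(f, x, gamma=0.1, iters=200, rng=None):
    """Random search with uniform trial: draw a direction e uniformly on the
    unit sphere and take the step gamma*e or -gamma*e for which f decreases."""
    rng = np.random.default_rng(rng)
    for _ in range(iters):
        e = rng.normal(size=x.size)
        e /= np.linalg.norm(e)          # uniform direction on the unit sphere
        for s in (e, -e):               # try both directions, keep a descent step
            if f(x + gamma * s) < f(x):
                x = x + gamma * s
                break
    return x

# usage: minimize a simple quadratic starting from (2, -1.5)
x = random_search_uniform_trial(lambda v: float(np.sum(v ** 2)),
                                np.array([2.0, -1.5]), rng=0)
```

Since a step is accepted only when it decreases f, the method is a (randomized) descent algorithm in the sense above.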
A thorough description and investigation of local optimization techniques can be found
in many textbooks, see, for instance, Avriel (1976), Dennis and Schnabel (1983) or
Fletcher (1980). Among others Fedorov (1979), Demyanov and Vasil'ev (1985),
Mikhalevitch et al. (1987) treated the nondifferentiable case. It should be noted that the
local optimization routines available in contemporary software packages are suitable for
most practical needs.
Ordinarily, local algorithms are aimed at locating a global optimizer x* and refining
the approximation of f* = f(x*). To do this, one should apply a local descent
routine starting from an initial point x(o) that approximates x*, as obtained by a
global optimization method. Roughly speaking, in that case the aim of the global stage is
the localization of x*, and the aim of the local one is finding the precise location of x*.
Applying the above approach, it is necessary to assume that the point x(o) belongs to a
neighbourhood of x* in which f is sufficiently smooth and has only one local minimum.
The efficiency of a considerable number of global minimization methods depends on the
closeness of the values f_k* and f* for all k ≥ 1. In these methods it is natural to make some
iterations of a local descent immediately after obtaining a new record point.
Occasionally, if the iterations of a local descent are cheap to compute, then it is
advantageous to start them from several points obtained at the global stage. This problem
for random optimization techniques is discussed later in Section 3.2.
Note that if a local optimization algorithm is part of a global optimization strategy, then
it should be supposed that the objective function satisfies some local conditions
besides the global ones. Moreover, it can be concluded from the cited properties of local
optimization algorithms (see Section 2.1.1) that it is especially advantageous to include
them in a global optimization strategy if the derivatives of the objective function may be
evaluated without much effort.
2.1.3 Multistart
The technique under consideration is concerned with the method recently called multistart,
which has historically been the first and, for a long time, the only widely used method of
global optimization. This method consists of multiple (successive or simultaneous)
searches for local extrema starting at different initial points. The initial points are frequently
chosen as elements of a uniform grid (such grids are described later in Section 2.2.1).
To use multistart, one should make assumptions about the number of local optimizers of
the objective function and also about its smoothness.
The main difficulty in the practical use of multistart is the following: to obtain a global
optimizer with high reliability, one should choose the number of initial points to be much
greater than the number of local optimizers (which is usually unknown). Thus the main part
of the computational effort would be spent on attaining local extrema repeatedly. If it is
supposed that the local optimizers of the objective function are rather far from each other (a
heuristic assumption), then the basic version of multistart is often modified in one of the two
ways described below.
The first way is to surround every evaluated local minimizer by a neighbourhood
with the following property: if a search point attains the neighbourhood, then it
moves into the local minimizer. The proper use of the corresponding algorithm is possible
only if one is able to choose neighbourhoods which either are subsets of the
corresponding domains of attraction of the local minimizers or surely do not contain a
global optimizer. In the latter case the global minimization algorithm is a covering method
and is studied in Section 2.2.
The second way is called the candidate points method and consists in simultaneous
local descents from many initial points and the joining of neighbouring points. The joining
action is equivalent to substituting such points by one of them (the one whose objective
function value is the smallest). The problem of joining belongs to cluster analysis; thus any
clustering method may be used for this purpose. One of the simplest and most popular methods of
joining is the so-called nearest neighbour method, based on the heuristic supposition that
the distance ρ_ij = ρ(z_i, z_j) (in a metric ρ) between points z_i and z_j belonging to a
neighbourhood of one local minimizer is less than the distance between points belonging
to the neighbourhoods of different minimizers.
Algorithm 2.1.1 (nearest neighbour method).

1. Let K points z₁,…,z_K be given. Assume that all of them belong to different
clusters, i.e. the number of clusters equals K (here the word cluster is associated
with a neighbourhood of a local minimizer).
2. Find the nearest pair of points (z_i, z_j), i.e. a pair for which

ρ_ij = min_{k,l=1,…,K; k≠l} ρ_kl.
3. If the distance ρ_ij between the nearest neighbours z_i and z_j does not exceed a
certain small δ > 0, then the points z_i and z_j are joined and the corresponding
clusters are unified; hence the number of clusters K decreases by 1 (i.e. K−1 is
substituted for K).
4. If the distance between the nearest neighbours exceeds δ, or K = 1 (only one cluster
remains), then the algorithm stops. Otherwise go to Step 2.
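As an illustration (not from the original text), Algorithm 2.1.1 can be sketched in Python; the cluster labels, the example points and the threshold value are illustrative.

```python
import numpy as np

def nearest_neighbour_clusters(points, delta):
    """Algorithm 2.1.1: start with each point in its own cluster and repeatedly
    merge the clusters of the nearest pair of points lying in different
    clusters, while that distance does not exceed delta."""
    pts = np.asarray(points, dtype=float)
    K = len(pts)
    labels = list(range(K))                 # cluster label of each point
    while len(set(labels)) > 1:
        best, pair = None, None             # Step 2: nearest cross-cluster pair
        for i in range(K):
            for j in range(i + 1, K):
                if labels[i] != labels[j]:
                    d = np.linalg.norm(pts[i] - pts[j])
                    if best is None or d < best:
                        best, pair = d, (i, j)
        if best > delta:                    # Step 4: stop
            break
        li, lj = labels[pair[0]], labels[pair[1]]
        labels = [li if l == lj else l for l in labels]  # Step 3: unify clusters
    return labels

# usage: two well-separated pairs of points give two clusters
labels = nearest_neighbour_clusters(
    [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]], delta=1.0)
```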
Algorithm 2.1.2 (candidate points method).

1. Obtain N points x₁,…,x_N by generating them uniformly on X (i.e. x₁,…,x_N are the
elements of a uniform random sample).
2. By performing several iterations of a local descent routine from the initial points
x₁,…,x_N, obtain points z₁,…,z_N (the number of local descent iterations from the
points x_i may depend on the values of the objective function f(x_i), i = 1,…,N).
3. Apply a cluster analysis method (for instance, Algorithm 2.1.1) to the points
z₁,…,z_N. Let m be the number of clusters obtained.
4. Select representatives x₁,…,x_m from the clusters (a natural selection criterion is
the objective function value). If m = 1, or the number of local descent iterations
exceeds a fixed number, then go to Step 5. Otherwise put N = m and go to Step 2.
5. Suppose we are in the neighbourhood of a global minimizer. If necessary,
apply a local optimization routine which has a high speed of convergence in
the neighbourhood of an extremal point.
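The steps above can be sketched as follows (not from the original text); the finite-difference gradient steps, the crude one-representative-per-cluster rule, and all parameter values are illustrative stand-ins for the local descent and cluster analysis routines the algorithm leaves open.

```python
import numpy as np

def local_steps(f, x, n_steps=30, gamma=0.05, h=1e-5):
    """A few gradient descent iterations with a finite-difference gradient
    (stands in for the partial local descents of Step 2)."""
    for _ in range(n_steps):
        g = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h)
                      for e in np.eye(x.size)])
        x = x - gamma * g
    return x

def candidate_points(f, bounds, N=30, rounds=3, delta=0.3, rng=0):
    rng = np.random.default_rng(rng)
    lo, hi = np.asarray(bounds[0]), np.asarray(bounds[1])
    xs = rng.uniform(lo, hi, size=(N, lo.size))          # Step 1: uniform sample on X
    for _ in range(rounds):
        zs = np.array([local_steps(f, x) for x in xs])   # Step 2: partial descents
        reps = []                                        # Steps 3-4: crude clustering,
        for z in zs:                                     # one representative per cluster
            if all(np.linalg.norm(z - r) > delta for r in reps):
                reps.append(z)
        xs = np.array(reps)
        if len(xs) == 1:                                 # m = 1: go to Step 5
            break
    return min(xs, key=f)                                # Step 5: best representative

# usage on a simple quadratic with minimizer (0.5, 0.5)
x_best = candidate_points(lambda v: float(np.sum((v - 0.5) ** 2)),
                          ([-1.0, -1.0], [1.0, 1.0]))
```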
Algorithm 2.1.2 is heuristic; its efficiency depends on the choice of its parameters and
auxiliary methods. The number N must be significantly greater than the expected number of
local minima of the objective function. The usual choice is Algorithm 2.1.1 as the cluster
analysis method, together with a local optimization routine available in computer software.
Note that quasirandom grids have almost the same advantages as random ones for
selecting the initial points in candidate points methods.
Algorithm 2.1.2 and some similar methods (see Törn (1978), Spircu (1978), Batishev
and Lubomirov (1985)) are appealing for some complicated multiextremal
problems. The well-known method of Boender et al. (1982) is also based on the above
principles and proved to be efficient for solving a number of multivariate global
optimization problems. Its recent version goes under the name of the multi-level single
linkage method and is summarized as follows.
Algorithm 2.1.3 (Multi-level single linkage method)
At each k-th iteration, the following selection procedure is applied at Step 3 of Algorithm 2.1.3.
First, reduce the sample Ξ by choosing the γkN points of Ξ with the lowest function values,
where γ is a fixed number in (0,1]. Each chosen point x is selected as an initial point for a
local descent if it has not been used as an initial point at a previous iteration, and if there is
no neighbouring sample point z ∈ Ξ ∩ B(x, η_k) with a lower function value, where the critical
distance η_k is given by
(2.1.3)
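As an illustration (not from the original text), the selection rule can be sketched in Python; the critical distance η_k from (2.1.3) is passed in as a given parameter here, since its formula is not reproduced above, and all names and example values are illustrative.

```python
import numpy as np

def mlsl_select(sample, fvals, gamma, eta_k, used_before):
    """Multi-level single linkage selection: keep the gamma*N sample points
    with the lowest function values, then select a point as a descent start
    unless it was used before or has a sample neighbour within distance
    eta_k with a lower function value."""
    order = np.argsort(fvals)
    keep = order[:max(1, int(gamma * len(sample)))]   # reduced sample
    starts = []
    for i in keep:
        if any(np.array_equal(sample[i], u) for u in used_before):
            continue                                   # already used earlier
        better_neighbour = any(
            j != i
            and fvals[j] < fvals[i]
            and np.linalg.norm(sample[i] - sample[j]) <= eta_k
            for j in range(len(sample)))
        if not better_neighbour:
            starts.append(sample[i])
    return starts

# usage: two close points (only the better one is kept) plus one isolated point
sample = np.array([[0.0, 0.0], [0.05, 0.0], [2.0, 2.0]])
fvals = np.array([1.0, 0.5, 2.0])
starts = mlsl_select(sample, fvals, gamma=1.0, eta_k=0.1, used_before=[])
```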
after obtaining a record point x*, and then descending from the point x(o).
A tunneling algorithm consists of two stages: a minimization stage and a tunneling
one. These stages are used sequentially. In the minimization stage, a local descent routine
is used for finding a local minimizer x* of the objective function. In the tunneling stage,
the so-called tunneling function T(x) (it has also been called a penalty or filled function) is
determined. This function attains a maximum (possibly a local maximum) at x*, has
continuous first derivatives at all points (except possibly x*), and depends on f, x*, and a
number of parameters that are automatically chosen by the algorithm. After constructing
the tunneling function, a point x(o) from the set (2.1.3) is sought at the tunneling stage.
Then one proceeds to the minimization stage and finds a local minimizer of f starting at
x(o). The obtained local minimizer is a new record point and the above iteration may be
repeated. The search stops if it does not succeed in finding a point x(o) ∈ U(x*) in a given
number of algorithmic steps.
Vilkov et al. (1975) and Levy and Montalvo (1985) determined the tunneling function by

T(x) = (f(x) − f(x*)) / [(x − x*)ᵀ(x − x*)]^α (2.1.4)

where α > 0 is a fixed parameter. If there are several record points x*(1), x*(2), … with the
same objective function value, then Levy and Montalvo (1985) proposed to construct the
tunneling function by
(2.1.5)
(2.1.6)
Ge and Qin (1987) modified (2.1.5) in different ways and presented five new tunneling
functions, viz.,
−{β² log[η + f(x)] + ‖x − x*‖^p}, p = 1, 2,
−(f(x) − f(x*)) exp{β² ‖x − x*‖^p}, p = 1, 2.
Finding a point x(o) ∈ U(x*) in the tunneling stage requires performing a local minimization of
the tunneling function starting at an initial point x₀ belonging to a rather small
neighbourhood of the record point x*. If the local minimizer of T does not belong to the
set (2.1.3), or the search trajectory leaves 𝒳 without attaining U(x*), then it is advisable to
change the values of the parameters (the parameter α in the case of (2.1.4), and the parameters η, β
in the case of (2.1.5)) and to descend to a local minimizer of the modified function T. If after
several changes of the parameter values a point from the set U(x*) is not found, then it is advisable
to perform the same actions using another initial point x₀. For example, it is natural to select
such points sequentially from the collection of points
D(x*) is the radius of the set B(x*), i.e. the shortest distance from x* to the boundary of
B(x*), and D is the smallest radius of all subsets B(x*), i.e.
(2.1.8)
hold, then the function (2.1.5) cannot have any stationary point in the set
(2.1.9)
Proposition 2.1.2. If (2.1.6) holds and the ratio β²/(η + f(x*)) is small enough to
assure the inequality
(2.1.10)
then the function (2.1.5) has no local minimizers in the interior of 𝒳.
Moreover, it is evident that if (2.1.6) holds, then x* is a local maximizer of the function
(2.1.5).
Proposition 2.1.1 shows that if the ratio β²/(η + f(x*)) is small enough (satisfies the
inequality (2.1.8)), then the function (2.1.5) has no stationary points in the set (2.1.9).
Hence, any local descent algorithm applied to the function (2.1.5) and starting from a
neighbourhood of x* should either arrive at the boundary of 𝒳 or reach the set U(x*).
On the other hand, Proposition 2.1.2 yields that if this ratio is too small (satisfies the
inequality (2.1.10)), then any local descent algorithm applied to the tunneling function
(2.1.5) should arrive at the boundary of 𝒳. Hence, in this case the aim of the tunneling
stage cannot be reached and further search for a global minimizer of f would be
impossible.
The main difficulty in constructing a suitable version of the tunneling algorithm is
connected with the problem of choosing its parameters η, β (or, respectively, α and α_i).
Under a favourable choice of them, the convergence of the algorithm may be ensured. Let us
now formulate the basic version of the Ge algorithm.
Algorithm 2.1.4 (Ge (1983)).
1. Find a local minimizer x* of the function f starting at an arbitrary point x₁; set k = 1.
2. Form the tunneling function T by (2.1.5) (for the modified tunneling functions
presented above the algorithm is analogous, see Ge and Qin (1987)).
4. Use a local descent routine for the function f, starting at the point obtained at Step 3. Let
z* be the obtained local minimizer of f.
In Algorithm 2.1.4, the choice of the number K (defining the stopping criterion) and the
change rules for the values of the parameters η, β are heuristic, but they considerably
influence the efficiency of the algorithm.
Some more information about implementations of the tunneling method may be found in Levy
and Montalvo (1985) and Ge and Qin (1987). The analysis of numerical results shows that
the tunneling method cannot be regarded as a very efficient one (in particular, it does not
always succeed in finding the global minimizer). The main difficulties in realizing the
tunneling method are the following: under certain values of the parameters (α in (2.1.4) and
η, β in (2.1.5)) the tunneling function is flat and close to zero in a considerable
part of X; local minimization of a tunneling function needs to be performed carefully in
order not to pass over a minimizer; trajectories of a local optimization of T often arrive at the
boundary of X; and the termination of the search is problematic (in essence, the stopping
problem is equivalent to the main optimization problem). Nevertheless, the tunneling
methods have been created only recently, and progress in increasing their
numerical efficiency is quite probable.
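As an illustration (not from the original text), here is a toy Python sketch of the two-stage tunneling idea, using a tunneling function of the (2.1.4) form; the crude coordinate descent, the bimodal test function, and the fixed collection of trial offsets for the tunneling starts are all illustrative assumptions, not the book's method.

```python
import numpy as np

def coord_descent(f, x, step=0.05, iters=400):
    """Crude coordinate descent with step halving, standing in for the
    local descent routine used in both stages."""
    x = np.asarray(x, float)
    for _ in range(iters):
        moved = False
        for e in np.eye(x.size):
            for s in (step * e, -step * e):
                if f(x + s) < f(x):
                    x = x + s
                    moved = True
        if not moved:
            step *= 0.5
            if step < 1e-4:
                break
    return x

def tunneling_minimize(f, x0, alpha=1.5,
                       offsets=((1.5, 0.0), (-1.5, 0.0), (0.0, 1.5), (0.0, -1.5))):
    x_star = coord_descent(f, x0)                 # minimization stage
    for off in offsets:                           # sequential tunneling starts
        # tunneling stage: T of the (2.1.4) form has a pole at the record
        # point and is negative exactly where f(x) < f(x_star)
        T = lambda x: (f(x) - f(x_star)) / (
            np.dot(x - x_star, x - x_star) ** alpha + 1e-12)
        z = coord_descent(T, x_star + np.asarray(off))
        if f(z) < f(x_star):                      # entered a lower level set
            x_star = coord_descent(f, z)          # minimization stage again
    return x_star

# usage on a bimodal function: local minimum near x0 = +1, global near x0 = -1
f = lambda x: (x[0] ** 2 - 1.0) ** 2 + 0.3 * x[0] + x[1] ** 2
x_best = tunneling_minimize(f, [1.2, 0.2])
```

Some tunneling starts fail (the descent on T stalls at a positive stationary point or drifts towards the boundary), which is exactly the practical difficulty mentioned above; a successful start reaches the set where T < 0 and triggers a new minimization stage.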
We shall suppose that the objective function f is sufficiently smooth (as a rule, twice
continuously differentiable).
A simple transition algorithm may be constructed by means of alternating descents to
local minimizers with ascents to local maximizers. As the initial direction of an ascent
(descent), it is natural in this algorithm to use the last direction of the preceding descent
(respectively, ascent). Its disadvantage consists in the possibility of cycling through a
collection of local minimizers and failing to reach a global minimizer. Another disadvantage
is that many evaluations of the objective function and its derivatives are wasted on
investigating nonprospective subsets (in particular, on ascents to maximizers). The former
disadvantage may be removed by surrounding minimizers with ellipsoids in order
to prevent wasted descents (Treccani et al. (1972)). But already in the bivariate case the
above algorithm is rather complicated to implement (see Codes (1975)), and in the case of
higher dimensions the possibility of its efficient realization is indeed problematic.
Many global minimization methods of this section originate from investigating
properties of solutions of various differential equations. One of the first such attempts is
the heavy ball algorithm (Pshenichnij and Marchenko (1967)), according to which the
search trajectory coincides with the trajectory of a ball moving on the surface generated by
the objective function. The globality of the heavy ball algorithm is associated with the fact
that a ball moving by inertia may pass over flat hollows (but may stop in one of them).
The search trajectories for a general class of algorithms, including the heavy ball
algorithm and the algorithms from Zhidkov and Schedrin (1968), Incerti et al. (1979), and
Griewank (1981), are discrete approximations of solutions of second order ordinary
differential equations having the form

μ(t) x''(t) + ν(t) x'(t) + ∇f(x(t)) = 0, (2.1.11)

where μ(t) and ν(t) are functions of time t. According to classical mechanics, equation
(2.1.11) represents Newton's law for a particle of mass μ(t) in a potential f subject to a
dissipative force −ν(t)x'(t). At the initial time t = t₀, the particle is at a point x₀ and has a
motion direction z₀. Under suitable assumptions on the functions f, μ, ν, any trajectory
converges to a local minimizer of f, tending to pass over flat minima.
Let c be a certain upper bound for f* = min f (i.e. c ≥ f*). Griewank (1981) showed
that the search trajectory obtained from (2.1.11) under an appropriate choice of μ and ν
cannot converge to a local minimizer with objective function value greater than c; further,
that a point x(o) ∈ 𝒳 will be reached with the value f(x(o)) ≤ c. The latter algorithm seems
to be the most promising of the mentioned group.
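As an illustration (not from the original text), a simple explicit discretization of the dynamics (2.1.11) with constant mass and friction coefficients can be sketched as follows; the coefficient values, step size and iteration count are illustrative.

```python
import numpy as np

def heavy_ball(grad, x0, mu=1.0, nu=0.3, dt=0.05, iters=2000):
    """Semi-implicit Euler discretization of mu*x'' + nu*x' + grad f(x) = 0:
    the trajectory of a heavy ball rolling on the graph of f with friction."""
    x = np.asarray(x0, float)
    v = np.zeros_like(x)                  # initial velocity
    for _ in range(iters):
        a = -(grad(x) + nu * v) / mu      # Newton's law with dissipative force
        v = v + dt * a
        x = x + dt * v
    return x

# usage: minimize f(x) = sum(x**2), whose gradient is 2x; inertia carries the
# ball through shallow regions while friction eventually stops it at a minimizer
x_min = heavy_ball(lambda x: 2 * x, np.array([3.0, -2.0]))
```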
Let us consider another class of global optimization algorithms based on solving
differential equations. The main representative of it is the method developed by Branin
(1972) and Branin and Ho (1972).
32 Chapter 2
Let f be twice continuously differentiable, and assume that not only the values f(x) can
be evaluated but also the values of the gradient g(x) = ∇f(x) and the Hessian H(x) = ∇²f(x)
for all x ∈ 𝒳. Consider the system of simultaneous differential equations

dg(x(t))/dt = s g(x(t)) (2.1.12)

subject to the initial condition g(x(0)) = g₀, where s is a constant taking either the value +1
or −1. The solution of the system (2.1.12) has the form g(x(t)) = g₀e^{st}, and for s = 1, t → −∞
or s = −1, t → ∞ it converges to g = 0. This means that the trajectory corresponding to a
solution tends to a stationary point of f. The Branin method consists of sequentially solving
(2.1.12) (in order to attain a stationary point of f) and alternating the sign of s (in order to
pass from one stationary point to another). To solve the system (2.1.12), rewrite it as follows:
x'(t) = s H⁻¹(x(t)) g(x(t)). (2.1.13)
Let A(x) be the adjoint matrix of H(x), which determines the inverse of the Hessian by the
formula H⁻¹(x) = A(x)/det H(x). Then (2.1.13) may be replaced by the system

x'(t) = s A(x(t)) g(x(t)). (2.1.14)

The latter is defined for all x. It is obtained from (2.1.13) by rescaling time where
det H(x) ≠ 0 and by reversing time when passing through points where det H(x) vanishes.
If (2.1.14) is used, then the sign of the constant s should be alternated when attaining a
stationary point, as well as at a point where det H(x) vanishes.
An ordinary method of numerically solving these first order differential equations is
based on the use of the discrete approximation

x_{k+1} = x_k + γ_k ẋ(x_k), k = 0, 1, …, (2.1.15)

where ẋ = dx/dt, x₀ is an arbitrary point from 𝒳, and γ₀, γ₁, … is a certain sequence of
nonnegative numbers.
Applying (2.1.15) to solve (2.1.13) we obtain

x_{k+1} = x_k + s γ_k H⁻¹(x_k) g(x_k), k = 0, 1, …,

that is, the Newton method with a variable step length γ_k. The system (2.1.14) may be
solved analogously.
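As an illustration (not from the original text), the discretized trajectory of (2.1.13) can be sketched as the Newton iteration with a sign s; the quadratic test function, the constant step length, and the stopping tolerance are illustrative.

```python
import numpy as np

def branin_trajectory(g, H, x0, s=-1, gamma=0.1, iters=200, tol=1e-8):
    """Newton iteration x_{k+1} = x_k + s*gamma*H^{-1}(x_k) g(x_k):
    with s = -1 the trajectory is attracted to a stationary point (g = 0);
    switching the sign of s drives it away from that point again."""
    x = np.asarray(x0, float)
    for _ in range(iters):
        x = x + s * gamma * np.linalg.solve(H(x), g(x))
        if np.linalg.norm(g(x)) < tol:   # stationary point (approximately) reached
            break
    return x

# usage on f(x) = x0^2 + 2*x1^2, whose gradient and Hessian are explicit
g = lambda x: np.array([2 * x[0], 4 * x[1]])
H = lambda x: np.diag([2.0, 4.0])
x_stat = branin_trajectory(g, H, [1.0, 1.0])
```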
studied a function f whose contours are topologically equivalent to spheres and for which
the Branin method of minimization may not converge. Anderson and Walsh (1986) proposed
a simply realized version of the Branin method for the minimization of two-dimensional
functions of a special kind.
Branin's idea of changing the system of simultaneous differential equations after
reaching a stationary point is used in Yamashita (1976) to solve a minimization problem
on a set

𝒳 = {x ∈ ℝⁿ: h(x) = 0}, where h = (h₁,…,h_m): ℝⁿ → ℝᵐ, m < n,

the functions f and h_j are three times continuously differentiable, and the matrix B(x) with
elements B_ij(x) = ∂h_j/∂x_i is of full rank. His work shows that the local optimizers
(maximizers as well as minimizers) of f under the restrictions h(x) = 0 are stable states of
trajectories corresponding to the systems of differential equations

x'(t) = B(x)λ(x) − s g(x), dh(x)/dt = −h(x).
The discretization (2.1.15) may be used to solve this system. Sequentially alternating the
sign of the constant s after reaching a stable state allows one to find several local minimizers
of f under the restrictions h(x) = 0. Certainly, the above approach does not guarantee that a
global minimizer is reached.
All algorithms based on solving differential equations have the following
disadvantages. First, there are no general results on their convergence to a global
optimizer. Hence, one may guarantee that a global minimizer is found only if it is assured
that all local minimizers are found. Second, the above algorithms are relatively
complicated to implement and investigate. In particular, it is necessary to evaluate the
derivatives of the objective function (for some algorithms also the Hessian), and the
possibility of using finite difference approximations instead of the derivatives is not
obvious. It is difficult to draw a general conclusion on the efficiency of the above
algorithms. It should be noted, however, that there are numerical examples which
demonstrate that for finding a global optimizer the Branin method requires fewer
evaluations of the objective function than the tunneling algorithms. (Note, however, that
the Branin method also requires evaluations of the first and second derivatives.)
Ĵ(x, β) ≈ (1/N) Σ_{j=1}^{N} [f(ξ_j) p(x − βξ_j)] / φ(ξ_j) (2.1.18)
are usually applied, where ξ₁,…,ξ_N are independent realizations of a random vector with
a probability density φ(x) which is positive on X. The use of (2.1.18) implies that the
evaluations of (2.1.16) are subject to random errors. Thus, to minimize (2.1.16) it is
necessary to use a stochastic approximation type algorithm. The solution of the problem
would be facilitated if, analogously to (2.1.18), one could use the Monte Carlo estimators
(2.2.1)
can be determined, where δ is a fixed positive number. If the points x₁,…,x_N are chosen in
X in such a way that the subsets X₁,…,X_N form a covering of the set X

(that is, X ⊂ ∪_{i=1}^{N} X_i),

then the global minimization problem (1.1.1) is solved with accuracy δ with respect to the
function value. In this case, the inequality

min_{1≤i≤N} f(x_i) − f* ≤ δ (2.2.2)

holds, where

f* = min f. (2.2.3)
Methods of selecting points x₁,…,x_N having the above property are called covering
methods and are the main subject of this section.
A covering method may be described in terms of either the subsets X₁,…,X_N or the points
x₁,…,x_N. In the former case one should try to construct the sets X_i of maximal volume
and to reach a simple structure of the sets

X \ ∪_{i=1}^{k} X_i

for all k = 1, 2, …. In the latter case, more formal methods of analysing the quality of point
collections Ξ_N = {x₁,…,x_N} or sequences {x₁, x₂, …} are generally used.
If the way of choosing the points x_i ∈ Ξ_N or the sets X_i (i = 1,…,N) does not depend on the
values which the objective function takes at the points x_j ∈ Ξ_N (j ≠ i), then the point set
Ξ_N = {x₁,…,x_N} is called a grid and the corresponding minimization algorithm is called a
grid (or passive covering) algorithm: these will be considered in Section 2.2.1.
Section 2.2.2 describes sequential (active) covering algorithms, in which all the
previously chosen points x₁,…,x_{k−1} and function values f(x₁),…,f(x_{k−1}) may be used
when choosing the next point x_k (k = 2, 3, …).
Section 2.2.3 deals with the problem of the optimality of global optimization algorithms
and the practical usefulness of optimal algorithms.
A grid is a point set Ξ_N = {x₁,…,x_N} constructed independently of the values f(x_i),
i = 1,…,N, of the function f. A (passive) grid algorithm is a global minimization method
consisting of constructing a grid Ξ_N, computing the values f(x_i) for all x_i ∈ Ξ_N and
choosing the point
(2.2.4)
The main attributes of good grids are the following:
a) simplicity of construction,
b) optimality in some well-defined sense,
c) simplicity of investigation,
d) simplicity of realization on a multiprocessor computer.
The attribute a) is relative: this section shows that only a few grids for simply
structured sets 𝒳 (usually for cubes or hyperrectangles only) can be constructed simply. But
even if a grid is not simply constructed, the corresponding grid algorithm still remains
simple, and only one of its stages is difficult to realize. Let us note that once this stage
is done, it can be used repeatedly if one solves a multicriteria optimization problem or
several global optimization problems for the same set 𝒳.
The attribute b) is studied in detail in Section 2.2.3. It turns out that for some types of
functional classes 𝓕 there exist grid algorithms that are optimal, typically in the minimax
(worst case) sense. But from the practical point of view, this optimality property gives
almost nothing, the reason being that the objective functions appearing in practice are
usually not like the functions at which optimal algorithms perform best. Nonoptimal
algorithms may hence be much better than optimal ones for most other functions from a
given functional class 𝓕.
The attributes c) and d) follow from the fact that the location of the grid points is
independent of the values of f. The attribute d) is evident, and c) is discussed in this
section and in Section 2.2.3.
Grid algorithms may be characterized in many different ways. One of the most
important is the formation of a cover of X by the balls
(2.2.5)
where xiE ::: N, P is a metric on X, e is a radius of the balls. If the sets (2.2.5) form a
cover, i.e.
X ⊂ ∪_{i=1}^N B(x_i, ε, ρ),    (2.2.6)
then any point of X (and thus the global minimizer x* too) belongs to at least one ball of
the collection (2.2.5). If f ∈ Lip(X, L, ρ) then, under (2.2.6), the minimization
problem is solved with accuracy δ = Lε with respect to the values of f, since

f(x_j) − f(x*) ≤ Lρ(x_j, x*) ≤ Lε = δ,

where x_j is the centre of the ball from the collection (2.2.5) to which the global minimizer x*
belongs. Similar results hold for some other functional classes F as well (see Evtushenko
(1985), p. 473).
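The δ = Lε guarantee can be checked numerically; in this sketch (ours, with an arbitrary Lipschitz test function) the grid of interval centres forms an ε = 1/(2N) cover of [0, 1]:

```python
import math

# f is Lipschitz on [0, 1] with constant L = 2*pi, since |f'(x)| <= 2*pi.
L = 2 * math.pi
f = lambda x: math.sin(2 * math.pi * x)
f_star = -1.0                              # true minimum, attained at x = 0.75

N = 50
grid = [(i + 0.5) / N for i in range(N)]   # centres of N equal subintervals
eps = 0.5 / N                              # every x in [0,1] is within eps of the grid
record = min(f(x) for x in grid)

# Covering guarantee (2.2.6): record - f_star <= delta = L * eps.
assert record - f_star <= L * eps
```

The assertion holds for any function with the stated Lipschitz constant, not just this particular choice.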
Thus, the grid algorithms have some theoretically attractive properties; on the other
hand, they have a basic disadvantage, too. Namely, they completely neglect the
information on the objective function that is obtained during the search process. This
disadvantage makes the grid algorithms utterly inefficient, especially for high-dimensional
and complicated optimization problems. In practice, grid algorithms are efficiently used
only for searching the global extremum of several objective functions given on the same
set X (in particular, in multicriteria optimization, see Sobol and Statnikov (1981)), or as
parts of algorithms using random grids or deterministic ones constructed similarly to the
global random search methods of Chapter 5; see Galperin (1988), Niederreiter and Peart
(1986), Galperin and Zheng (1987). Nevertheless, the theoretical significance of the grid
algorithms is non-negligible: this is caused by their minimax optimality, simplicity and use
as a pattern in comparative studies of global optimization algorithms.
The degree of uniformity of a grid in a certain metric serves as its quality criterion. Let
us introduce the appropriate notions.
Let μ be a finite measure on the feasible region X (X is considered as a measurable
space). A point sequence {x_k}, k = 1, 2, ..., is called uniformly distributed in measure μ
in the set X if x_k ∈ X for all k = 1, 2, ... and the asymptotic relationship

S_N(A)/N → μ(A)/μ(X),    N → ∞,    (2.2.7)

is valid for any measurable subset A ⊂ X, where S_N(A) is the number of points x_k
(1 ≤ k ≤ N) belonging to A. If μ = μ_n is the Lebesgue measure and (2.2.7) is valid, then the
sequence {x_k} is called uniformly distributed in X. A grid Ξ_N is called uniform (in
measure μ) if it contains the first N points of a sequence uniformly distributed in
measure μ.
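A quick empirical check of the limit relation (2.2.7) for a random sequence on X = [0, 1] (the subset A and the sample size are arbitrary choices of ours):

```python
import random

random.seed(0)
N = 100_000
xs = [random.random() for _ in range(N)]   # i.i.d. uniform points in X = [0, 1]

# For A = [0.2, 0.5), the limit in (2.2.7) is mu(A)/mu(X) = 0.3.
S_N = sum(1 for x in xs if 0.2 <= x < 0.5)
freq = S_N / N   # empirical frequency, close to 0.3 for large N
```

The deviation of `freq` from 0.3 is of the order N^{-1/2} here, which is exactly the (slow) uniformity rate of random grids discussed below.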
Uniformity is an asymptotic property: degrees of grid uniformity are usually expressed
in terms of dispersion and discrepancy defined as follows.
Let ρ be a metric on X. The value

d_ρ(Ξ_N) = max_{x ∈ X} min_{1 ≤ i ≤ N} ρ(x, x_i)    (2.2.8)

is called the ρ-dispersion of a grid Ξ_N = {x_1, ..., x_N}. If ρ is the Euclidean metric, then
(2.2.8) is called dispersion and is denoted by d(Ξ_N).
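The ρ-dispersion (2.2.8) can be approximated by maximizing over a dense reference set; a one-dimensional sketch of ours:

```python
# Approximate d_rho(Xi_N) = max_{x in X} min_i rho(x, x_i) on X = [0, 1],
# replacing the outer maximum by a maximum over a fine reference grid.
def dispersion(grid, n_ref=10_001):
    ref = [j / (n_ref - 1) for j in range(n_ref)]
    return max(min(abs(x - xi) for xi in grid) for x in ref)

centres = [(i + 0.5) / 10 for i in range(10)]  # cubic grid with p = 10
d = dispersion(centres)                         # 0.05, attained at the endpoints
```

For the cubic grid the exact value 1/(2p) = 0.05 is recovered because the reference grid contains the endpoints, the worst-covered points of [0, 1].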
The dispersion is one of the most generally used characteristics of grid uniformity. The
unit ball in the metric ρ is symmetric with respect to the coordinates for each of the
standard metrics

ρ(x, z) = [Σ_{i=1}^n (x(i) − z(i))²]^{1/2}    (p = 2),    (2.2.10)

ρ(x, z) = Σ_{i=1}^n |x(i) − z(i)|    (p = 1),    (2.2.11)

ρ(x, z) = [Σ_{i=1}^n |x(i) − z(i)|^p]^{1/p}    (1 < p < ∞),    (2.2.12)

ρ(x, z) = max_{1 ≤ i ≤ n} |x(i) − z(i)|    (p = ∞).    (2.2.13)
(The last metric is also called cubic.) However, the above symmetry property is not valid,
e.g., for the metric

ρ(x, z) = Σ_{i=1}^n a_i |x(i) − z(i)|,    (2.2.14)

where the numbers a_1, ..., a_n are nonnegative and not all equal. In the
formulae (2.2.10) through (2.2.14), the values x(i) and z(i) for i = 1, ..., n are the
coordinates of the points x, z ∈ R^n.
The importance of the ρ-dispersion as a characteristic of a grid global optimization
algorithm is explained by the inequality

f(x_N*) − f* ≤ L d_ρ(Ξ_N),    (2.2.15)

which is valid for any function f ∈ Lip(X, L, ρ). More generally, the inaccuracy of a grid
algorithm may be estimated with the help of the ρ-dispersion and the modulus of continuity

ω_ρ(t) = sup {|f(x) − f(y)|: x, y ∈ X, ρ(x, y) ≤ t},    (2.2.16)

namely

f(x_N*) − f* ≤ ω_ρ(d_ρ(Ξ_N)).    (2.2.17)

Inequality (2.2.15) is a special case of (2.2.17), since ω_ρ(t) ≤ Lt for any f ∈ Lip(X, L, ρ).
In this connection it should also be noted that if prior information about the location of the
optimal points is not available, then one should prefer uniform grids to non-uniform ones.
If X is the unit cube, that is,

X = [0, 1]^n,    (2.2.18)

then the discrepancy of a grid Ξ_N is defined as

D(Ξ_N) = sup_A |S_N(A)/N − μ_n(A)|,    (2.2.19)

where the supremum is taken over the rectangles A = [0, v_1) × ... × [0, v_n) ⊂ X and μ_n is
the Lebesgue measure. In the case (2.2.18), dispersion and discrepancy are closely related:
in general, small discrepancy values correspond to small dispersion values. Niederreiter
(1983) proved that for any grid Ξ_N the inequalities
(2.2.20)
are valid. Thus, if exact evaluation of the dispersion is practically impossible, the
inequalities (2.2.20) can be used to estimate its rate of decrease.
The next property of grids is important from the practical point of view. A grid is said
to be composite if it keeps its features when the number of points N is changed to N + 1
(for each N). Many known grids do not possess this property.
Let us define the most popular uniform grids (most of them can be defined for the case
(2.2.18) only). If the feasible set X is not a cube, then a uniform grid can be defined as
follows: find a cube Y containing the set X, construct a grid Ξ in Y, and form the grid Ξ ∩ X
in X.
A cubic grid Ξ_N^1 in X = [0, 1]^n contains N = p^n points

((i_1 + 1/2)/p, (i_2 + 1/2)/p, ..., (i_n + 1/2)/p),    i_1, i_2, ..., i_n ∈ {0, 1, ..., p−1},

where p is a fixed natural number. To construct such a grid, one may divide the cube X
into p^n equal subcubes by dividing every side of X into p equal parts and choose the grid
points as the centres of the subcubes.
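The construction of a cubic grid is a direct transcription of the formula above (function and variable names are ours):

```python
from itertools import product

# Cubic grid in [0, 1]^n with N = p**n points: the centres of the p**n
# equal subcubes obtained by dividing every side into p parts.
def cubic_grid(n, p):
    return [tuple((i + 0.5) / p for i in idx)
            for idx in product(range(p), repeat=n)]

pts = cubic_grid(2, 3)   # 3**2 = 9 points in the unit square
```

Note how the grid size p**n already shows the exponential growth in the dimension n that is discussed later in this section.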
A rectangular grid Ξ_N^2 for the case of the hyperrectangle

X = {x ∈ R^n: a_i ≤ x(i) ≤ b_i, i = 1, ..., n}

is constructed by dividing every side [a_i, b_i] of X into p_i parts of lengths ℓ_i = (b_i − a_i)/p_i
and choosing the points

(a_1 + (i_1 + 1/2)ℓ_1, ..., a_n + (i_n + 1/2)ℓ_n),

where i_k ∈ {0, 1, ..., p_k − 1} for each k = 1, ..., n, and N = p_1···p_n. Of course, all cubic grids are
also rectangular.
A rectangular grid Ξ_N^2 is uniform only in the case p_k = const·(b_k − a_k), k = 1, ..., n. In
this context it should be mentioned that the uniformity property is not invariant under scale
transformations. For example, if a cube is transformed into a (non-cube) hyperrectangle,
then a uniform grid in the cube will induce a nonuniform grid in the hyperrectangle.
Cubic and rectangular grids are the simplest ones. They have some optimality properties (as
will be shown below), but they are not composite.
We call a grid random if it consists of N independent realizations of a random vector
uniformly distributed on X. Random grids Ξ_N^3 are simple to construct (not only for the
case (2.2.18) but also for many other types of sets) and are uniform and composite; on
the other hand, their uniformity characteristics (ρ-dispersion and discrepancy) are far from
optimal.
Random grids are widespread in global optimization theory and practice; there are
two reasons for this. First, if X is neither a cube nor a hyperrectangle, then tremendous
difficulties may be faced when constructing grids with good uniformity
characteristics. Second, if the values of f at random points are known, then one may use
the methods of mathematical statistics to obtain information on the function f and the
location of its extremal points.
Grids Ξ_N^i (i = 4, 5, 6) described below are defined on the unit cube X = [0, 1]^n and are
called quasirandom grids; their elements are called quasirandom points. This
nomenclature refers to the application of these points in many algorithms similar to
Monte Carlo methods that use random points. The uniformity of quasirandom grids is
better than that of random ones, but slightly worse than optimal. From the practical point
of view, quasirandom grids are preferable to the optimal ones (including cubic grids): the
reasons for this preference will be given in Section 2.2.3.
The Hammersley-Halton grids Ξ_N^4 form an important class of quasirandom grids:
they consist of the first N terms of the Halton sequence, which is defined as follows. For
integers η ≥ 2 and k ≥ 1, let

k = Σ_{i=0}^∞ a_i η^i

be the digit expansion of k in base η, and set

φ_η(k) = Σ_{i=0}^∞ a_i η^{−i−1},
and let p_1, ..., p_n be n distinct prime numbers. Then the i-th term x_i of the Halton
sequence is

x_i = (φ_{p_1}(i), ..., φ_{p_n}(i)),    i = 1, 2, ....

The lattice grids Ξ_N^5 consist of the points

x_k = ({k a_1/N}, ..., {k a_n/N}),    k = 1, ..., N,

where {·} denotes the fractional part of a number, and a_1, ..., a_n are to be suitably chosen
from tables (see Korobov (1963), Hua and Wang (1981)). Lattice grids are rather popular
in applied mathematics; note that they are not composite.
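The radical-inverse function φ_η and the resulting Halton points can be sketched as follows (a standard construction; variable names are ours):

```python
# phi_eta(k): write k in base eta and reflect its digits about the radix point.
def radical_inverse(k, base):
    x, denom = 0.0, 1.0
    while k > 0:
        denom *= base
        k, digit = divmod(k, base)
        x += digit / denom
    return x

# i-th Halton point in [0, 1]^n, one distinct prime base per coordinate.
def halton_point(i, primes=(2, 3)):
    return tuple(radical_inverse(i, p) for p in primes)

pts = [halton_point(i) for i in range(1, 9)]   # first 8 points in [0, 1]^2
```

For example, the first point is (1/2, 1/3) and the third is (3/4, 1/9); each coordinate sequence fills [0, 1] in a balanced, non-random way.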
The computation of binary Πτ-grids and η-adic Π₀-grids Ξ_N^6 is more complicated;
see Sobol (1969, 1985), Faure (1982). The η-adic Πτ-grids Ξ_N^6 are composite and have
an additional uniformity property. If the number N of grid points is fixed, then the
grids Ξ_N^4 and Ξ_N^6 are sometimes modified so that i/N is substituted for the last
coordinate of the grid points x_i. This gives grids that are not composite but possess
slightly better uniformity characteristics; see Niederreiter (1978), Faure (1982).
A stratified random grid is obtained if one divides X into m subsets of equal measure
and samples N/m times each probability distribution P_i (i = 1, ..., m) determined by (2.2.21).
The general term in the asymptotic expression (2.2.23) for the discrepancy of quasirandom
grids has the form B(n) N^{−1} log^n N, where B(n) remains bounded for the grids Ξ_N^4,
while B(n) → 0 as n → ∞ for the grids Ξ_N^6.
Let us first consider the main idea of covering methods, supposing that they are designed
for optimizing an objective function with a fixed accuracy δ > 0 with respect to its values.
Let the function f be evaluated at points x_1, ..., x_k. We shall call the value

f_k* = min_{1 ≤ i ≤ k} f(x_i)

a record, and a point x_k* with the function value f(x_k*) = f_k* a record point. Let us
define the set

Z_k = {x ∈ X: f(x) ≥ f_k* − δ}.    (2.2.24)
Obviously,

min_{x ∈ Z_k} f(x) ≥ f_k* − δ.

Points of Z_k are not of interest for further search, since the record f_k* can be improved
on the set Z_k by no more than δ. Consequently, the search may be continued on the set
X\Zk only. In particular, if
Z_k = X,    (2.2.25)

then the initial problem is approximately solved, since f_k* − f* ≤ δ, i.e. the record point x_k*
can be taken as an approximation of x*.
Thus, the construction of a covering method is reduced to the construction of a point
sequence {xk} and a corresponding set sequence {Zk} until the condition (2.2.25) is
fulfilled for some k.
Let us note that the record sequence {f_k*} is non-increasing and the set sequence {Z_k} is
non-decreasing, i.e.

f_1* ≥ f_2* ≥ ...,    Z_1 ⊆ Z_2 ⊆ ....
Let us note also that the efficiency of covering methods depends significantly on the
closeness of the records f_k* to the optimal value f*, since the size of the sets Z_k crucially
depends on these values. Consequently, to improve the efficiency of a covering method one
should apply a local minimization technique immediately after obtaining a new record point.
This gives an opportunity to decrease the incumbent records and, hence, to increase the size
of the sets Z_k.
Let us consider the covering method for the case F = Lip(X, L, ρ); methods for
functions having a gradient that satisfies the Lipschitz condition are constructed
analogously, see Evtushenko (1985), p. 472.
In the case F = Lip(X, L, ρ), for every x, y ∈ X one has

f(x) ≥ f(y) − Lρ(x, y).    (2.2.28)

Hence it follows that the inequality f(x) ≥ f_k* − δ holds for any
x ∈ B(x_j, η_jk, ρ), where 1 ≤ j ≤ k and η_jk = (f(x_j) − f_k* + δ)/L. Thus, we may take

Z_k = ∪_{j=1}^k B(x_j, η_jk, ρ).    (2.2.29)
For all k ≥ 1, a new point x_{k+1} can be obtained in different ways. For instance, one may
choose the new points among the points of a grid. Devroye (1978) proposed random grids
for this purpose; see Algorithm 3.1.4 later.
In the class of covering methods based on (2.2.29), the method proposed by Piyavskii
(1967, 1972) and, independently, by Shubert (1972) is the most popular. At each k-th
iteration this method selects the point

x_{k+1} = arg min_x max_{1 ≤ i ≤ k} [f(x_i) − Lρ(x, x_i)],    (2.2.30)

the minimum point of the function

max_{1 ≤ i ≤ k} [f(x_i) − Lρ(x, x_i)],    (2.2.31)

which is a lower bound of the objective function f. Minimization in (2.2.30) is done either on
X or on X\Z_k.
Following the Russian literature, we shall call the method (2.2.30) a polygonal line
method. This nomenclature originates from the fact that (2.2.31) is a polygonal line in the
case when n = 1 and ρ is the Euclidean metric. The deficiency of the polygonal line method
lies in the complexity of the auxiliary extremal problems (2.2.30). Wood (1985) described
a constructive (but cumbersome) multidimensional variant of the method for the Euclidean
metric case. Lbov (1972) proposed to solve the auxiliary extremal problems by the
simplest global random search algorithm. Lbov's variant of the polygonal line method is
itself a global search algorithm, similar to the Devroye (1978) algorithm mentioned
above. A scheme generalizing the polygonal line method was also developed by Meewella
and Mayne (1988); related but different approaches will be treated in Section 2.3.
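A one-dimensional sketch of the polygonal line method (our illustration; the interval-wise formulas follow from the elementary geometry of the minorant (2.2.31)):

```python
# Piyavskii-Shubert in one dimension: repeatedly evaluate f at the minimum
# point of the polygonal minorant max_i [f(x_i) - L*|x - x_i|].
def piyavskii(f, a, b, L, n_evals=40):
    xs, ys = [a, b], [f(a), f(b)]
    for _ in range(n_evals - 2):
        order = sorted(range(len(xs)), key=xs.__getitem__)
        best_bound, best_x = None, None
        for i, j in zip(order, order[1:]):
            # Lowest point of the minorant on [xs[i], xs[j]] and its value.
            x_new = 0.5 * (xs[i] + xs[j]) + (ys[i] - ys[j]) / (2 * L)
            bound = 0.5 * (ys[i] + ys[j]) - 0.5 * L * (xs[j] - xs[i])
            if best_bound is None or bound < best_bound:
                best_bound, best_x = bound, x_new
        xs.append(best_x)
        ys.append(f(best_x))
    k = min(range(len(xs)), key=ys.__getitem__)
    return xs[k], ys[k]

# L = 2 overestimates the true Lipschitz constant 1.4 of f on [0, 1].
x_rec, f_rec = piyavskii(lambda x: (x - 0.7) ** 2, 0.0, 1.0, L=2.0)
```

Taking L strictly larger than the true Lipschitz constant keeps every new point in the interior of its subinterval; with L too small the method may miss the global minimizer.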
To simplify the basic covering method, one may use suitable subsets X_k of Z_k instead
of Z_k. An algorithm using hyperrectangles as subsets was proposed and studied by
Evtushenko (1971, 1985). The one-dimensional variant of the Evtushenko algorithm has
the form
x_1 = a,    x_{k+1} = x_k + (f(x_k) − f_k* + 2δ)/L,    k = 1, 2, ....    (2.2.32)

If x_k > b, then the iteration (2.2.32) is terminated. Here X = [a, b] and the subsets X\Z_k are
intervals. The most unfavourable case for algorithm (2.2.32) is that of a decreasing
function f, for which the points (2.2.32) are at equal distances.
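A sketch of such a left-to-right covering sweep (our own reconstruction in the spirit of Evtushenko's scheme; the step size is chosen conservatively so that the δ-guarantee holds without relying on the next evaluation):

```python
# Sweep [a, b] left to right; f is Lipschitz with constant L. The step from x
# is (f(x) - record + delta)/L, so no point with f below record - delta can be
# jumped over; hence the final record satisfies record <= f* + delta.
def sweep_minimize(f, a, b, L, delta):
    x, record_x, record = a, a, f(a)
    while x <= b:
        fx = f(x)
        if fx < record:
            record_x, record = x, fx
        x += (fx - record + delta) / L   # step is at least delta/L > 0
    return record_x, record

x_rec, f_rec = sweep_minimize(lambda x: (x - 0.3) ** 2, 0.0, 1.0,
                              L=2.0, delta=0.01)
```

Note the behaviour the text describes: while f decreases, f(x) stays close to the record and the steps stay short; after the record region is passed, f(x) − record grows and the steps lengthen.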
Brent (1973) proposed a similar algorithm for the one-dimensional functional class
F = {f: |f″(x)| ≤ M}, where M is assumed known. Other similar one-dimensional
algorithms can be found in Beresovsky and Ludvichenko (1984), Vasil'ev (1988).
Evtushenko (1985) constructed a multidimensional variant of his algorithm (2.2.32).
In it, the sets X\Z_k are hyperrectangles and the sets X_k are unions of cubes with centres
at points obtained by a one-dimensional global optimization algorithm.
Evtushenko's algorithm is cumbersome, but it does not require much computer memory or
complicated auxiliary computations. Numerical study of the algorithm indicates that
for dimensions n > 3 it requires an excessive number of evaluations of f, and thus is
inefficient.
There are strong objections against applying covering methods directly in the
multidimensional case. First, the methods are cumbersome and complicated to realize.
Second, their efficiency depends considerably on the prior information about f, i.e. on the
choice of the functional class F. In the most important case F = Lip(X, L, ρ), the functional
class F is determined by a Lipschitz constant L and the metric ρ. The inclusion
f ∈ Lip(X, L, ρ) for some constant L and some metric ρ is usually a plausible conjecture,
but the exact (minimal) value of the constant L is generally unknown, and depends also on
the choice of X, the metric ρ, and the variable scales. Section 2.2.3 will demonstrate that an
unfortunate choice of variable scales leads to inefficiency of optimal algorithms (even for a
known Lipschitz constant L). The same criticism is valid for any covering method.
Third, the number of objective function evaluations is excessive in the multidimensional
case. Let us analyse this number for the case F = Lip(X, L). Let X be a ball with radius η.
The volume of X is

vol(X) = η^n π^{n/2} / Γ(n/2 + 1),

where Γ denotes the gamma-function. Let x_1, ..., x_N be the points of function evaluation
and M = max f. One may only guarantee that f(x) ≥ f* for a point x ∈ X if f has been
evaluated at some point x_i with x ∈ B(x_i, (f(x_i) − f*)/L). Hence the balls

B(x_i, (f(x_i) − f*)/L),    i = 1, ..., N,

must cover X to assure that the global minimizer has not been overlooked. The joint
volume of these N balls is smaller than

N ((M − f*)/L)^n π^{n/2} / Γ(n/2 + 1).    (2.2.34)

Comparing this quantity with the volume of X, we see that the covering is possible only if
N ≥ (Lη/(M − f*))^n, a number growing exponentially in the dimension n.
If the dimension n of the set X or the required solution accuracy increases, then the
number of objective function evaluations rapidly increases for every grid algorithm.
For example, for the cubic grid algorithm we have

N ≈ (√n / (2ε))^n,

where ε is the accuracy in the argument values, in the Euclidean metric. Numerical results
for the covering methods described above show that for n > 3 all of them have poor
efficiency. In this connection the question of best possible covering methods is of (mainly
theoretical) interest.
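The exponential growth of the required grid size can be seen directly from the N ≈ (√n/(2ε))^n estimate (the accuracy value below is illustrative):

```python
import math

# Number of cubic-grid points needed for accuracy eps in the argument
# (Euclidean metric): N is about (sqrt(n)/(2*eps))**n.
def cubic_grid_size(n, eps):
    return math.ceil((math.sqrt(n) / (2 * eps)) ** n)

sizes = {n: cubic_grid_size(n, 0.1) for n in (1, 2, 5, 10)}
```

For ε = 0.1 the count grows from 5 points in one dimension to roughly 10^12 points in ten dimensions, which makes direct grid search hopeless for even moderately high n.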
Let the number of steps (i.e. evaluations of the objective function f) be bounded a priori
by a number N. Every deterministic global minimization algorithm
d^N = (d_1, ..., d_{N+1}) is determined by mappings

x_k = d_k(L_k),    k = 1, ..., N + 1.

That is, first a point x_1 is chosen; then each point x_k (k = 2, ..., N) may depend on all
preceding arguments and the corresponding function values: x_k = d_k(L_k), where

L_k = {x_1, f(x_1), ..., x_{k−1}, f(x_{k−1})}.

The estimator of an optimizer x* is the point x_{N+1} = d_{N+1}(L_{N+1}). It is usually determined
by x_{N+1} = x_N*, and its inaccuracy ε(f, d^N) = f(x_{N+1}) − min_X f leads to the quantity

ε(d^N) = sup_{f ∈ F} ε(f, d^N),    (2.2.35)
corresponding to the minimax approach for measuring the efficiency of algorithms. (Let
us remark right here that other approaches exist too: e.g. the Bayesian efficiency measure
may lead to a class of stochastic algorithms, as will be described in Section 2.4.)
Let D(N) be the set of all global minimization algorithms in which the number of steps
does not exceed N, and P(N) be the set of all N-point grid algorithms in which x_{N+1} = x_N*.
It is clear that P(N) ⊂ D(N). An algorithm d_*^N is said to be optimal (in the minimax sense)
if

d_*^N = arg min_{d^N ∈ D(N)} ε(d^N).    (2.2.36)
There exists another way of defining optimality for global optimization algorithms. Its
essence lies in fixing a bound ε_0 for the inaccuracy (2.2.35) and then minimizing the
number of steps N over the set of algorithms d^N satisfying ε(d^N) ≤ ε_0. This approach
makes it possible to introduce definitions similar to those considered above.
Let us now assume that F = Lip(X, L, ρ). Then Sukharev (1971, 1975) proved that an
optimal passive algorithm (see (2.2.37)) is the grid algorithm built on the grid Ξ_N^8
consisting of the centres of balls B(x_i, ε, ρ) that form a minimal cover of X. For this
algorithm, the guaranteed accuracy is equal to the radius of the balls (forming the optimal
cover of X) multiplied by L.
In the works mentioned it is also proved that the optimal passive algorithms (2.2.37)
are also optimal, in the minimax sense, among the sequential ones (2.2.36), that is,

min_{d^N ∈ D(N)} ε(d^N) = min_{d^N ∈ P(N)} ε(d^N).    (2.2.38)
(The maximum over F is attained at a saw-tooth function whose values at all points
x_1, ..., x_N are equal; details can be found in the papers cited.)
In terms of game theory, the above result may be explained as follows. A researcher
chooses a minimization algorithm which is known to his adversary (say, nature or an
oracle). The latter then selects the function in F most unsuitable for the algorithm. This is
the saw-tooth function mentioned above, having a maximal rate of variation and equal
values at the points generated by the algorithm. In this way the adversary eliminates the
possibility of collecting valuable information about the objective function. Thus, passive
algorithms are not worse (in the worst case sense) than sequential ones.
The supposition that the objective function is chosen by an adversary who knows the
optimization algorithm is, of course, doubtful in practical optimization problems. The
conclusion obtained (concerning the optimality of passive algorithms) raises additional
doubts about the adequacy of the minimax approach. Note in this context that this approach
is of great interest for uniextremal or convex functions, see Kiefer (1953), Chernousko
(1960); for multiextremal problems, however, it calls for additional problem
specifications.
Sukharev (1975) also studied a different but related concept of best algorithms. To
construct these algorithms, he supposed that nature may deviate from its optimal strategy:
this concept leads to a certain sequential algorithm, constructed as follows. Let the
number N of objective function evaluations be fixed and suppose k evaluations have already
been made at points x_1, ..., x_k. Then the point x_{k+1} is chosen as the centre of a ball
belonging to the optimal cover of the set X\X_k, where

X_k = ∪_{i=1}^k B(x_i, η, ρ)

and η is the fixed lower bound of the radii of the balls forming the optimal N-ball cover.
For the worst function, the points x_1, ..., x_N generated by this type of algorithm coincide
with the optimal grid algorithm points. For other functions, the accuracy of the
sequentially best algorithm may be better. The main problem with Sukharev's algorithm,
however, lies in its very complicated construction, since optimal covers have to be built at
every step of the algorithm.
If N = p^n, where p is a natural number, and the metric ρ = ρ_0 is cubic (i.e. (2.2.13)
holds), then the cubic grid algorithm is the minimax optimal global minimization
algorithm for the Lipschitz functional class F = F_0 = Lip(X, L, ρ_0). The same algorithm is
also optimal for the more general functional class F_s, s ≥ 0, consisting of functions whose
derivatives up to order s are from F_0; see Ivanov (1971, 1972). In the latter case, the
algorithm has a guaranteed accuracy of order O(L/N^{(s+1)/n}) for N → ∞. Ivanov et al.
(1985) described another approach to the construction of an order-optimal algorithm in
F_s. It consists of minimizing a spline built using the values of f at uniformly
chosen grid points.
As stated above, optimal (in the above sense) deterministic algorithms can be built
under the supposition that the adversary knows the minimization algorithm. The
researcher may attempt to increase his gain (that is, the guaranteed accuracy) by
using a randomized strategy. In this case it is supposed that nature does not know the
researcher's strategy, but knows its statistical characteristics. Sukharev (1971) considered
passive randomized algorithms α^N for the case F = F_0. They are determined by the joint
probability distribution of N random vectors in X; we denote these distributions by
α^N(dΞ_N). The mean accuracy of a randomized algorithm α^N for a function f is the value

ε(f, α^N) = ∫ ε(f, d^N) α^N(dΞ_N).
Sukharev (1971) proved that in the one-dimensional case, for X = [0, 1], the optimal
randomized algorithm α_*^N is determined by a random choice of the grids

Ξ_{N,1} = {0, 2/(2N−1), ..., (2N−2)/(2N−1)}  and  Ξ_{N,2} = {1/(2N−1), 3/(2N−1), ..., 1}

with equal probabilities 0.5. This algorithm has the guaranteed mean accuracy
L/(2(2N−1)). Note that this value is almost half the guaranteed accuracy L/(2N) of the
optimal deterministic algorithm: the latter is the cubic grid algorithm for the grid

Ξ_N = {1/(2N), 3/(2N), ..., (2N−1)/(2N)}.
Sukharev (1971) also proved an analogous inequality for all n ≥ 1.
Let us note here that, in spite of Shubert's (1972) assertion, the polygonal line algorithm
(2.2.30) is not one-step optimal (as is easily seen already for N = 2); following Sukharev
(1981), a slight modification of the algorithm is one-step optimal. Sequential
algorithms will be studied later.
The aim of the further study in this section is to show that minimax-optimal global
minimization algorithms may have very poor efficiency in realistic optimization problems.
The results below question once more the practical significance of the above and similar
(minimax-type) optimality criteria and of the optimal algorithms derived from them.
Let X = [0, 1]^n, F = Lip(X, L), and let the objective function f depend on s variables
(s < n) with coordinate indices i_1, ..., i_s (1 ≤ i_1 < ... < i_s ≤ n), but not on all n variables.
Let us denote by K = K(i_1, ..., i_s) the corresponding s-dimensional cube, that is, the s-face
of the cube X in which the variables with indices i_1, ..., i_s vary from zero to one and all
others are equal to zero.
A grid Ξ_N on X induces the grid Ξ_N(s) on K consisting of the projections of the grid Ξ_N
points onto K.
For the dispersion of the grid Ξ_N(s) we have the estimate

(2.2.39)

and, replacing (2.2.15), the accuracy bound

f(x_N*) − f* ≤ L d(Ξ_N(s)).    (2.2.40)
In fact, the case of a function defined on an n-dimensional set but depending only on s
variables is hardly probable in practice. It is usual, however, that an objective
function depends on all n variables but the degree of dependence differs: there is a
group of essential variables that influences the function behaviour more intensively than
the others. In other words, in complicated cases one may expect the objective function
f to have the form f = h + g, where h ≫ g and h depends only on s variables (s < n). But in
this case the accuracy of a grid-type minimization algorithm depends on the values d(Ξ_N(s))
for the various s < n, i_1, ..., i_s, and not on the value d(Ξ_N) only. Thus, the collection
{d(Ξ_N(s))} for the various s ≤ n, i_1, ..., i_s is a natural vector criterion of a grid algorithm.
Let us use this indicator to compare the cubic, random, and quasirandom grids.
If N = p^n and Ξ_N = Ξ_N^1 is a cubic grid, the grid Ξ_N^1(s) contains only p^s distinct points.
This gives

d(Ξ_N^1(s)) = (√s/2) p^{−1} = (√s/2) N^{−1/n};    (2.2.41)

see Sobol (1982). The rate of decrease of the values d(Ξ_N^1(s)) for N → ∞ is thus not
influenced by s. For small values of s, this rate is much worse than the optimal rate N^{−1/s}.
At the same time, for every N the projections Ξ_N(s) of the random Ξ_N^3 and quasirandom
Ξ_N^i (i = 4, 5, 6) grids contain N distinct points. For the discrepancies of the projections,
one has with probability one

D(Ξ_N^3(s)) = O(N^{−1/2} (log log N)^{1/2}),    N → ∞,

for random grids, and

D(Ξ_N^i(s)) = O(N^{−1} log^s N),    N → ∞,  i = 4, 5, 6,

for quasirandom grids. These relations, together with the right side of the inequality
(2.2.39), imply that for random grids (with probability one) we have

d(Ξ_N^3(s)) = O(N^{−1/(2s)} (log log N)^{1/(2s)}),    N → ∞,    (2.2.42)

and

d(Ξ_N^i(s)) = O(N^{−1/s} log N),    N → ∞.    (2.2.43)
By (2.2.41)-(2.2.43) we arrive at the conclusion that the quasirandom grids qualitatively
surpass the random and cubic ones by the criterion d(Ξ_N(s)) for all s < n, and the random
grids surpass the cubic ones for s < n/2. The rate of decrease of d(Ξ_N(s)) for N → ∞ is
nearly optimal for quasirandom grids for all s ≤ n.
Thus, for the functional subset F ⊂ Lip(X, L) under consideration, the cubic grids are
worse (in the above sense) than quasirandom ones and may even be worse than random
grids. Recall that the cubic grid minimization algorithm is optimal for the case
F = Lip(X, L, ρ_0), where ρ_0 is the cubic metric, and is optimal in order for F = Lip(X, L).
We shall now demonstrate that a similar situation occurs when it is supposed that
f ∈ Lip(X, L, ρ), while in fact f ∈ Lip(X, L, ρ_1), where ρ and ρ_1 are metrics for which
Lip(X, L, ρ_1) ⊂ Lip(X, L, ρ).
Let X = [0, 1]^n and f ∈ F_1 = Lip(X, L, ρ_1), where the metric ρ = ρ_1 is defined by (2.2.14);
furthermore, let

L_i = a_i L,    i = 1, ..., n,  where 0 ≤ a_i ≤ 1.    (2.2.44)

The relation f ∈ F_1 follows from f ∈ F_2 = Lip(X, L, ρ_2), where the metric ρ = ρ_2 is defined
by (2.2.11). The condition f ∈ F_2 is typical for f in theory as well as in practice
(because precise information about f is always absent, and while formulating the
Lipschitz condition on f one usually chooses a metric from the collection (2.2.10)-
(2.2.13), which contains metrics having the unit-ball symmetry property discussed
earlier).
The relation f ∈ F_1 means that

|f(x) − f(z)| ≤ Σ_{i=1}^n L_i |x(i) − z(i)|,

where L_i = a_i L ≤ L for all i = 1, ..., n. For any function f ∈ F_2 the true (minimal)
constants L_i exist. They are usually unknown, but it is precisely they (and not L) that
determine the true accuracy of a global optimization algorithm.
We shall suppose that a_i > 0 for each i = 1, ..., n (the opposite case was considered
earlier) and introduce the arithmetic and geometric mean values

ā = (1/n) Σ_{i=1}^n a_i,    ã = (a_1 ··· a_n)^{1/n}.

According to (2.2.15), the value d_{ρ_1}(Ξ_N) is a natural quality characteristic of a grid Ξ_N for
the case F = Lip(X, L, ρ_1).
If a_i > 0 for each i = 1, ..., n and the natural numbers n, N are arbitrary, then (see Sobol
(1987))

d_{ρ_1}(Ξ_N) ≥ (1/2) (n!)^{1/n} ã N^{−1/n}    for every grid Ξ_N,    (2.2.45)

d_{ρ_1}(Ξ_N^1) = (1/2) n ā N^{−1/n}    for the cubic grid Ξ_N^1,    (2.2.46)

and

d_{ρ_1}(Ξ_N^6) ≤ c (n!)^{1/n} ã N^{−1/n}    (2.2.47)

for an η-adic Πτ-grid Ξ_N^6, where the constant c does not depend on n, N.
The formulas (2.2.45) and (2.2.46) lead to the following conclusions. If all the values L_i
in (2.2.44) are positive, then cubic grids are optimal in order, but the ratio of the right
sides of (2.2.45) and (2.2.46) can be arbitrarily small (for sufficiently large n). If some
values L_i equal zero, then the cubic grids are not optimal in order (this corresponds to the
case considered above). If L_1 = ... = L_n, then the ratio of the right sides of (2.2.45) and
(2.2.46) exceeds e^{−1} for all n and N: this is again evidence of the high (theoretical)
efficiency of cubic grids in this case.
Comparing now the right sides of (2.2.45) and (2.2.47), we find that quasirandom
Πτ-grids are optimal in order, uniformly with respect to the values L_1, ..., L_n. Since these
values are usually unknown and may be rather different from each other, uniformly
optimal (in order) grids are to be preferred to the grids that are optimal for the case
L_1 = ... = L_n. In particular, according to the above criterion, Πτ-grids are better than cubic
ones (though in the classical minimax sense the opposite judgement holds).
2. Renumber the k points x_1, ..., x_k, with objective function values f(x_i), i = 1, ..., k, in
increasing order. Then we have a = x_1 < x_2 < ... < x_k = b.
3. For all subintervals Δ_i = [x_i, x_{i+1}], i = 1, ..., k−1, of the interval X = [a, b] calculate the
interval characteristic value R(i), which depends on the vertices of the subinterval Δ_i, the
best function value f_k* found so far, and the function values at the vertices. Choose the
index (one of the indices)

j = arg max_{1 ≤ i ≤ k−1} R(i).

4. If the length of the subinterval Δ_j is less than or equal to a fixed small positive number ε
(that is, x_{j+1} − x_j ≤ ε), then the algorithm stops (other numerical stopping criteria may also
be used). Otherwise take x_{k+1} = S(j), where S
is a function independent of the objective function values not belonging to the subinterval
Δ_j, except the value f_k* (the function R has the same property).
The convergence of the above general partition scheme was investigated in Pinter (1983)
and generalized to the multidimensional case in Pinter (1986a, b).
Let us consider the forms of the functions R, S generated by some well-known
algorithms. To obtain the Strongin minimization algorithm (see Strongin (1978)) under the
supposition f ∈ Lip(X, L), one should choose

(2.3.1)

S(j) = (x_j + x_{j+1})/2 − (f(x_{j+1}) − f(x_j))/(2L).    (2.3.2)
In the case F = Lip(X, L, ρ), the functions R, S for the Strongin algorithm are as follows:

(2.3.3)

S(j) = (x_j + x_{j+1})/2 − (1/2) sign(f(x_{j+1}) − f(x_j)) η^{−1}(|f(x_{j+1}) − f(x_j)|/L),    (2.3.4)

where η(z) = ρ(z, 0) and η^{−1} is the inverse function of η. The polygonal line method of
Piyavskii-Shubert (2.2.30) in the case F = Lip(X, L) is obtained by defining

R(i) = L(x_{i+1} − x_i)/2 − (f(x_i) + f(x_{i+1}))/2    (2.3.5)
and determining S by (2.3.2). (Note that, as can easily be seen, if one wants to
maximize an objective function f, then it is necessary to change the minus sign to plus in
the last terms of (2.3.1)-(2.3.5).) Note also that the one-dimensional algorithms
presented later in Section 2.4 can also be represented within the framework of Algorithm 2.3.1.
2. Choose a random line passing through the point x_k (or a random direction
emanating from x_k) by generating an isotropic probability distribution with centre x_k (for
instance, the uniform distribution on the surface of the unit sphere S = {x ∈ R^n: ||x − x_k|| = 1}
can be used).
3. Choose five equidistant points on this line with x_k as their middle point, and evaluate
the objective function at these points.
6. Choose x_{k+1} as the point with the smallest objective function value in the point set
containing the five points from Step 3 and at most three points from Step 5.
Note that there are many algorithms similar to Algorithm 2.3.2; an essential part of the
method is a fourth-degree polynomial approximation of the restriction of f to the chosen
line.
The multistep reduction scheme is based on the representation

min_{x ∈ X} f(x) = min_{0 ≤ x(1) ≤ 1} min_{0 ≤ x(2) ≤ 1} ... min_{0 ≤ x(n) ≤ 1} f(x(1), ..., x(n)),    (2.3.6)

where f is a continuous function on the unit cube X = [0, 1]^n and the point x ∈ X has the
coordinates x(i), i = 1, ..., n. In particular, for n = 2 the representation (2.3.6) gives

min_{x ∈ X} f(x) = min_{0 ≤ x(1) ≤ 1} φ(x(1)),

where

φ(x(1)) = min_{0 ≤ x(2) ≤ 1} f(x(1), x(2)).
One may use the formula (2.3.6) for reducing the original optimization problem on a cube
to a number of one-dimensional global optimization problems (but usually this number is
very large). If n ≥ 3 and the functional class F is broad enough (such as Lip(X, L)), then
the algorithms based on (2.3.6) are cumbersome and inefficient. But for some relatively
narrow classes F the multistep reduction scheme may serve as the basis of efficient
algorithms. For example, if the objective function f is separable, i.e.

f(x) = Σ_{i=1}^n f_i(x(i)),

where f_1, ..., f_n are one-dimensional functions, then according to (2.3.6) one has

min_{x ∈ X} f(x) = Σ_{i=1}^n min_{0 ≤ x(i) ≤ 1} f_i(x(i)).
In this case to solve the n-dimensional optimization problem means to solve n one-
dimensional ones. In a practically more important case, f is represented as follows
n-l
f(x)= I f.(x(i),x(i +1» (23.7)
i=l 1
where
<P2( x(2» = min f l(x(l), x(2», <p. (x(i» = min [<p. _l(x(i - 1» + f . _lex (i - l),x(i»]
x(l) 1 x(i-l) 1 1
for i~3. Hence, to solve a global optimization problem for the objective function (2.3.7),
one may tabulate one-dimensional functions <Pi (i=2, ... ,n) whose values are solutions of
one-dimensional global optimization problems. A more detailed description of the
multistep reduction scheme is contained in Strongin (1978).
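On a discrete grid, the recursion above amounts to a single dynamic-programming sweep. The sketch below is illustrative only (the chain terms and the grid are assumptions, not taken from the book); it tabulates φ_i exactly as in the displayed recursion.

```python
def chain_min(terms, grid):
    """Minimize f(x) = sum_i f_i(x(i), x(i+1)) over a discrete grid by the
    multistep reduction recursion:
      phi_2(x2) = min_{x1} f_1(x1, x2),
      phi_i(xi) = min_{x_{i-1}} [phi_{i-1}(x_{i-1}) + f_{i-1}(x_{i-1}, xi)]."""
    # phi holds the tabulated one-dimensional function phi_i on the grid
    phi = [min(terms[0](u, v) for u in grid) for v in grid]
    for f_i in terms[1:]:
        phi = [min(p + f_i(u, v) for p, u in zip(phi, grid)) for v in grid]
    return min(phi)

# illustrative chain: f(x) = (x1-x2)^2 + (x2-x3)^2 + (x4-0.5)^2-type terms
terms = [lambda u, v: (u - v) ** 2,
         lambda u, v: (u - v) ** 2,
         lambda u, v: (v - 0.5) ** 2]
grid = [i / 10 for i in range(11)]
print(chain_min(terms, grid))  # 0.0, attained on the grid
```

Each φ_i is a one-dimensional table, so the n-dimensional problem is replaced by n − 1 one-dimensional minimizations, as the text describes.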
Another way of reducing multidimensional global optimization problems to one-
dimensional ones is dimension reduction by means of Peano curve type mappings.
These are continuous maps φ of the interval [0,1] onto the cube X = [0,1]^n. Using them one
has
Global Optimization Methods 61
The possibility of using one-dimensional algorithms for optimizing the objective function
g(t) = f(φ(t)) follows from the next proposition due to Strongin (1978): if f ∈ Lip(X,L),
then g ∈ Lip([0,1], M_n(L), ρ_n), where M_n(L) is a constant depending on L, n, φ and ρ_n is
the metric defined by the formula

ρ_n(t,t') = |t − t'|^{1/n}.
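For n=2, a Peano-type mapping can be approximated at finite resolution by the standard Hilbert-curve index-to-cell construction. The sketch below is illustrative (it is not Strongin's particular mapping; the order, test function and scan density are assumptions) and shows how a one-dimensional scan of g(t) = f(φ(t)) probes the square.

```python
def hilbert_point(order, d):
    """Map an index d in [0, 4**order) to a cell (x, y) of the 2**order grid
    along a Hilbert curve; consecutive indices give adjacent cells, the
    discrete analogue of the continuity of phi: [0,1] -> [0,1]^2."""
    x = y = 0
    s, t = 1, d
    while s < (1 << order):
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                 # rotate the quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def g(f, order, t):
    """One-dimensional reduction g(t) = f(phi(t)) at resolution 2**order."""
    n = 1 << order
    x, y = hilbert_point(order, min(int(t * n * n), n * n - 1))
    return f((x + 0.5) / n, (y + 0.5) / n)

# minimize an illustrative f(x,y) by scanning t over [0,1]
f = lambda x, y: (x - 0.25) ** 2 + (y - 0.75) ** 2
best = min(g(f, 6, k / 4095) for k in range(4096))
print(best)  # small: the scan visits a cell near (0.25, 0.75)
```

In practice one would apply a one-dimensional Lipschitzian algorithm to g rather than a full scan; the point here is only the reduction itself.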
point. If splines are used for approximation in the above methods, then they may possess
some optimality properties, see Pevnyi (1982, 1988), Ganshin (1977). In spite of this, the
mentioned methods have no significant practical importance, because the goodness criteria
for optimization and approximation algorithms differ in essence (viz. approximation
accuracy has to be assured on the whole set X, while in optimization only in the vicinity of a
global optimizer). Numerical investigations confirm this. For example, Chujan
(1986) located the global optimizer of some one-dimensional functions several hundred
times more economically than the mentioned spline-based algorithm of Pevnyi (1982).
Ideas of approximation are fruitfully applied in global optimization methodology in the
following way: a rough approximation of the objective function is constructed (not
necessarily in an explicit form); nonpromising subsets of X are determined by this
approximation and are excluded from further search; in the remaining subset of X, a
more accurate approximation is built, and analogous operations are carried out until a
given accuracy is attained. A considerable part of the global random search algorithms treated
in this book are constructed according to this principle.
Another way of applying approximation in global search is to construct algorithms
using multiple approximation of the objective function projections (they may be one-, two-
or more dimensional): a typical example is Algorithm 2.3.2.
Global optimization algorithms consisting of the solution of ordinary differential
equations or systems of such equations are important as well: these algorithms were
described in Section 2.1.3. The connection between differential equation and global
optimization theories is based on the fact that the trajectories corresponding to solutions of
certain classes of differential equations contain (or converge to) one or more local
optimizers of a given function.
Another type of connection is between stochastic differential equation and global
optimization theories. Section 3.3.3 shows for some functional classes F that if ε(t)
approaches zero slowly enough as t tends to infinity, then the trajectory corresponding to
the solution of the stochastic differential equation
x*(j) = lim_{λ→∞} ∫ x(j) exp{λ f(x)} dx / ∫ exp{λ f(x)} dx.     (2.3.8)
and
General conditions, under which relations (2.3.9) and (2.3.10) hold, will be given in
Section 5.2.2.
If f is a continuous nonnegative function, then for any λ>0 the evident inequality

(2.3.12)

(opposite to (2.3.11)) is fulfilled, then the point x* cannot be a global maximizer of f.
The condition (2.3.12) is sufficient for a point not to be a global maximizer. It is non-
constructive and thus seems to be of small practical significance. Namely, if when evaluating the
integrals in (2.3.12) one finds a point x^(0) ∈ X such that f(x^(0)) > f(x*), then x* is surely not
a global maximizer; and if such a point x^(0) is not found, then the inequality (2.3.12) for
estimators of the integrals will not be valid.
Analogously to (2.3.11) and (2.3.12), a necessary and sufficient condition
for a point x* to be a global maximizer can be obtained (this condition is also non-
constructive). The representations (2.3.8) - (2.3.10) are more constructive: they are basic
for some global optimization algorithms (e.g. see Ivanov et al. (1985)) involving
simultaneous estimation of several integrals. (Note that the problem of optimal
simultaneous Monte Carlo estimation of several integrals will be studied in Section 8.2.)
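Representation (2.3.8) can be turned into a simple numerical procedure by estimating both integrals with the same Monte Carlo sample. The sketch below is illustrative only (the sample size, the value of λ and the test function are assumptions):

```python
import math
import random

def softmax_argmax(f, n, lam, samples, rng=random.Random(0)):
    """Estimate x*(j) = lim ∫ x(j) e^{λf(x)} dx / ∫ e^{λf(x)} dx on [0,1]^n
    by simultaneous Monte Carlo estimation of the two integrals with one
    uniform sample (large λ concentrates the weight near the maximizer)."""
    num = [0.0] * n
    den = 0.0
    for _ in range(samples):
        x = [rng.random() for _ in range(n)]
        w = math.exp(lam * f(x))
        den += w
        for j in range(n):
            num[j] += x[j] * w
    return [nj / den for nj in num]

# illustrative f with global maximizer at (0.3, 0.7)
f = lambda x: -((x[0] - 0.3) ** 2 + (x[1] - 0.7) ** 2)
est = softmax_argmax(f, 2, lam=200.0, samples=20000)
print(est)  # roughly [0.3, 0.7]
```

For finite λ the estimator is biased toward the centroid of the near-optimal region; the limit λ → ∞ in (2.3.8) removes this bias at the cost of growing Monte Carlo variance, which motivates the optimal simultaneous estimation studied in Section 8.2.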
The main idea of branch and bound strategies is the sequential rejection of those subsets of
X that cannot contain a global minimizer, and then searching only in the remaining
subsets (regarded as prospective).
At the k-th iteration of a branch and bound method it is necessary to construct a
partition (i.e. branching) of the optimization region (at the first iteration this region is X)
into a finite number I_k of subsets X_i, on which lower bounds L_i of the minima
m_i = min_{x∈X_i} f(x) are computed; the record subset, attaining

L = min_{i∈I_k} L_i,

is then partitioned (branched) into smaller subsets, and the iterations are continued.
The convergence problem and implementation aspects for the above class of (global
optimization) methods are investigated under different conditions, see Horst (1986), Horst
and Tuy (1987), Pinter (1986a, 1988). Convergence is ensured by the fact that the lower
bounding procedure is asymptotically accurate, i.e. L_i converges to m_i when the volume
of X_i approaches zero.
Of course, for too broad functional classes, finding a lower bound for the global
minimum is of similar difficulty as finding the global minimum itself. Thus it is possible
to construct efficient branch and bound methods only for sufficiently narrow functional
classes F: examples of such classes F are considered below.
There are many variants of the technique under consideration. As noted above, e.g., the
majority of covering methods may be regarded as branch and bound procedures, see
Horst and Tuy (1987). The same is true for the one-dimensional partition algorithms
described in Section 2.4.1. Here we shall deal with other algorithms.
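For a one-dimensional Lipschitzian class, the branching scheme just described can be sketched as follows. The lower bound L_i = (f(a)+f(b))/2 − L(b−a)/2 is the standard Lipschitzian bound on a subinterval; the test function, Lipschitz constant and tolerance are illustrative assumptions.

```python
import heapq

def branch_and_bound(f, lip, lo, hi, tol=1e-4):
    """Minimize a function satisfying |f(x)-f(y)| <= lip*|x-y| on [lo, hi]:
    keep a priority queue of subintervals ordered by the lower bound
    L_i = (f(a)+f(b))/2 - lip*(b-a)/2; discard subintervals whose bound
    exceeds the record, branch the rest at the midpoint."""
    fa, fb = f(lo), f(hi)
    best = min(fa, fb)
    heap = [((fa + fb) / 2 - lip * (hi - lo) / 2, lo, hi, fa, fb)]
    while heap:
        bound, a, b, fa, fb = heapq.heappop(heap)
        if bound > best - tol:          # cannot contain a better minimizer
            continue
        m = (a + b) / 2                 # branching step
        fm = f(m)
        best = min(best, fm)
        for u, v, fu, fv in ((a, m, fa, fm), (m, b, fm, fb)):
            lb = (fu + fv) / 2 - lip * (v - u) / 2
            if lb < best - tol:
                heapq.heappush(heap, (lb, u, v, fu, fv))
    return best

# illustrative bimodal function; global minimum value is -1 at x = 0.7
f = lambda x: min((x - 0.2) ** 2 - 0.5, 5 * abs(x - 0.7) - 1)
print(branch_and_bound(f, lip=6.0, lo=0.0, hi=1.0))  # close to -1
```

The bound is asymptotically accurate in the sense of the text: its error is at most lip·(b−a)/2, which vanishes with the subinterval length, and this is what drives convergence.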
First let us follow McCormick (1983) and assume that X is a hyperrectangle

f^j(x) = f^p(x) + f^q(x)  for some p, q < j,     (2.3.15)

f^j(x) = f^p(x) f^q(x)  for some p, q < j,     (2.3.16)

or

f^j(x) = φ(f^p(x))  for some p < j,     (2.3.17)

where φ belongs to a given class Φ of sufficiently simple functions φ: R → R (e.g.
φ(t) = t^α, φ(t) = e^t, φ(t) = sin t, etc.).
It is easy to verify that the above factorization is a natural way of representing
functions which are given in an explicit algebraic form.
Let the subsets X_i be hyperrectangles

ℓ_i^j(x) ≤ f^j(x) ≤ u_i^j(x)  for all x ∈ X_i     (2.3.18)

and computing

L_i = inf_{x∈X_i} ℓ_i^j(x).
One may use different ways of constructing convex functions ℓ_i^j(x) and concave
functions u_i^j(x) with the property (2.3.18).
The simplest interval arithmetic methods described later use constant functions
ℓ_i^j(x) = L_i^j and u_i^j(x) = U_i^j. The opposite approach is to find the best possible bounding
functions, i.e. taking ℓ_i^j(x) as the convex lower envelope of f^j(x) on X_i and u_i^j(x)
as its concave upper envelope. For functions (2.3.14) we have

ℓ_i^j(x) = u_i^j(x) = f^j(x),  x ∈ X_i.
Let (2.3.16) hold, and let L_i^p, L_i^q and U_i^p, U_i^q be lower and upper bounds for f^p(x) and f^q(x)
over X_i, respectively. If L_i^p ≤ 0, L_i^q ≤ 0 and U_i^p ≥ 0, U_i^q ≥ 0, then we may take

ℓ_i^j(x) = max{ U_i^q ℓ_i^p(x) + U_i^p ℓ_i^q(x) − U_i^p U_i^q,  L_i^q ℓ_i^p(x) + L_i^p ℓ_i^q(x) − L_i^p L_i^q },

u_i^j(x) = min{ L_i^p u_i^q(x) + U_i^q u_i^p(x) − L_i^p U_i^q,  L_i^q u_i^p(x) + U_i^p u_i^q(x) − L_i^q U_i^p }.
It has been mentioned that constant functions ℓ_i^j(x) and u_i^j(x) satisfying (2.3.18) generate
so-called interval methods. Let us consider these methods, which are of considerable
theoretical (and of increasing practical) importance.
Interval methods are aimed at finding the global extremum of a twice differentiable
rational objective function f defined on a hyperrectangle X and having a gradient ∇f and
a Hessian ∇²f with only a finite number of (isolated) zeros. Their essence is the evaluation
of the images f(Z), ∇f(Z), and ∇²f(Z) for hyperrectangles Z ⊂ X with the purpose of
excluding those which cannot contain extremal points. Their main drawback seems to be
the relatively restricted class of optimization problems which can be solved by these
methods, and their substantial computational demand.
Let us introduce some notions which are necessary for describing interval methods.
Let Z_i ⊂ X, i=1,2, be intervals Z_i = [a_i, b_i]. We shall call them interval variables and define
the interval arithmetic operations by

Using these formulas, one may evaluate the interval extension of a rational function f, i.e.
the image
0 ∈ f(Z_1)) one may use the interval version of the Newton method. If 0 ∉ f″(Z_1), then
the interval Newton step has the form

where z_j is the midpoint of Z_j. According to Hansen (1979), if 0 ∉ f′(Z_1), then the set Z_j
is empty for some j, and if 0 ∈ f′(Z_1), then the point sequence {z_j} converges to a
stationary point of f quadratically.
In the multidimensional case, the interval methods have the same form, but their
practical realization is complicated because of the necessity to store, choose and divide a
great number of subrectangles of X.
A detailed description of interval global optimization methods may be found in Hansen
(1979, 1980, 1984), Ratschek (1985). Mancini and McCormick (1976) described interval
methods for minimizing a convex function. Shen and Zhu (1987) suggest an interval
version of the one-dimensional Piyavskii-Shubert algorithm (2.2.30). On the whole,
interval methods represent a set of promising global optimization approaches, but the class
of extremal problems which may be efficiently solved by them is naturally restricted by
their analytical requirements.
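The elementary interval operations and one interval Newton step can be sketched as follows. For simplicity the Newton step is applied here to enclosing a zero of a function g (for stationary points of f one would take g = f′); the example function and starting box are illustrative assumptions.

```python
class Interval:
    """Closed interval [a, b] with the usual interval-arithmetic operations."""
    def __init__(self, a, b):
        self.a, self.b = min(a, b), max(a, b)

    def __add__(self, o):
        return Interval(self.a + o.a, self.b + o.b)

    def __sub__(self, o):
        return Interval(self.a - o.b, self.b - o.a)

    def __mul__(self, o):
        p = (self.a * o.a, self.a * o.b, self.b * o.a, self.b * o.b)
        return Interval(min(p), max(p))

    def __contains__(self, t):
        return self.a <= t <= self.b

def f_ext(z):
    """Interval extension of the rational function f(x) = x*x - 2x: evaluating
    f with interval operands encloses (usually over-estimates) the image f(Z)."""
    return z * z - Interval(2, 2) * z

def newton_step(g, dG, z):
    """One interval Newton step for a zero of g in Z: N(Z) = m - g(m)/dG(Z)
    intersected with Z; valid only when 0 is not in dG(Z) and a zero lies in Z."""
    m = 0.5 * (z.a + z.b)
    d = dG(z)
    assert 0 not in d, "simple interval Newton step needs 0 outside dG(Z)"
    q = Interval(g(m) / d.a, g(m) / d.b)
    n = Interval(m - q.b, m - q.a)
    return Interval(max(z.a, n.a), min(z.b, n.b))

img = f_ext(Interval(1, 2))          # encloses f([1,2]) = [-1, 0]
g = lambda x: x * x - 2              # zero at sqrt(2)
dG = lambda z: Interval(2, 2) * z    # extension of g'(x) = 2x
box = Interval(1, 2)
for _ in range(3):
    box = newton_step(g, dG, box)
print(box.a, box.b)                  # tight enclosure of 1.41421...
```

The example also shows the characteristic over-estimation of interval extensions (img is wider than the true image [−1, 0]), one source of the computational demand mentioned above.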
Another class of problems, in which branch and bound methods are used
advantageously, consists of concave minimization problems under convex constraints (for
details and references see Pardalos and Rosen (1986, 1987)). Following Rosen (1983),
consider the special case where
(2.3.20)
Lower and upper bounds of f* are needed. To compute an upper bound at the beginning,
f is maximized over X. This gives a point x_0 ∈ X. Then the n eigenvectors e_1,...,e_n of
the Hessian at x_0 are determined. To move as far away as possible from x_0, one solves 2n
linear programming problems and finds vectors

where w_i = e_i, w_{n+i} = −e_i for i=1,...,n. An upper bound for f* is U = min{f(v_1),...,f(v_{2n})},

whose intersection is a hyperrectangle X_0 containing X. A lower bound L for f* is the
minimum of f over X_0, which is attained at one of the 2^n vertices of X_0.
We also construct a hyperrectangle inscribed into the ellipsoid
of the form
where the constants d_i can easily be computed. Now, it can be seen that x* cannot be
contained in the interior of this hyperrectangle, and the intersection of its exterior with X
defines an appropriate family of subsets in which x* is looked for.
Note that the branch and bound technique has been used also for some more general
problems than (2.3.19) - (2.3.20), including the case in which f is the difference of two
convex functions and X is a convex set.
Let us also note that Beale and Forrest (1978) applied the branch and bound technique
to minimizing one-dimensional functions of the type

f(x) = Σ_{i=1}^{m} f_i(x)

where the f_i (i=1,...,m) are twice continuously differentiable, the values f_i(x), f_i′(x), f_i″(x)
can be calculated for any point x ∈ X, and the set X can be a priori divided into subsets
where all second derivatives are monotone.
Finally it should be pointed out that Chapter 4 presents a generalized branch and
bound principle for the case when estimators for the lower bounds (2.3.13) are valid only
with a large probability: this generalization will be called the branch and probability bound
principle.
2.4 An approach based on stochastic and axiomatic models of the
objective function
In the previous sections, many deterministic models of the objective function were
considered. Other classes of models are also used for the description of multiextremal
functions and the construction of global optimization algorithms: the best known of these is
the class of stochastic models, which uses a set of realizations of a random function as F.
p(u, μ, V) = ((det V)^{1/2} / (2π)^{k/2}) exp{ −(1/2)(u − μ)′ V (u − μ) }

where

u ∈ R^k,   μ = (μ(x_1),...,μ(x_k))′,   μ(x) = Eφ(x),

R = ||η(x_i, x_j)||_{i,j=1}^{k},   η(x,z) = E(φ(x) − μ(x))(φ(z) − μ(z)),   V = R^{−1}.
The class of realizations of the classical Wiener process determines the most popular
stochastic model of one-dimensional functions f(x), x ∈ X = [0,1]. It is characterized by the
functions
In the case of Gaussian random functions, the marginal distributions conditioned by any
number of calculated values of f are still Gaussian and can be computed in the following
way. Let Y_1 = (y_1,...,y_k)′ be the vector of values y_i = f(x_i) (i=1,...,k), and
Y_2 = (φ(z_1),...,φ(z_m))′ be a Gaussian random vector of unknown values of φ at points
z_1,...,z_m in X, conditioned by the evaluations Y_1. Set
cov(Y_1, Y_2) = R = ( R_11  R_12
                      R_21  R_22 ),

where V_11 and R_11 are of order k×k, and V_22 and R_22 are of order m×m. Then
(2.4.2)
(2.4.3)
Formulas (2.4.2) and (2.4.3) are usually applied in the case m=1. For some particular
cases of the covariance function η(x,z), they are not very complicated. For example, if m=1,
μ(x) = 0, and

Σ_{i=1}^{k} 1/(1 + ||z_1 − x_i||²)     (2.4.5)
Let us use the notations of the beginning of Section 2.2.3, but apply the Bayesian
(statistical) approach for defining the accuracy of a method, replacing the minimax
approach based on (2.2.35).
The accuracy of an N-point method d_N = (d_1,...,d_{N+1}) can be defined in various
statistically meaningful ways. For instance, the algorithm defined by

is called optimal with respect to the expected value (E-optimal), and the algorithm

(2.4.8)

where L_k denotes the conditions φ(x_i) = y_i for i=1,...,k. Furthermore, the one-step
analogue of the P-optimal algorithm (2.4.7) is
Let X = [a,b] and let φ(x,ω) be the Wiener process with mean and covariance function defined
by (2.4.1), where μ and σ² may be unknown. It is an acceptable model for the global
If σ² is unknown, then every algorithm has to start with its estimation. To this end, it
is usually recommendable to evaluate f at m equidistant points x_i = a + (b−a)(i−1)/(m−1),
1 ≤ i ≤ m, and to estimate σ² by the maximum likelihood estimator
(2.4.10)
(2.4.11)
for each x ∈ Δ_i = [x_i, x_{i+1}], i=1,...,k−1. Moreover, the expectation of the minimum value

φ_i = min_{x∈Δ_i} φ(x)

in an interval Δ_i conditioned on L_k is
E(φ_i | L_k) = … exp{ −(y_{i+1} − y_i)² / (2σ²(x_{i+1} − x_i)) } ∫_{−∞}^{…} p(t, −|y_{i+1} − y_i|, σ²(x_{i+1} − x_i)) dt     (2.4.12)
where

p(t, a, σ²) = (2πσ²)^{−1/2} exp{ −(t − a)²/(2σ²) }

is the Gaussian density. By (2.4.12) one can compute the posterior mean of φ_i for each
interval Δ_i and select the interval
Δ_j = arg min_i E(φ_i | L_k).
The next point x_{k+1} can be chosen in the interval Δ_j in different ways. The simplest one is
x_{k+1} = (x_j + x_{j+1})/2, i.e. x_{k+1} is the centre of Δ_j. If we want to confine ourselves to
Bayesian techniques, then it is natural to select x_{k+1} as the expected location of the
minimum φ_j, but the corresponding formulas are rather complicated. (Note that this
approach was followed by Boender (1984), where favourable numerical test results are
also given.)
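Under the Wiener model, the conditional distribution of φ(x) inside Δ_i is Gaussian with mean the linear interpolant of y_i, y_{i+1} and variance σ²(x − x_i)(x_{i+1} − x)/(x_{i+1} − x_i); these are standard facts about the Wiener process, used here as assumptions since (2.4.10)-(2.4.11) are not reproduced above. A midpoint-based sketch of selecting the most promising interval by improvement probability (the data and parameters are illustrative):

```python
import math

def norm_cdf(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def p_algorithm_step(xs, ys, sigma2, eps):
    """Given evaluations y_i = f(x_i) at sorted points xs under the Wiener
    model, score each interval Delta_i at its midpoint, where the conditional
    law is Gaussian with mean (y_i + y_{i+1})/2 and variance
    sigma2*(x_{i+1}-x_i)/4, and return the midpoint of the interval with the
    largest probability of improving the record by more than eps."""
    f_best = min(ys)
    best_p, best_x = -1.0, None
    for (x0, y0), (x1, y1) in zip(zip(xs, ys), zip(xs[1:], ys[1:])):
        m = 0.5 * (x0 + x1)
        mean = 0.5 * (y0 + y1)
        var = sigma2 * (x1 - x0) / 4.0
        p = norm_cdf((f_best - eps - mean) / math.sqrt(var))
        if p > best_p:
            best_p, best_x = p, m
    return best_x

xs = [0.0, 0.25, 0.5, 0.75, 1.0]
ys = [1.0, 0.4, 0.9, 0.2, 0.6]
print(p_algorithm_step(xs, ys, sigma2=1.0, eps=0.05))  # 0.875
```

This midpoint scoring is a simplification of the exact one-step P-optimal rule (2.4.9): the true conditional minimum of an interval involves the integral (2.4.12), not a single Gaussian evaluation.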
The one-step P-optimal algorithms (2.4.9) are determined in an easier way:

R(i) = [ (f_k* − ε_k)(x_{i+1} − x_i) − y_{i+1}(a_i − x_i) − y_i(x_{i+1} − a_i) ] / [ σ ((a_i − x_i)(x_{i+1} − a_i)(x_{i+1} − x_i))^{1/2} ],
The efficiency of algorithm (2.4.13) depends to a considerable degree on the choice of ε_k.
Zilinskas (1981) proposed to choose
As a stopping rule for the above algorithms, one may choose the following: reject the
subintervals Δ_i = [x_i, x_{i+1}] for which the probability of finding a function value in Δ_i better than
the current optimum estimate f_k*, i.e.

is sufficiently small (not greater than a given number ε_0 > 0), and terminate the algorithm if
all subintervals except the one corresponding to f_k* are rejected. Note that if this stopping
rule is applied, then the algorithm can be regarded as belonging also to the class of branch and
probability bound methods described in Section 4.3. Besides, the algorithms of this
subsection can be incorporated into the general scheme of one-dimensional global
optimization algorithms, represented in the form of Algorithm 2.3.1.
The stochastic function models described above can be viewed as special cases of a more
general axiomatic approach. According to this, the uncertainty about the values f(x) for
x ∈ X\Ξ_k is assumed to be representable by a binary relation ≥_x, where (t,t') ≥_x (τ,τ')
symbolizes that the event { f(x) ∈ (t,t') } is at least as likely as the event { f(x) ∈ (τ,τ') }.
Under some reasonable assumptions on this binary relation (e.g., transitivity and
completeness), there exists a unique density function p_x that satisfies the following
condition: for every pair (A,A') of countable unions of intervals one has A ≥_x A' if and
only if

∫_A p_x(t) dt ≥ ∫_{A'} p_x(t) dt.
For the special case when all densities are Gaussian and hence are characterized by their
means μ(x) and variances σ²(x), one can suppose that the preference relation ≥_x is
defined on the set of estimators of μ(x) and σ²(x).
Subject again to some reasonable assumptions about this preference, the result is that
the unique rational choice for the next point of evaluation of f is the one for which the
probability of finding a function value smaller than f_k* − ε_k is maximal. (This result
justifies the one-step P-optimal algorithms (2.4.9).) In the case of the one-dimensional
Wiener process, (2.4.9) together with (2.4.10) and (2.4.11) leads to (2.4.13). In the case
of higher dimension, analogues of (2.4.10) and (2.4.11) are not valid, but some
approximations for μ(x | L_k) and σ²(x | L_k) can be axiomatically justified, e.g.
μ_k(x | L_k) = Σ_{i=1}^{k} y_i w_i(x, L_k),

σ_k²(x | L_k) = c_k Σ_{i=1}^{k} ||x − x_i|| w_i(x, L_k),

where c_k is a normalizing constant and the weights w_i(x, L_k) have some natural
properties, see Zilinskas (1982, 1986).
The information-statistical approach is similar to the Bayesian one described above and
was mainly developed by Strongin (1978). Its essence is the following: the feasible region
X is discretized, i.e. a finite point collection Ξ_N = {x_1,...,x_N} is substituted for X, and the
N-vector F = (f(x_1),...,f(x_N)) approximates the objective function f. Thus R^N is
substituted for the functional set F. Setting prior information about f consists in setting up
a prior probability density φ(F) on R^N, which must be successively transformed into a
posterior density after evaluating f. The points at which to evaluate f can be determined,
for instance, as the maximum likelihood estimators for x*. This idea leads to extremely
cumbersome algorithms which are practically unmanageable in the multidimensional
case. In the one-dimensional case, however, a slight modification of this idea led Strongin
(1978) to the construction of the algorithm (2.3.1) - (2.3.4).
PART 2. GLOBAL RANDOM SEARCH
The present chapter contains three sections. Section 3.1 describes and studies the simplest
global random search algorithms, outlines ways of constructing more efficient
algorithms, presents a general scheme of global random search algorithms, and discusses
the connection between local optimization and global random search. Section 3.2 proves
some general results on convergence. Section 3.3 is devoted to Markovian algorithms,
which have been thoroughly investigated theoretically in the literature.
(3.1.1)

4. If k=N, then terminate the algorithm; choose the point x_k* with f(x_k*) = f_k* as an
approximation of x* = arg min f. If k<N, then return to Step 2 (substituting k+1 for k).
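The steps above can be sketched in a few lines; the cube X = [0,1]^n, the test function and the sample size are illustrative assumptions.

```python
import random

def pure_random_search(f, n, N, rng=random.Random(0)):
    """Algorithm 3.1.1 (crude / pure random search) on X = [0,1]^n:
    sample N independent uniform points and keep the record value f_k*
    together with its location x_k*."""
    best_x, best_f = None, float("inf")
    for _ in range(N):
        x = [rng.random() for _ in range(n)]
        fx = f(x)
        if fx < best_f:                 # update the record
            best_x, best_f = x, fx
    return best_x, best_f

# illustrative objective with minimizer at (0.5, 0.5)
f = lambda x: sum((xi - 0.5) ** 2 for xi in x)
x_star, f_star = pure_random_search(f, n=2, N=2000)
print(f_star)   # small; x_star lies near (0.5, 0.5)
```

The algorithm uses no information from the evaluated values except the record, which is exactly why its convergence rate, analysed next, is slow.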
Algorithm 3.1.1 has several different names, viz. crude search, pure random search,
random bombardment, Monte Carlo method, etc. It utilizes the simplest stopping rule
78 Chapter 3
(3.1.2)
where μ_n is the Lebesgue measure, and (3.1.2) becomes an equality if {x ∈ R^n: ||x − x*|| ≤ ε} ⊂ X,
i.e. if the distance from x* to the boundary of X is not less than ε. Using
(3.1.2) we obtain for all ε>0, k=1,2,...
(3.1.3)

≤ 1 − [1 − π^{n/2} ε^n / (μ_n(X) Γ(n/2 + 1))]^k → 1,  k → ∞.     (3.1.4)
(3.1.5)
where τ_A is the moment of first hitting of the search sequence x_1, x_2,... into a set A ⊂ X.
These formulas estimate the rate of convergence of Algorithm 3.1.1 with respect to values
Main Concepts and Approaches 79
of the argument. The rate of convergence with respect to function values is estimated
analogously:
(3.1.7)
Note that to simplify calculations in (3.1.4) and (3.1.6) one can use the approximation
(1 − p)^k ≈ e^{−kp} (valid for p ≈ 0).
Although Algorithm 3.1.1 converges in various senses, this convergence is slow and
greatly depends on the dimension n of the set X. Let us calculate how many evaluations N
of f one has to perform in order to reach a probability not less than 1 − γ (where γ>0 is a
small number) of hitting B(ε). Supposing that equality holds in (3.1.2) (which certainly
holds for x* ∈ int X and sufficiently small ε), compare the right-hand side of (3.1.4) with
1 − γ and solve the obtained equation with respect to k=N. We get

Let us take μ_n(X)=1, γ=0.1, ε=0.1 and consider in Table 2 the dependence N=N(n).
N | 11 | 73 | 549 | 4666 | 43744 | 4.5×10^5 | 5.7×10^7 | 9×10^9 | 9×10^21 | 10^140
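The entries of Table 2 can be recomputed directly from the ball-volume expression appearing in (3.1.4), solving (1 − p)^N = γ for N:

```python
import math

def crude_search_N(n, gamma=0.1, eps=0.1, vol_X=1.0):
    """Number of uniform random points needed to hit the ball B(eps) around
    x* with probability at least 1 - gamma (x* assumed interior, so equality
    holds in (3.1.2)): solve (1 - p)^N = gamma with
    p = pi^{n/2} eps^n / (vol_X * Gamma(n/2 + 1))."""
    p = math.pi ** (n / 2) * eps ** n / (vol_X * math.gamma(n / 2 + 1))
    return math.ceil(math.log(gamma) / math.log(1.0 - p))

for n in (1, 2, 3):
    print(n, crude_search_N(n))   # 11, 73, 549, as in Table 2
```

The exponential growth of N with n is exactly the curse of dimensionality that motivates the modifications discussed below.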
Some related recent results of Deheuvels (1983) and Janson (1986, 1987) concerning
multivariate maximal spacings are of interest in the context of global random search
theory. We shall present the principal results below.
Let μ_n(X)=1, let A be a cube or a ball in R^n of volume μ_n(A)=1, and let Ξ_N = {x_1,...,x_N} be an
independent sample from the uniform distribution on X. Set Δ_N = sup{δ: there exists
x ∈ R^n such that x + δA ⊂ X\Ξ_N} and define the maximal spacing as V_N = (Δ_N)^n, i.e. as
the volume of the largest ball (or cube of a fixed orientation) that is contained in X and
avoids all N points of Ξ_N. The result of Janson (1987) states that
lim_{N→∞} (N V_N − log N) / (log log N) = n − 1     (3.1.9)
converges in distribution to a random variable with c.d.f. exp{−e^{−t−β}}, where β = 0 if A is a
cube and
lim_{N→∞} (1 − F_{N,n}(t)) / (N^n (1 − t)^N) = c_n(t)     (3.1.10)

holds for n=2 and each t ∈ (0,1), where c_n(t) is a constant depending on n and t. It is
widely known that for the univariate case (i.e. for X=A=[0,1]) the relation (3.1.10) holds
with c_n(t) = 1.
Some further investigation of the properties of the uniform random search algorithm
can be found in Anderson and Bloomfield (1975), Yakowitz and Fisher (1975).
The uniform random search algorithm finds major applications in global random
search theory as a pattern in theoretical and numerical comparison of algorithms and also
as a component of many global random search algorithms. It is used also for investigating
diverse procedures of mathematical statistics.
The slow convergence rate of Algorithm 3.1.1 has served as a reason for creating a
great number of generalizations and modifications discussed below.
E τ_{B(ε)} = 1/P(B(ε)),

N = [log γ / log(1 − P(B(ε)))]⁺ ≈ −(log γ)/P(B(ε)).
The practical significance of Algorithm 3.1.2 is connected mainly with the fact that it may
be used as a component in more complicated algorithms, see later Section 3.1.5. Besides,
if the independence condition for {xk} and the identity of their distributions are weakened,
then more practical algorithms can be constructed: such methods will be described in
Sections 3.3.2, 3.3.5.
The simple idea of covering X by balls with centres at randomly generated points is
close to that discussed in Section 2.2. It can be properly realized if, e.g., Lipschitzian
information about the objective function is available. Section 3.1.4 describes the
corresponding algorithms.
A simple manner of including adaptive elements into the global random search
technique consists of determining a distribution for x_{k+1} depending on the previous
point x_k and the objective function value f(x_k). The corresponding algorithms are called
Markovian (since the points x_1, x_2,... generated by them form a Markov chain) and will
be studied in Section 3.3. Their theoretical properties are intensively studied nowadays,
but the prospects of their practical usefulness are still not quite clear.
An important way of improving efficiency in global random search algorithms is
connected with the inclusion of local descent techniques, see later Section 3.1.6. A simple
algorithm of this kind is the well-known random multistart, consisting of multiple local
descents from uniformly chosen random points. It will be theoretically studied in Section
4.5. Its theoretical efficiency is rather low, but some of its heuristic modifications
described in Sections 2.1.3 and 3.1.6 can be regarded as fairly efficient for complicated
optimization problems.
Another important means of constructing efficient global random search algorithms
consists of using mathematical statistics procedures for deriving information about the
objective function. Such information can serve, in particular, to check the obtained
accuracy for many algorithms and to determine corresponding stopping rules. Chapters 4
and 7 are devoted to various problems connected with the construction of statistical
inference procedures and their application in global random search.
A further direction for improving global random search efficiency is to reduce the share
of randomness. For instance, Section 2.2.1 shows that the method consisting of
evaluating f at quasirandom points is more efficient than Algorithm 3.1.1. However,
nonrandom points are in some sense worse than random ones, for the following
reasons: (i) generally, statistical inferences for deriving information about f cannot be
drawn if the points are not random; (ii) if the structure of X is not simple enough, then
the problem of constructing nonrandom grids is usually much more complicated than that
for random grids.
A stratified sample may be regarded as intermediate between random and
quasirandom ones. To construct such a sample of size N = mℓ, the set X is divided into m
subsets of equal volume and the uniform distribution is sampled ℓ times in each subset.
Already Brooks (1958) pointed out some advantages of stratified sampling as a substitute
for pure random search. But only recently were the gains caused by this substitution
correctly investigated. (Section 4.4 contains these results as well as suitable statistical
inferences.)
Many global random search algorithms are based on the idea of more frequent
selection of new points in the vicinity of good points, i.e. those earlier obtained ones in
which the values of f are relatively small. Corresponding methods (which go under the
name of methods of generations) will be considered in Chapter 5. Note that many methods
of generations can be used also for the case when a random noise is present in the
evaluations of f .
The approaches mentioned do not cover completely the variety of global random
search methods. Some of them have been described above (in particular, Algorithm 2.3.1
and the method of Section 2.1.6 based on smoothing the objective function). Many others
can be easily constructed by the reader: to do this, it suffices to introduce a random element
into any deterministic algorithm. (This method was used e.g. by Lbov (1972), who
proposed to seek the minimum of the minorant (2.2.30) of the objective function by a
random sampling algorithm, and thus transformed the polygonal line method of Section
2.2.2 into a global random search algorithm.)
for each k ≤ η^n instead of (3.1.4), where c = η^{−n} is the volume of each η-adic cube S(x_i).
Algorithm 3.1.3 does not use the information which is contained in the values of f, and
so its efficiency cannot be high. The following method of Devroye (1978), constructed
under the supposition f ∈ Lip(X,L,ρ), is of higher efficiency.
(3.1.11)
Z_{k+1} = X \ ∪_{i=1}^{k} B(x_i, η_i, ρ)

instead of (3.1.11), where δ_k > 0, δ_k → 0 for k → ∞, then the algorithm converges almost
surely with respect to the values of f. Closely related general convergence investigations
can be found e.g. in Solis and Wets (1981) or Pinter (1984).
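A sketch in the spirit of this covering idea: for f ∈ Lip(X,L) with the Euclidean metric, no point of the open ball of radius (f(x_i) − f* + δ)/L around an evaluated point x_i can improve the record by more than δ, so such balls may be excluded from further sampling. The rejection-sampling realization and all constants below are illustrative assumptions, not Devroye's exact algorithm.

```python
import math
import random

def exclusion_search(f, n, L, N, delta=0.0, rng=random.Random(1)):
    """Random search on X = [0,1]^n for f in Lip(X, L, Euclidean): a point x
    with ||x - x_i|| < (f(x_i) - f* + delta)/L satisfies f(x) > f* - delta,
    so the ball around x_i is excluded; sampling from the remaining set
    Z_{k+1} is realized by rejection of uniform candidates."""
    pts, vals, best = [], [], float("inf")
    trials = 0
    while len(pts) < N and trials < 100 * N:
        trials += 1
        x = [rng.random() for _ in range(n)]
        if any(math.dist(x, p) < (v - best + delta) / L
               for p, v in zip(pts, vals)):
            continue                      # x lies in an excluded ball
        fx = f(x)
        pts.append(x)
        vals.append(fx)
        best = min(best, fx)
    return best

# illustrative objective, Lipschitz with L = 2 in the Euclidean metric
f = lambda x: abs(x[0] - 0.3) + abs(x[1] - 0.6)
print(exclusion_search(f, n=2, L=2.0, N=300))  # small
```

Note that a candidate better than the current record is never rejected: by the Lipschitz bound, every point inside an excluded ball has a value above the record, so the exclusions discard only provably nonpromising regions.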
We shall describe here a formal scheme of global random search algorithms that may be
useful in investigating some of their general properties.
Algorithm 3.1.5.
(3.1.12)
(3.1.13)
where φ_k is a distribution density in R^n. In order to obtain a random realization x_k in X
from the distribution (3.1.13), one needs to obtain a realization ξ_k in R^n from the density
φ_k, to check the relation z + ξ_k ∈ X (if it does not hold, then a new realization ξ_k is
needed), and take x_k = z + ξ_k. The choice (3.1.13) is natural only in the case when a
random noise is present in the evaluations of f. Similarly, the transition probabilities are
often selected in the form
(3.1.14)
where T_k(z,dx) is a Markovian transition probability having the form of (3.1.13). Given a
fixed z, in order to obtain a realization x_k from the distribution (3.1.14) one needs first to
get a realization ξ_k from the distribution T_k(z,·) and put

x_k = ξ_k  if f(ξ_k) ≤ f(z),
x_k = z    otherwise.
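The acceptance rule just displayed yields a minimal Markovian algorithm. In the sketch below the Gaussian proposal density and its scale are illustrative choices for T_k(z,·), as are the test function and iteration count:

```python
import random

def markov_search(f, n, steps, beta=0.1, rng=random.Random(2)):
    """Markovian algorithm with transition rule (3.1.14): sample a proposal
    xi_k from a density centred at the current point z (here Gaussian with
    scale beta), keep it only if it lies in X = [0,1]^n and
    f(xi_k) <= f(z); otherwise stay at z."""
    z = [rng.random() for _ in range(n)]
    fz = f(z)
    for _ in range(steps):
        xi = [zi + beta * rng.gauss(0.0, 1.0) for zi in z]
        if not all(0.0 <= c <= 1.0 for c in xi):
            continue                      # proposal left X, draw again
        fxi = f(xi)
        if fxi <= fz:                     # the acceptance rule above
            z, fz = xi, fxi
    return z, fz

# illustrative unimodal objective
f = lambda x: (x[0] - 0.4) ** 2 + (x[1] - 0.8) ** 2
z, fz = markov_search(f, n=2, steps=3000)
print(fz)   # small: the chain drifts toward (0.4, 0.8)
```

With this monotone acceptance the chain performs a local random descent; the general scheme of Section 3.3 allows occasional non-improving moves, which is what gives Markovian algorithms their global convergence properties.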
Attempts to determine the details of the objective function behaviour all over the set X by
means of a small number of evaluations cannot be successful; only in the simplest cases
can rough approximations of f be used for determining the points where many evaluations of f
should be carried out for an expected deep local descent.
On the other hand, if there are good reasons to believe that some points x_j^(k)
(j=1,...,N_k) of the k-th iteration of Algorithm 3.1.5 are not very far from a global
optimizer, then local descent may be profitable. The depth of the descent is defined by the
transition probabilities Q_k(z,·): for a fixed realization z of R_k, the algorithm of choosing
the point x_j^(k+1) consists in carrying out several (possibly one) local descent iterations
defined by the method of Q_k(z,·) sampling. If Q_k has the form of (3.1.13), then the local
descent is not performed and x_j^(k+1) is chosen in a vicinity of z; if Q_k has the form of
(3.1.14), which is possible only if there is no evaluation noise, then a single local random
search iteration is done (with return for unprofitable steps). One can directly see how the
method of sampling Q_k should be defined in order to correspond to one or several
iterations of other local random search algorithms (they are thoroughly described
e.g. in Rastrigin (1968)), of any deterministic local descent (in the latter case the
distributions Q_k(z,·) are degenerate for each z), or even of stochastic approximation type
algorithms (in the case of evaluation noise).
It is also evident how to make the number of local iterations from z proportional
to the evaluated value f(z). The extremal case, where sampling Q_k(z,·) corresponds to
the transition from the point z to a local minimizer (to whose domain of attraction z belongs),
is unlikely to be acceptable, except in very simple situations. Of course, optimization problems
occur where evaluation of the derivatives of f is rather simple; in this case, local
descent iterations may prove useful already at the first iterations. The corresponding
algorithms can be regarded as modifications of the random multistart method, see Sections
2.1.3 and 4.5.
It is worth noting that there is no need to know the analytical form of the transition
probabilities Q_k(z,·), or of the distribution R_k: one needs only an algorithm for their
sampling, i.e. for passing from z to a point of the (k+1)-st iteration. The efficiency of a
global search algorithm, hence, may be improved if the complexity of the algorithm for
sampling Q_k(z,·) is increased with k (including also additional evaluations of f). In doing
so, it seems natural to take smaller N_k for greater values of k than for small ones. The
quantities N_k and the iteration indices at which the sampling algorithms for Q_k(z,·) become
more complicated can be defined in advance (using prior knowledge about the
objective function behaviour and the accuracy of the resulting approximations) as well as in
the course of the search (using the obtained information concerning f).
88 Chapter 3
∑_{k=1}^∞ q_k = ∞.   (3.2.1)
Then for any δ > 0 the sequence of random vectors x₁, x₂, ... generated by Algorithm 3.1.5 with N_k = 1 (k = 1, 2, ...) falls infinitely often into the set A(δ) with probability one.
Proof. Fix δ > 0 and find ε > 0 such that B(ε) ⊂ A(δ). Define a sequence of independent random variables {η_k} on the two-point set {0, 1} so that, for the fixed ε > 0, Pr{η_k = 1} = 1 − Pr{η_k = 0} = q_k(x*, ε). Obviously, the probability of x_k falling into B(ε) is, for all k = 1, 2, ..., not less than the probability q_k of η_k being in state 1; therefore, the theorem's assertion will be proved if one demonstrates that the sequence η_k takes the value 1 infinitely often. Since the latter follows from (3.2.1) and the first part of Borel's zero-one law, the theorem is proved.
Theorem 3.2.1 is valid also in the general case, when the function f is subject to random noise. If noise is absent, then the conditions of the theorem ensure that the point sequence {x_k} converges to the set X* = {arg min f} of global minimizers with probability 1.
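The dichotomy underlying Theorem 3.2.1 and the Borel-Cantelli remark below can be illustrated numerically. In this sketch the auxiliary sequence {η_k} is simulated directly; the particular trial probabilities q_k are illustrative choices, not taken from the text:

```python
import random

def count_hits(q, n):
    """Count successes among n independent Bernoulli trials with
    Pr{eta_k = 1} = q(k), mimicking the auxiliary sequence {eta_k}."""
    return sum(random.random() < q(k) for k in range(1, n + 1))

# sum 1/k diverges: hits keep accumulating forever (about log n of them);
# sum 1/k**2 converges: with probability one only finitely many hits occur.
hits_divergent = count_hits(lambda k: 1.0 / k, 100_000)
hits_convergent = count_hits(lambda k: 1.0 / k ** 2, 100_000)
```

Typical runs show a slowly growing count in the first case and a small count, unchanged for ever larger n, in the second.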
By virtue of the Borel-Cantelli lemma, one can see that if

∑_{k=1}^∞ q_k < ∞,
Main Concepts and Approaches 89
then the points x₁, x₂, ... fall into B(ε) at most a finite number of times. Moreover, as can be illustrated by a suitable function on x ∈ X = [0,1] constructed with a sufficiently small positive number ε₀, condition (3.2.1) cannot be improved for a rather wide class of functions F (even for the class of analytic functions with two minima): if (3.2.1) is not satisfied, there exist f ∈ F and ε > 0 such that, with positive probability, none of the points x₁, x₂, ..., x_N falls into B(ε), where N is an arbitrarily large fixed number.
Since the location of x* is not known a priori, instead of (3.2.1) a stricter but simpler requirement should be satisfied for the corresponding algorithm to converge: either

liminf_{k→∞} α_k > 0   (3.2.3)

or the weaker requirement

∑_{k=1}^∞ α_k = ∞.

In the above cases, it is obviously sufficient that for each k = 1, 2, ... the density φ_k of (3.1.13) be representable as φ_k(x) = β_k^{−n} φ(x/β_k), where {β_k} is a nonincreasing sequence of positive numbers and the density φ(x) is symmetric, continuous, decreasing for x > 0, decomposes into a product of one-dimensional densities, and that the condition
∑_{k=1}^∞ ∫_{d₁}^{d₁+ε} ⋯ ∫_{d_n}^{d_n+ε} β_k^{−n} φ(x/β_k) dx = ∞   (3.2.4)
be satisfied, where d_i is the diameter of X with respect to the i-th coordinate and ε > 0 is arbitrary. It is, of course, difficult to check (3.2.4) in the general case, but it becomes rather simple for some particular choices of φ. Consider, e.g., the following methods of choosing φ:
φ(x) = ∏_{i=1}^n (λ_i/2) exp{−λ_i |x^(i)|},   (3.2.5)

φ(x) = ∏_{i=1}^n (λ_i/√π) exp{−λ_i² (x^(i))²},   (3.2.6)

where x = (x^(1), ..., x^(n)) and λ_i > 0 (i = 1, ..., n) are arbitrary constants. The coordinates of random vectors distributed with densities (3.2.5) and (3.2.6) are independent and follow the Laplace and Gaussian distributions, respectively.
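For illustration, a vector with density β_k^{−n} φ(x/β_k), with φ the product-Laplace density (3.2.5), can be sampled coordinatewise as a signed exponential variate; the schedule β_k = c₁/log(c₂ log k) of Lemma 3.2.1 below is included. This is a minimal sketch and the parameter values are illustrative assumptions:

```python
import math
import random

def sample_step(lambdas, beta):
    """Sample a vector with density beta**(-n) * phi(x/beta), where phi is
    the product of Laplace densities (lambda_i / 2) * exp(-lambda_i * |t|)."""
    # A Laplace variate is an exponential variate with a random sign;
    # dividing the rate by beta rescales the coordinate by beta.
    return [random.choice((-1.0, 1.0)) * random.expovariate(lam / beta)
            for lam in lambdas]

def beta_k(k, c1=1.0, c2=2.0):
    """Nonincreasing step sizes beta_k = c1 / log(c2 * log k), k >= 3."""
    return c1 / math.log(c2 * math.log(k))

step = sample_step([1.0, 2.0], beta_k(10))
```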
Lemma 3.2.1. Let φ be defined according to (3.2.5). Then (3.2.4) is satisfied if for any b > 0 the following relation holds:

∑_{k=1}^∞ b^{1/β_k} = ∞.   (3.2.7)

In particular, β_k (k ≥ 3) can be chosen as β_k = c₁/log(c₂ log k), where c₁ > 0, c₂ ≥ 1 are arbitrary.
Proof. Set λ = max{λ₁, ..., λ_n} and d = max{d₁, ..., d_n}. The condition (3.2.4) will be satisfied for the density (3.2.5) if it is satisfied for the density φ*(x) = ∏_{i=1}^n (λ/2) exp{−λ|x^(i)|}. For the latter density,

∑_{k=1}^∞ ∫_d^{d+ε} ⋯ ∫_d^{d+ε} β_k^{−n} φ*(x/β_k) dx = ∑_{k=1}^∞ 2^{−n} [(1 − e^{−λε/β_k}) e^{−λd/β_k}]^n = ∞.

Since the factors 1 − e^{−λε/β_k} are bounded away from zero for all k = 1, 2, ..., the latter condition holds if (3.2.7) is met with b = exp(−nλd). The lemma is proved.
Similar reasoning applies for (3.2.6), where instead of (3.2.7) we have

∑_{k=1}^∞ b^{1/β_k²} = ∞,

which is satisfied, in particular, by β_k = c₁/(log(c₂ log k))^{1/2} for c₁ > 0, c₂ ≥ 1, k ≥ 3.
Theorem 3.2.2. Let f be a bounded function on X attaining its global minimum at the unique point x*. Assume that f is continuous in a vicinity of x* and that for any ε > 0 there exists δ = δ(ε) > 0 such that

∑_k vrai inf P_{k+1}({x ∈ X: f(x) < f_k° − δ}) = ∞,   (3.2.8)

where the essential infimum is taken over the realizations with x_k ∈ A(ε) and f_k° denotes the record value of f obtained up to the k-th iteration.
Let y_k (k = 1, 2, ...) be the result of evaluating the objective function f at a point x_k (this evaluation may be subject to noise); further, let

P_{k+1}(·) = P_{k+1}(· | x₁, y₁, ..., x_k, y_k)   (3.3.1)

be the probability distribution of the point x_{k+1} generated by a global random search algorithm. The Markovian property of the algorithm means that for all k = 1, 2, ... the distributions P_{k+1} depend only on x_k and y_k, that is,

P_{k+1}(· | x₁, y₁, ..., x_k, y_k) = P_{k+1}(· | x_k, y_k).   (3.3.2)

If the evaluations of f are not subject to noise, then (3.3.2) takes the form P_{k+1}(· | U_k) = P_{k+1}(· | x_k, y_k), where U_k denotes the whole history (x₁, y₁, ..., x_k, y_k).
In essence, Markovian global random search algorithms are modifications of local ones, alternating local steps with global ones. If the evaluations of f are subject to noise, these algorithms are sometimes called multiextremal stochastic approximation algorithms, by analogy with the local case (see, for instance, Vaysbord and Yudin (1968)). Different variants of Markovian global random search algorithms have been proposed and studied by many authors, starting from the late 1960's. In spite of the abundance of works, many algorithms differ only in secondary details. Below we shall describe the principal algorithms, starting with a general scheme (Algorithm 3.3.1) for the case X ⊂ ℝⁿ.
1. By sampling a given distribution P₁, obtain a point x₁. Evaluate y₁, the value (possibly subject to noise) of the objective function f at the point x₁. Set k = 1.
2. Obtain a point z_k in ℝⁿ by sampling a distribution Q_k(x_k,·) with density q_k(x_k, x) depending on k and x_k.
3. If z_k ∉ X, return to Step 2. Otherwise evaluate η_k, the value of f at z_k (η_k may be subject to noise), and set

x_{k+1} = z_k with probability p_k,  x_{k+1} = x_k with probability 1 − p_k,   (3.3.3)
where p_k = p_k(x_k, z_k, y_k, η_k) is the acceptance probability, which may depend on k, x_k, z_k, y_k, η_k.
4. Set y_{k+1} = η_k if x_{k+1} = z_k, and y_{k+1} = y_k if x_{k+1} = x_k.
5. Check a given stopping criterion. If the algorithm does not stop, return to Step 2 (substituting k+1 for k).
An ordinary way of realizing (3.3.3) consists of calculating p_k, obtaining a random number α_k, checking the inequality α_k ≤ p_k, and setting

x_{k+1} = z_k if α_k ≤ p_k,  x_{k+1} = x_k otherwise.
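The five steps of the scheme can be condensed into a short routine; everything concrete in the demonstration call (the quadratic objective, Gaussian proposal, improvement-only acceptance, and the interval X = [−1, 1]) is an illustrative assumption:

```python
import random

def markov_search(f, sample_q, accept_prob, in_domain, x0, n_iters):
    """General Markovian scheme: propose z_k ~ Q_k(x_k, .), resample while
    z_k lies outside X, evaluate f, and accept with probability p_k."""
    x, y = x0, f(x0)                       # Step 1
    for k in range(1, n_iters + 1):
        z = sample_q(k, x)                 # Step 2
        while not in_domain(z):            # Step 3: z_k must lie in X
            z = sample_q(k, x)
        eta = f(z)
        if random.random() <= accept_prob(k, x, z, y, eta):
            x, y = z, eta                  # Step 4
    return x, y                            # Step 5: fixed-budget stopping

# Accepting only improvements yields a local random search on X = [-1, 1].
x_fin, y_fin = markov_search(
    f=lambda t: t * t,
    sample_q=lambda k, t: t + random.gauss(0.0, 0.2),
    accept_prob=lambda k, x, z, y, eta: 1.0 if eta <= y else 0.0,
    in_domain=lambda t: -1.0 <= t <= 1.0,
    x0=0.8, n_iters=200)
```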
Particular choices of the initial distribution P₁, the transition probabilities Q_k(x,·), and the acceptance probabilities p_k(x, z, y, η) lead to concrete Markovian global random search algorithms. To obtain convergence conditions, the results of Section 3.2 can be used; however, the simplicity of Markovian algorithms allows one to obtain more specific results on their convergence and convergence rate. These interesting theoretical results, together with the simplicity and plain interpretation of such algorithms, explain their popularity. One of the best known is simulated annealing, which is considered first.
In the simulated annealing method the acceptance probabilities are

p_k = min{1, exp(−β_k Δ_k)} = 1 if Δ_k ≤ 0,  exp(−β_k Δ_k) if Δ_k > 0,   (3.3.4)

where Δ_k = f(z_k) − f(x_k), so that x_{k+1} = z_k with probability p_k and x_{k+1} = x_k with probability 1 − p_k.
This means that a promising new point z_k (for which f(z_k) ≤ f(x_k)) is accepted unconditionally, but a non-promising one (for which f(z_k) > f(x_k)) may also be accepted, with probability p_k = exp{−β_k Δ_k}. Since the probability of accepting a point worse than the preceding one is always greater than zero, the search trajectory may leave a local and even a global minimizer. (Note that the probability of acceptance decreases as the difference Δ_k = f(z_k) − f(x_k) increases.)
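Rule (3.3.4) amounts to a few lines of code (a minimal sketch):

```python
import math
import random

def sa_accept(delta, beta):
    """Acceptance test (3.3.4): an improvement (delta <= 0) is always
    accepted; a deterioration is accepted with probability exp(-beta*delta)."""
    return delta <= 0.0 or random.random() < math.exp(-beta * delta)
```

For β → ∞ the rule approaches pure descent, while β = 0 accepts every proposal.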
The expression (3.3.4) for the acceptance probability p_k is motivated by the annealing process modelled in the simulated annealing method. In statistical mechanics, the probability that a system will pass from a state with energy E₀ to a state with energy E₁, where ΔE = E₁ − E₀ > 0, is exp(−ΔE/(KT)), where K = 1.38·10⁻¹⁶ erg/K is the Boltzmann constant and T is the absolute temperature. Thus β = 1/(KT), and the lower the temperature, the smaller the probability of transition to a higher energy state.
In the more general case, when z_k is distributed according to a transition probability Q_k(x_k,·), the simulated annealing method is a typical Markovian method (see Algorithm 3.3.1), the only particularity being the form (3.3.4) of the acceptance probabilities.
If β_k = β = 1/(KT), i.e. the acceptance probabilities do not change with k, and Q_k(x,·) = Q(x,·) does not depend on the iteration count k, then the point sequence {x_k} constitutes a homogeneous Markov chain, converging (under rather general conditions on Q and f) in distribution to the stationary Gibbs distribution with density

π_T(x) = exp{−f(x)/(KT)} / ∫_X exp{−f(z)/(KT)} dz.   (3.3.6)

(The density is meant with respect to the Lebesgue measure in the continuous case, and with respect to the uniform measure in the discrete case.) Consequently, as T → 0 (or β → ∞), the Gibbs density π_T defined by (3.3.6) tends to concentrate on the set of global minimizers of f (subject to some mild conditions; see Geman and Hwang (1986), Aluffi-Pentini et al. (1985)). In particular, if the global minimizer x* of f is unique, then the Gibbs distribution converges to the δ-measure concentrated at x* as T → 0.
Numerically, if T is small (i.e. β is large), then the points x_k obtained by a homogeneous simulated annealing method tend to concentrate near the global minimizer(s) of f. Unfortunately, the time required to approach the stationary Gibbs distribution increases exponentially with 1/T and may reach astronomical values for small T (as is also confirmed by numerical results). This can be explained by the fact that, for small T, a homogeneous simulated annealing method tends to behave like the local random search algorithm that rejects unprofitable steps, so that its global search features are poor.
The homogeneous simulated annealing method is a particular case of the above-mentioned Metropolis algorithm, which uses for this purpose the acceptance probabilities

p(x, z) = min{1, w(z)/w(x)} = 1 if w(z) ≥ w(x),  w(z)/w(x) if w(z) < w(x).   (3.3.7)

More generally, acceptance probabilities of the form p(x, z) = g(w(x)/w(z)) may be used, where g is an arbitrary function with 0 ≤ g ≤ 1, g ≢ 0. For instance, for g(t) = 1/(1+t) we have

p(x, z) = w(z)/(w(x) + w(z))

in both cases w(z) ≥ w(x) and w(z) < w(x).
Another way of constructing stationary Markov chains having the stationary density (3.3.8) is due to Turchin (1971): it consists in setting p(x, z) ≡ 1 and building the required behaviour into the transition probability itself.
Let us turn now to the case of inhomogeneous simulated annealing methods, which use β_k → ∞ (or T_k → 0) for k → ∞ and, possibly, transition probabilities Q_k(x,·) depending on k. It can be shown that convergence of the distribution of the points x_k generated by the simulated annealing method to a distribution concentrated on the set of global minimizers of f can be guaranteed if T_k tends to zero slowly enough. For the standard variant of the method the choice T_k = c/log(2+k) is suitable, where c is a sufficiently large constant depending on f and X; see Mitra et al. (1986). Anily and Federgruen (1987) proved general results on this convergence and on the rate of convergence of the generalized discrete simulated annealing method, described by general transition probabilities Q_k(x,·) and general acceptance probabilities p_k(x,z) tending to zero, one, or a constant in the cases f(x) < f(z), f(x) > f(z), and f(x) = f(z), respectively.
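An inhomogeneous annealing loop with the logarithmic schedule T_k = c/log(2+k) can be sketched as follows; the one-dimensional Gaussian proposal, the step width, and the value of c are illustrative assumptions (in the theory c must be sufficiently large):

```python
import math
import random

def simulated_annealing(f, x0, n_iters, step=0.1, c=1.0):
    """Annealing with T_k = c / log(2 + k); the best point seen is
    returned, since late iterations may still accept worse moves."""
    x, y = x0, f(x0)
    best_x, best_y = x, y
    for k in range(n_iters):
        t = c / math.log(2 + k)
        z = x + random.gauss(0.0, step)        # proposal Q_k(x, .)
        eta = f(z)
        delta = eta - y
        if delta <= 0.0 or random.random() < math.exp(-delta / t):
            x, y = z, eta                      # acceptance rule (3.3.4)
            if y < best_y:
                best_x, best_y = x, y
    return best_x, best_y

bx, by = simulated_annealing(lambda t: (t - 1.0) ** 2, x0=0.0, n_iters=20000)
```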
A particular case of the above generalized simulated annealing method was proposed and numerically investigated by Bohachevsky et al. (1986): the acceptance probabilities have the form

p(x, z) = 1 if f(z) ≤ f(x),  exp{−β[f(x) − f_min]^g (f(z) − f(x))} if f(z) > f(x),   (3.3.11)

where g is an arbitrary negative number (for g = 0, (3.3.11) coincides with the homogeneous version of (3.3.4)) and f_min is an estimate of f* = min f. If, for some x, the value f(x) − f_min becomes negative, then it is proposed to decrease f_min and continue the search.
The reader interested in numerical realizations of the simulated annealing concept is also referred to Corana et al. (1987), Haines (1987), and van Laarhoven and Aarts (1987).
dX_t = −∇f(X_t) dt + (2T(t))^{1/2} dW_t,   (3.3.12)

where W_t is the standard multivariate Brownian motion. A detailed study of the solutions of stochastic differential equations of the type (3.3.12) is contained in Geman and Hwang (1986). In particular, it is shown there that the trajectory X_t corresponding to the solution of (3.3.12) has a stationary distribution with the Gibbs density (3.3.6), under some conditions on f and X, when T(t) → T = const as t → ∞. Moreover, the following is also shown for the case when the global minimizer x* of f is unique: if there exists an extension of f to an open set containing X that is twice continuously differentiable and has no local minimizers outside X, and if T(t) = c/log(2+t), where c is a sufficiently large real number, then as t → ∞ the trajectory of the process X_t solving equation (3.3.12) has a limiting distribution concentrated at x*.
A discretized version of the stochastic differential equation (3.3.12), obtained with the help of standard methodology, is the Markov chain

x_{k+1} = x_k − a_k ∇f(x_k) + (2 a_k T(k))^{1/2} ξ_k,   (3.3.13)

where {a_k} is a sequence of step lengths and ξ₁, ξ₂, ... are independent standard Gaussian random vectors.
(3.3.14)
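A single step of the discretized chain (3.3.13) can be sketched directly; since the display is not fully legible in the source, the standard Euler coefficients (drift −a∇f, noise variance 2aT) are assumed:

```python
import math
import random

def langevin_step(grad_f, x, a, temp):
    """One Euler step x' = x - a*grad_f(x) + sqrt(2*a*T)*xi, with xi a
    vector of independent standard Gaussian components."""
    scale = math.sqrt(2.0 * a * temp)
    return [xi - a * gi + scale * random.gauss(0.0, 1.0)
            for xi, gi in zip(x, grad_f(x))]
```

Combined with a slowly decreasing schedule such as T(t) = c/log(2+t), iterating this step mimics the annealing diffusion discussed above.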
Here the acceptance probabilities are

p_k = r(y_k − η_k) if η_k ≤ y_k,  p_k = 0 if η_k > y_k,   (3.3.15)

where r: ℝ → (0, 1] is a nondecreasing function, and y_k and η_k are the results of evaluating f at the points x_k and z_k (the latter point being distributed according to Q_k(x_k,·)). Correspondingly,
the essence of the algorithm can be expressed by the rule

x_{k+1} = z_k if η_k ≤ y_k and u_k ≤ r(y_k − η_k),  x_{k+1} = x_k otherwise,   (3.3.16)
where u_k is a random number. Comparing (3.3.4) with (3.3.15), we may hence conclude that in a sense the simulated annealing algorithm and (3.3.16) are opposites. The former method accepts each profitable step and also some unprofitable ones, while the latter rejects all unprofitable steps and even some profitable ones. Of course, the reasonableness of the algorithm (3.3.16) is due to the presence of random noise in the evaluations of f. Zielinski (1980) and Zielinski and Neumann (1983) proved the following.
Let us now return to the case where f is not subject to random noise and consider the variant of Algorithm 3.3.1 in which

x_{k+1} = z_k if f(z_k) ≤ f(x_k),  x_{k+1} = x_k if f(z_k) > f(x_k),   (3.3.17)

and the transition densities q_k have the form q_k(x_k, x) = φ(x − x_k), where the density φ is supposed to be given on ℝⁿ, continuous in a neighbourhood of zero, and such that φ(0) > 0. Note that (3.3.17) implies that the acceptance probabilities p_k of Algorithm 3.3.1 are

p_k = 1 if f(z_k) ≤ f(x_k),  p_k = 0 if f(z_k) > f(x_k).
The convergence of the above algorithm was first studied by Baba (1981); this is the reason for the algorithm being referred to as Baba's algorithm. Dorea (1983) proved the following result on the rate of convergence of the algorithm (note that a statement on convergence follows from this result).
where μ_n is the Lebesgue measure, and let Eν_B be the mean number of random vectors z_i, obtained in Baba's algorithm, required for the sequence {x_k} to attain the set A(δ). Then Eν_B admits two-sided bounds in terms of a quantity L₁, and L₁ attains its minimal value in the case when the algorithm at hand coincides with Algorithm 3.1.1.
We shall now present a more refined result on the convergence rate of a particular case of Baba's algorithm, revising some unpublished results of V.V. Nekrutkin; other asymptotic properties of the algorithm are investigated in Dorea (1986).
Assume that X = [0,1]ⁿ is the unit cube, the rule (3.3.17) is applied, and the transition density q_k = q has the form

q(z, x) = (1 − α) p₀(x) + α ψ_a(z, x),   (3.3.18)

where α ∈ [0,1), a ∈ (0,2] are parameters of the algorithm, p₀(x) = 1, x ∈ X, is the uniform density on X, and furthermore

ψ_a(z, x) = b(a, z) if x ∈ D_a(z),  ψ_a(z, x) = 0 otherwise,   (3.3.19)

where b(a, z) = 1/μ_n(D_a(z)). (For fixed z ∈ X, the function (3.3.19) is the uniform density on the set D_a(z), the intersection of the cube X with the cube centred at the point z with side length a.)
The choice of the transition density (3.3.18) implies that at each iteration of the algorithm either the uniform distribution on X (with probability 1−α) or the uniform distribution on D_a(z) (with probability α) is sampled. Note that in two particular cases (for α = 0 or a = 2) the algorithm at hand coincides with Algorithm 3.1.1, since the density (3.3.18) becomes the uniform density on X. Note also that the definition of b(a, z) implies

a^{−n} ≤ b(a, z) ≤ 2ⁿ a^{−n};   (3.3.20)

here the left-hand relation becomes an equality when each coordinate of the point z lies in the interval [a/2, 1−a/2], and the right-hand inequality becomes an equality at the vertices of the cube X.
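Sampling the transition density (3.3.18)-(3.3.19) on X = [0,1]^n takes one mixture draw (a minimal sketch):

```python
import random

def sample_transition(z, a, alpha):
    """With probability 1 - alpha return a uniform point of X = [0,1]^n;
    with probability alpha return a uniform point of D_a(z), the
    intersection of X with the cube of side a centred at z."""
    if random.random() >= alpha:
        return [random.random() for _ in z]                    # uniform on X
    return [random.uniform(max(0.0, zi - a / 2.0), min(1.0, zi + a / 2.0))
            for zi in z]                                       # uniform on D_a(z)
```

For α = 0 (or a = 2) every draw is uniform on X, recovering Algorithm 3.1.1.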
Before studying the properties of the algorithm, we formulate an auxiliary assertion well known in the theory of Markov processes.

Lemma 3.3.1. Let x₁, x₂, ... be a homogeneous Markov chain, let τ_C be the Markov moment of first hitting a set C ⊂ X, let E_x τ_C be the expectation of τ_C given x₁ = x, let A and B be two measurable subsets of X, and let Pr_x{x_{τ_B} ∈ dz} be the conditional distribution of the vector x_{τ_B} given x₁ = x. Then the relation

E_x τ_A ≤ E_x τ_B + ∫ Pr_x{x_{τ_B} ∈ dz} E_z τ_A   (3.3.21)

is valid.
The inequality (3.3.21) shows that the mean time for a trajectory of the Markov chain to reach the set A does not exceed the mean time needed to reach this set by those trajectories which must visit B before reaching A.
Let us take A = D_ε(x*), where x* is the global minimizer, and B = D_{a−ε}(x*), where ε (0 < ε < a) defines the required solution accuracy, and let us estimate the mean number of iterations needed to reach A by Baba's algorithm with transition densities determined via (3.3.18) and (3.3.19). By virtue of the continuity of f in a neighbourhood of x*, there exists a constant β > 0 (depending on f but independent of a and ε) such that μ_n(B) ≥ β(a − ε)ⁿ. Applying the inequalities
(3.3.20) and (3.3.21), together with the fact that, for each x ∈ B, the value f(x) does not exceed the values of f outside the set D_{a−ε}(x*), we obtain

sup_{z∈B} E_z τ_A ≤ [(1 − α + α a^{−n}) b(ε, x*)^{−1}]^{−1} ≤ 2ⁿ / [(1 − α + α a^{−n}) εⁿ].
Define

Φ(ε, a, α) = 1/[(1 − α)β(a − ε)ⁿ] + 2ⁿ/[(1 − α + α a^{−n})εⁿ];   (3.3.22)

then the obtained estimate for the mean number of iterations needed to reach the set A = D_ε(x*) can be written as E_x τ_A ≤ Φ(ε, a, α).
Let us now investigate the rate of increase of Φ(ε, a, α) as ε → 0. From (3.3.22) it follows that if a and α do not depend on ε, then the order of increase of Φ equals ε^{−n}, ε → 0, and coincides with the order of increase of the quantity Eτ_A for Algorithm 3.1.1, for which Eτ_A ≍ ε^{−n} as ε → 0.
Now set a = a(ε) = ε + φ(ε), where φ(ε) → 0 and φ(ε)/ε → ∞ as ε → 0. Then

Φ(ε, a(ε), α) = 1/[(1 − α)β(φ(ε))ⁿ] + 2ⁿ/[(1 − α)εⁿ + α(1 + φ(ε)/ε)^{−n}] ~ (φ(ε))^{−n}/[(1 − α)β] + (2ⁿ/α)(φ(ε)/ε)ⁿ for ε → 0.

Hence, for φ(ε) ≍ √ε, ε → 0, the order of increase of Φ(ε, a(ε), α) is ε^{−n/2}; this is optimal, since for a(ε) ≡ const > 0, for α(ε) → 0 or α(ε) → 1, and also when the supposition φ(ε)/ε → ∞ for ε → 0 is omitted, the order becomes worse than ε^{−n/2}.
Now set α(ε) = α, a(ε) = ε + d√ε and find the optimal values of the constants α and d. Transforming Φ we obtain

Φ(ε, a(ε), α) ~ ε^{−n/2} Ψ(α, dⁿ) for ε → 0.

Thus the problem of (approximately) optimal selection of the constants α and d is reduced to minimizing the function

Ψ(α, γ) = [(1 − α)βγ]^{−1} + 2ⁿγ/α,

where γ = dⁿ, on the set α ∈ (0,1), γ > 0. Setting the partial derivatives of Ψ equal to zero, we obtain a unique solution of the resulting simultaneous equations: α = 0.5, γ = 2^{−n/2}β^{−1/2}. Hence it follows that the quasi-optimal values of the parameters of the algorithm are α = 0.5 and

a = a(ε) = ε + β^{−1/(2n)} √(ε/2).
Assume that the dimension n of X is high. Although in this situation the usual versions of random search algorithms are generally much simpler than deterministic algorithms, they are nevertheless rather laborious, because n-dimensional random vectors have to be sampled at each iteration. By contrast, the relative numerical demand of the algorithms of this section for large n is very modest: only two random variables have to be sampled at each k-th iteration (k > 1).
The approach below can be regarded as a modification of the method considered in Section 3.3.2; it consists of sampling a homogeneous Markov chain with a given stationary density. Choose a nonnegative function Ψ defined on X such that the univariate densities proportional to any one-dimensional cross-section of Ψ are easily sampled, and define the transition probabilities Q_k(z,·) = Q(z,·) by a formula (3.3.23) in which the new point is sampled, with density proportional to Ψ, on a random line through z; here T(z, dt) is, for each z ∈ X, a probability distribution on the set of lines passing through the point z, and
T(z, dt) = ∑_{i=1}^n q_i δ(z, dt − t_i),   (3.3.24)

where

q_i > 0 (i = 1, ..., n),  ∑_{i=1}^n q_i = 1,   (3.3.25)
δ(z, dt − t_i) is the distribution concentrated on the line passing through z and parallel to the vector t_i; {t₁, ..., t_n} is a set of n-dimensional linearly independent vectors; (x, t) denotes the linear coordinate of a point x on the line t (the projection of x onto t); and c(z) is, for each z, a normalization constant. The transition probability (3.3.23) is sampled by applying the superposition technique: first, a random line passing through z and parallel to one of the vectors t_i is chosen by sampling from the distribution induced by the probabilities (3.3.25), and then a one-dimensional distribution with density proportional to Ψ is sampled on this line.
Note that the most natural choice of the q_i is q_i = 1/n (i = 1, ..., n), while the vector set {t₁, ..., t_n} can be chosen as the set of coordinate vectors. Note also that the above way of choosing the transition probability Q(z,·) draws on the idea of Turchin (1971).
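With q_i = 1/n and coordinate vectors t_i, one transition of this chain on X = [0,1]^n can be sketched as follows; exact one-dimensional sampling is replaced by a grid discretization of the cross-section of Ψ, which is an illustrative device rather than part of the construction:

```python
import random

def line_transition(z, psi, n_grid=256):
    """Choose a coordinate direction with probability 1/n, then sample the
    new coordinate on that line from a discretized density ~ psi."""
    i = random.randrange(len(z))              # direction t_i with prob. 1/n
    grid = [(j + 0.5) / n_grid for j in range(n_grid)]
    weights = []
    for t in grid:
        x = list(z)
        x[i] = t
        weights.append(psi(x))                # cross-section of psi along t_i
    x_new = list(z)
    x_new[i] = random.choices(grid, weights=weights)[0]
    return x_new
```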
Proposition 3.3.2. Let the function Ψ be non-negative and piecewise continuous, let the set X' = {x ∈ X: Ψ(x) > 0} be connected, let Ψ(x₁) > 0, and let a homogeneous Markov chain be sampled with the transition probability (3.3.23) for which (3.3.24) is satisfied. Then the sampled Markov chain has a stationary distribution whose density is proportional to Ψ.
Proof. Under the conditions formulated above, the transition probability (3.3.23) meets the Doeblin condition: there exist a natural number m, two real numbers ε₁ ∈ (0,1) and ε₂ > 0, and a probability measure ν(dx) on (X, B) such that for x ∈ X' the inequality ν(A) ≥ ε₁ implies Q^m(x, A) ≥ ε₂, A ⊂ X'. Indeed, with a suitable choice of ν, the Doeblin condition means that the density q^m(z, x) of the probability of transition from any point z ∈ X' to any point x ∈ X' in m steps is positive, which is true already for m = n. Since the Doeblin condition is met, the Markov chain under study has a stationary distribution with a density p(x) that is the unique solution (in the class of all probability densities) of the integral equation

p(x) = ∫ p(y) q(y, x) dy.

The fact that the normalization of Ψ solves this equation follows from Turchin (1971); this proves the proposition.
Sampling of the one-dimensional densities will be especially simple if constant or piecewise constant functions are used as Ψ. If exp{−λf(x)} or another function connected with f is used as Ψ, then the stationary distribution density will be proportional to it, but one may face a hard sampling problem for the one-dimensional distributions: if sampling relies on the rejection technique with a constant majorant, then the efficiency of the resulting optimization algorithm for f will not differ appreciably from that of Algorithm 3.1.1. If one succeeds in employing more efficient sampling methods (e.g. the inverse-function method, if the cross-sections of Ψ along some directions have readily sampled forms, or the rejection technique with good majorants), then the efficiency of the derived algorithm may be significantly superior to that of Algorithm 3.1.1. These devices work only when the analytical form of f is known; otherwise the procedure described below can be used. This procedure relies upon the fact that, under certain conditions, an appropriately chosen subsequence of a stationary Markov sequence with stationary density proportional to Ψ may be regarded as a stationary Markov chain with stationary density proportional to w (where exp{−λf} or any other function related to f may be used as w).
Before formulating the procedure, we shall introduce several notations and prove an auxiliary assertion. Let (X, B) be a measurable space, let P₁ be a probability distribution on it, let Q(z, dx) be a Markov transition probability, and let P(z, dx) = (1 − g(z))Q(z, dx), where 0 ≤ g ≤ 1 (g(z) being the probability of termination of the Markov chain with transition probability P(z, dx) at a point z ∈ X). Further, let τ be the termination moment of the Markov chain x₁, x₂, ... with initial distribution P₁ and transition probability P(z, dx), i.e.

τ = min{m ≥ 1: u_m ≤ g(x_m)},   (3.3.26)

where x₁ has distribution P₁, x_i (i = 2, 3, ...) has distribution Q(x_{i−1},·), and the u_j are random numbers.
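The terminated chain and its stopping moment (3.3.26) can be simulated directly (a sketch; the cap on the number of steps is a practical safeguard only):

```python
import random

def run_until_termination(sample_p1, sample_q, g, max_steps=100_000):
    """Run x_1 ~ P1 and x_i ~ Q(x_{i-1}, .), stopping at the first m with
    u_m <= g(x_m); return the pair (x_tau, tau)."""
    x = sample_p1()
    for tau in range(1, max_steps + 1):
        if random.random() <= g(x):           # termination test at x_tau
            return x, tau
        x = sample_q(x)
    raise RuntimeError("no termination within max_steps")
```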
Below, the minimal positive solution of an integral equation, i.e. the one obtained by the method of successive approximations, is taken as the solution. For example, for the integral equation

G(dx) = P₁(dx) + ∫ G(dz) P(z, dx)

the minimal positive solution is

G(dx) = ∑_{m=1}^∞ P_m(dx),  P_m(dx) = ∫ P₁(dz) P^{m−1}(z, dx),

where P^{m−1} denotes the (m−1)-step transition probability. Note that the minimal positive solution always exists and is unique.
Theorem 3.3.1. With the above notation, the following statements hold: (i) the measure F(dx) = g(x)G(dx), where G is the minimal positive solution of the above integral equation, is the probability distribution of the random vector x_τ, where x₁, ..., x_τ are random vectors connected into the Markov chain with initial distribution P₁ and transition probability P; (ii) Pr{τ < ∞} = 1.

Proof. We have

Pr{x_τ ∈ A, τ = m} = ∫ ⋯ ∫ (1 − g(x₁)) ⋯ (1 − g(x_{m−1})) g(x_m) P₁(dx₁) Q(x₁, dx₂) ⋯ Q(x_{m−1}, dx_m) 1_A(x_m) = ∫_A g(x) P_m(dx),

which stands for the probability of termination of the Markov chain at the m-th step in the set A. Thus

Pr{x_τ ∈ A, τ < ∞} = ∑_{m=1}^∞ ∫_A g(x) P_m(dx) = ∫_A g(x) G(dx) = F(A).

Furthermore, integrating the equation for G over X gives

G(X) = ∫∫ G(dz) P(z, dx) + 1 = G(X) − ∫ g(z) G(dz) + 1,

which is equivalent to ∫ g(z) G(dz) = 1; whence, taking into consideration the first assertion and the fact that G(X) < ∞, we get Pr{τ < ∞} = 1. Since Pr{τ < ∞} = 1, it follows from the first assertion and from the fact that G is a measure that F is the probability distribution of the random vector x_τ. The theorem is proved.
Notably, if the Markov chain x₁, x₂, ... starts at a fixed point x ∈ X, i.e. if x₁ = x with probability one, then the measure G corresponds to the potential kernel of the chain in potential theory, cf. Revuz (1975). In this theory, (3.3.28) was known for g(x) = 1_A(x), A ⊂ X. Theorem 3.3.1 can serve (and did serve for the author) as a basis for the creation and investigation of sampling algorithms for a wide range of probability distributions; the approach is thoroughly described in Zhigljavsky (1985).
Denote by ξ₁, ξ₂, ... a homogeneous Markov chain with an initial distribution F₀ and Markov transition probability Q(z,·). Set P(z, dx) = (1 − g(z))Q(z, dx), where g is a function on X with 0 ≤ g ≤ 1, and put τ₀ = 0,

τ_k = min{m > τ_{k−1}: α_m ≤ g(ξ_m)} (k = 1, 2, ...),

where the α_j are random numbers. According to Theorem 3.3.1, the random vectors ξ_{τ_k} (k = 1, 2, ...) have distributions F_k representable as F_k(dx) = g(x)G_k(dx) in terms of G_k, the solution of the equation

G_k(dx) = F_{k−1}(dx) + ∫ G_k(dz) P(z, dx).   (3.3.30)
It is proved below that, under some assumptions, for any initial distribution F₀ the consecutive approximations (3.3.30) weakly converge to the solution of the equation

G(dx) = g(x)G(dx) + ∫ G(dz) P(z, dx).   (3.3.31)
Proposition 3.3.4. Let X be a compact subset of ℝⁿ; suppose that for some m ≥ 1 the m-step transition probability P^m(z, dx) is strictly positive (i.e. P^m(z, D) > 0 for any z ∈ X and each open hyperrectangle D ⊂ X), that the method of successive approximations for (3.3.32) converges in the metric of C(X) for any function v_{k−1} continuous on X, and that the family of functions {v_k} is equicontinuous. Then F(dx) = g(x)G(dx), where G is the solution of (3.3.31), is a probability distribution, and the sequence of distributions F_k(dx) = g(x)G_k(dx) (k = 1, 2, ..., where the G_k are the solutions of (3.3.30)) weakly converges to F(dx).
whence (v₁(z) − M₁)g(z) + M₁ ≤ v₀(z) for all z ∈ X. Passing to maxima, we obtain M₁ ≤ M₀ (in doing so, it is evident that if v₁(x) ≢ M₁ then g(x₁*) > 0, where x₁* is a maximizer of v₁). From the strict positivity and continuity of v₀ and v₁ we obtain that M₁ = M₀ if and only if v₁(x) ≡ M₁ (and, consequently, v₀(x) ≡ M₁).
By virtue of the Arzelà theorem, there exists a subsequence v_{k_m} of the sequence v_k that converges uniformly to a continuous function u₀. Then v_{k_m+1} converges to u₁, the transform (3.3.32) of u₀. Since the numerical sequence max v_k is monotone and bounded,

M = lim_{k→∞} max v_k

exists. Since the convergence to u₀ and u₁ is uniform, max u₀ = max u₁ = M and, therefore, u₀ ≡ u₁ ≡ M. The limit is thus independent of the choice of v_{k_m}; hence v_k → M uniformly as k → ∞.
The required convergence of the integrals now follows by virtue of the Lebesgue theorem on dominated convergence. Thus we see that ∫ v₁(x) F_k(dx) converges as k → ∞ for any continuous function v₁, whence it follows from Feller (1966), Section 1 of Ch. 8, that there exists a probability measure F that is a weak limit of {F_k}. The validity of (3.3.31) now obviously follows from (3.3.30), and the proposition is proved for the case m = 1.
In order to pass to arbitrary m, apply the above assertion m times, substituting P^m(z,·) and F_i(dx), respectively, for P(z,·) and F₁(dx) (i = 1, ..., m). The limits of the sequences {F_{mk+i}(dx)}_{k=0}^∞ will be the same and, therefore, {F_k} also converges to this limit. The proof is complete.
Proposition 3.3.4 is a generalization of the ergodic theorem for Markov chains from Feller (1966) to the case of m ≥ 1 and a function g that is not identically equal to 1.
Now rewrite (3.3.31) in an equivalent form. By assumption, the transition probability Q(z,·) is chosen so that all solutions of the resulting equation have the form

(1 − g(x)) G(dx) = cΨ(x) dx for x ∈ X, and 0 otherwise,   (3.3.34)

where c is an arbitrary positive constant.
Let us require that the stationary distribution F of the sequence ξ_{τ_k} be proportional to w. Together with (3.3.34), this leads to the termination probability

g(x) = w(x) / (w(x) + Ψ(x)).   (3.3.36)

If the behaviour of Ψ resembles that of w, then p is close to one and the algorithmic efficiency is high. Otherwise (if Ψ is small where w is large and large where w is small), both p and the efficiency are low. The profile of Ψ thus should be close to that of w. If prior information concerning the behaviour of f is missing, then this function should be estimated in the course of optimization.
The above algorithm can be generalized to the situation where the evaluations of w(x) (in particular, one can use w(x) = −f(x) + const) are subject to a random noise η(x), with w + η ≥ 0 a.s. In this case the analogue of (3.3.36) is

g(x) = (w(x) + η(x)) / (w(x) + η(x) + Ψ(x)),

and the subsequence ξ_{τ_k} has a stationary distribution proportional to the function
h(x) = (E[(w(x) + η(x) + Ψ(x))^{−1}])^{−1} − Ψ(x)
rather than to w(x). The behaviour of this function is related to some extent to that of w(x). Indeed, let the distribution of the random variable η(x) be independent of x, and let two points x, z ∈ X be chosen such that Ψ(x) = Ψ(z) and w(x) > w(z); then a simple calculation gives h(x) > h(z).
CHAPTER 4. STATISTICAL INFERENCE IN GLOBAL RANDOM
SEARCH
This chapter considers various approaches to the construction and investigation of global
random search algorithms based on mathematical statistics procedures. It appears that
many algorithms of this chapter are both practically efficient and theoretically justified.
Their main theoretical feature is that convergence of the algorithms studied does not hold in the deterministic sense, or even in probability, but is valid only with some reliability level. Thus the algorithms can miss a global optimizer, but the probability of this unfavourable event is under control and can be guaranteed to be arbitrarily small.
The chapter contains six sections.
Section 4.1 is auxiliary and specifies the ways of applying mathematical statistics
procedures for constructing global random search algorithms.
Section 4.2 treats statistical inference concerning the optimal value of a function on the basis of its values at independent random points; much attention is paid to the issue of increasing accuracy by using prior information about the objective function.
Section 4.3 describes a general class of global random search methods which use
statistical procedures; furthermore, it generalizes the well-known family of the branch and
bound methods, permitting inferences that are accurate only with a given reliability (rather
than exactly).
Section 4.4 modifies the statistical inference of Section 4.2 for the case in which the points where f is evaluated are elements of a stratified sample rather than an independent one. It also demonstrates the gains implied by stratification and shows that decreasing randomness, where possible, generally leads to increased efficiency.
Section 4.5 presents statistical inference for the random multistart method, which was described in Section 2.1 and serves as a basis for a number of efficient global random search algorithms; the statistical inferences can be applied for its control and modification.
Finally, Section 4.6 describes statistical testing procedures for distribution quantiles based on the so-called distributions neutral to the right. They can optionally be applied for checking the accuracy in a number of global random search algorithms, as well as for the construction of particular algorithms of the branch and probability bound type.
As earlier, the feasible region X is supposed to be a compact subset of ℝⁿ having a sufficiently simple structure, and the objective function f can be evaluated at any point of X without noise. In contrast with the preceding chapters, the maximization problem for f is considered rather than the minimization one. (This decision serves for an easier application of the related results of mathematical statistics; note that the transcription between minimum and maximum problems is obvious.)
is one of the unknown parameters of the nonlinear regression model describing the
cumulative distribution function (c.d.f.)
The cumulative distribution functions (4.1.2) and the techniques for their estimation are of considerable importance in global random search theory. One should bear in mind that, since the behaviour of (4.1.2) is of primary interest for those values of t for which F(t) is close to one, it is unreasonable to estimate F(t) for all t. This fact determines the specific character of the problem; various solution approaches will be considered in Section 4.2 and in Chapter 7.
Kernel (non-parametric) density estimates are useful for global random search theory. Two reasons for this are: densities that are kernel estimates can be sampled without knowing their values at given points, and the properties of kernel estimates are well studied. Using kernel estimates as an example, we shall demonstrate how non-parametric density estimates can support the construction of global random search algorithms in the case of a possible random noise and high labour consumption of the function evaluations. Assume that we are given a sample {x_1,...,x_N} from a distribution with density p(x) and that y(x) = f(x) + ξ(x) ≥ c_1 ≥ 0 with probability 1, where E ξ(x) = 0. Choose another density φ(x) on R^n and consider the densities (kernels)

    φ_β(x) = β^{-n} φ(x/β),    β > 0,

induced by it.
Let us demonstrate (for strict proofs see Section 5.2) that for large N the density
Statistical Inference 117
    q_β(x) = Σ_{i=1}^{N} [ y(x_i) / Σ_{j=1}^{N} y(x_j) ] φ_β(x - x_i)       (4.1.3)

approximates a smoothed version of the density r introduced in (4.1.5) below; indeed, y(x_i) = f(x_i) + ξ(x_i), and for N → ∞ the following relations hold (by virtue of the law of large numbers):
    (1/N) Σ_{i=1}^{N} f(x_i) → ∫ f(x) p(x) dx,        (1/N) Σ_{i=1}^{N} ξ(x_i) → 0,

    (1/N) Σ_{i=1}^{N} f(x_i) φ_β(x - x_i) → ∫ f(z) φ_β(x - z) p(z) dz,

    (1/N) Σ_{i=1}^{N} ξ(x_i) φ_β(x - x_i) → 0.
If the points x_i, taken with probabilities y(x_i)/Σ_{j=1}^{N} y(x_j), are regarded as the points z_i of a sample from the distribution with density

    r(x) = f(x)p(x) / ∫ f(z)p(z)dz,                            (4.1.5)

which is true asymptotically for N → ∞, then (4.1.3) is the kernel estimate
118 Chapter 4
    N^{-1} Σ_{i=1}^{N} φ_β(x - z_i)

of the density r. This estimate reflects the features of f. For example, if p is the density of the uniform distribution on X, then r(x) = f(x)/∫f(z)dz, i.e. r is proportional to f. The smoothed density (4.1.4) also has this property under a reasonable choice of β and sufficient smoothness of f. The choice of β is discussed in the corresponding literature (cf., for instance, Devroye and Györfi (1985)). Essentially, the smaller β is, the higher the accuracy of the kernel estimate near the points x_i, and the less regular the curve corresponding to this estimate. Roughly speaking, β should not be unduly small, and its choice should be coordinated with the choice of N.
Therefore, the density (4.1.3) may be expected to reflect the basic behavioural aspects of f (under good behaviour of f, large N, and reasonable values of β), i.e. to be large wherever f itself is large and small wherever f is small. By studying (4.1.3) we thus study (approximately) the behaviour of the objective function. The density (4.1.3) can be studied without evaluating it at points (the latter may be rather difficult because of the prohibitive number of points to be evaluated), but rather by drawing a suitable sample from the distribution with this density. A sample from the distribution with density (4.1.3) can be generated by means of the superposition method and presents no principal difficulties.
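In modern numerical terms, the superposition step can be sketched as follows (a minimal illustration: the Gaussian kernel, the toy objective and all constants are assumptions of this example, not part of the text):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_q_beta(xs, ys, beta, n_samples, rng):
    """Superposition sampling from the mixture density (4.1.3):
    choose component i with probability y(x_i)/sum_j y(x_j), then draw
    from the kernel phi_beta centred at x_i (here phi = N(0, I))."""
    w = np.asarray(ys, dtype=float)
    w = w / w.sum()                  # mixture weights; y must be non-negative
    idx = rng.choice(len(xs), size=n_samples, p=w)
    return xs[idx] + beta * rng.standard_normal((n_samples, xs.shape[1]))

# toy objective (illustrative): a single bump at (0.5, 0.5) on X = [0,1]^2
def f(x):
    return np.exp(-10.0 * ((x - 0.5) ** 2).sum(axis=1))

xs = rng.random((1000, 2))           # sample from p = uniform on X
zs = sample_q_beta(xs, f(xs), beta=0.05, n_samples=5000, rng=rng)
# zs concentrates where f is large, although q_beta is never evaluated
```

Each draw costs one categorical choice and one kernel draw; the density (4.1.3) itself is never evaluated.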
Having a sample {x_1,...,x_N} from a distribution with density p, one can estimate the mode (i.e. the maximizer of p) and the level surfaces of this density, i.e. the sets

    {x ∈ X: p(x) = h},    h > 0.                               (4.1.6)

One of the most convenient techniques of mode estimation is the following. Define a numerical sequence {m(N)} so that the asymptotic relations

    m(N) = o(N),    (N log N)^{1/2}/m(N) = o(1)

hold for N → ∞. Select the least-radius ball among those having their centres at the sample points and containing at least m(N) points. Then the mode of the distribution with density p is estimated by the vector consisting of the average coordinates of those points that are inside the ball. For estimating the mode domain Ω = {x ∈ X: p(x) ≥ h}, bounded by the surface (4.1.6), one should take in the above procedure the convex minimal-volume envelope of m(N) sample points instead of the minimal-radius ball. The above and similar
procedures have been thoroughly investigated by Sager (1983) for the unimodal density case. It seems that some of the results can be generalized to the case when the density belongs to certain classes of multiextremal functions (in particular, small local maxima far from the global maximizer should not influence the estimates). To reduce possible errors and information losses, the above procedures should be applied simultaneously to several subdomains, say, by separating clusters at the beginning.
There are several basic connections between the theory and methods of statistical modelling and global random search. First, since global random search algorithms are based on sequential sampling from probability distributions, sampling algorithms are a substantial part of them: the theory of global random search generates specific sampling problems in this case. For example, the construction of random search algorithms for extremal problems with equality constraints is possible only if there exist suitable sampling algorithms for probability distributions over surfaces. The problem of sampling on surfaces will be discussed in Section 6.1. Other sampling problems arise, e.g., when optimizing in functional spaces, see Section 6.2.1.
Second, that part of statistical modelling theory which deals with the optimal organization of random sampling for the estimation of multidimensional integrals is also of importance in the theory of global random search. Of special interest in this case is the problem of simultaneous estimation of several integrals, which is exemplified as follows. Let the distribution sequence {P_k} converge weakly, for k → ∞, to a probability measure concentrated at the global maximizer x* (the construction of such {P_k} is described in Section 2.3.3 and Chapter 5). For sufficiently large k, the estimates of ∫f(x)P_k(dx) and ∫x^{(j)}P_k(dx) (j = 1,...,n) are, respectively, estimates of max f and x*^{(j)}, where x^{(j)} and x*^{(j)} are the coordinates of the points x and x*, respectively. One more integral, the normalization constant of the distribution P_k, is added to the above n+1 integrals. Moreover, simultaneous estimation of several integrals using the same sample underlies a number of non-parametric regression estimation methods (see Section 8.3). Finally, many useful characteristics of the objective function (e.g. mean values on subsets of X) are representable as linear integral functionals of either the objective function itself or functions/measures closely related to it (functionals of the form (5.3.6) or functionals of estimates of f).
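The simultaneous estimation of these integrals from one sample can be sketched as follows. The choice P_k(dx) proportional to exp(k f(x)) dx is only one common way of building measures concentrating at x* (an assumption of this example, not the only construction of Section 2.3.3), and the objective is illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

# illustrative objective on X = [0,1]^2 with x* = (0.7, 0.3) and max f = 1
def f(x):
    return np.exp(-8.0 * ((x - np.array([0.7, 0.3])) ** 2).sum(axis=1))

def estimates_under_Pk(k, n=200_000):
    """Self-normalized importance sampling from the uniform distribution for
    P_k(dx) proportional to exp(k f(x)) dx: a single sample yields estimates
    of the normalization constant, of int f dP_k (close to max f) and of
    int x dP_k (close to x*) simultaneously."""
    x = rng.random((n, 2))
    fx = f(x)
    u = np.exp(k * fx)
    const = u.mean()                  # ~ normalization constant of P_k
    w = u / u.sum()                   # normalized weights dP_k/dU at the sample
    return const, w @ fx, w @ x

const, m_hat, x_hat = estimates_under_Pk(k=60)
```

As k grows, the weights concentrate near x*, and both m_hat and x_hat approach max f and x* at the price of a shrinking effective sample size.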
Third, various procedures created within the framework of Monte Carlo theory in an effort to reduce the variance of estimates of linear integral functionals are used for the construction of optimization algorithms. For example, Section 4.4 considers stratified sampling, while Section 8.2 will analyse importance sampling. Let us demonstrate that dependent sampling procedures may also be used in search algorithms.

Let X ⊂ R, and let the objective function f be sufficiently smooth and subject to random noise: f(x) = E f(x,ω), where f(x,ω) is the result of a simulation trial conducted at a point x ∈ X under the random occurrence ω, which in practice is a collection of random numbers (i.e. realizations of the random variable uniformly distributed over the interval [0,1]). Let the derivative f′(x_0) at a point x_0 ∈ X be estimated and the approximations

    f′(x_0) ≈ [f(x_0 + h) - f(x_0 - h)]/(2h),                  (4.1.7)

    f(x_0 + h) ≈ (1/N) Σ_{i=1}^{N} f(x_0 + h, ω_i),            (4.1.8)

    f(x_0 - h) ≈ (1/N) Σ_{i=1}^{N} f(x_0 - h, ω′_i)            (4.1.9)
be used for the estimation of f′(x_0), where h is a small value and N an integer (respectively, the step size and the sample size).

In ordinary simulation experiments, all random elements ω_i and ω′_i are different, and the error of the estimator of f′(x_0) arises from the error of the approximation (4.1.7) (which is of order h² for h → 0) and from the error due to the randomness of (4.1.8) and (4.1.9). The latter error is of order σ h^{-1} N^{-1/2}, where σ² is the variance of the estimate f(x_0,ω) of f(x_0). Thus, one can refer to

    c_1 h² + σ h^{-1} N^{-1/2}                                 (4.1.10)

as the error of the approximation (4.1.7)-(4.1.9) under different random occurrences ω_i and ω′_i. If N → ∞, h = h(N) → 0, h(N)N^{1/2} → ∞, then (4.1.10) approaches zero as N → ∞, and the optimal sequence h_opt(N) minimizing (4.1.10) satisfies h_opt(N) ∝ N^{-1/6}.

When the dependent sampling procedure is used, one takes ω′_i = ω_i for i = 1,...,N in (4.1.8), (4.1.9), i.e. the simulations for x_0 + h and x_0 - h are carried out under the same collection of realizations of random numbers. Choosing h arbitrarily small, one then obtains an error of order σ_1^{1/2} N^{-1/2} instead of (4.1.10), where σ_1 is determined by the variance of the noise increments at x_0 + h and x_0 - h.
In this way, the order of convergence of the approximation as N → ∞ is higher for the dependent sample. Analogous results are valid for the multidimensional case and for the case of higher derivatives as well. Recently, an improvement of the dependent sampling technique was created for the optimization of discrete simulation models of particular classes; for references see, e.g., Suri (1987).
Note that some of the approaches to global optimization (see, e.g., Section 6.2.1 below) need the drawing of a random function g from a parametric functional set G: setting randomness is equivalent to setting a probability measure on a parameter set. The prior information is often insufficient for a reasonable choice of this probability measure, and frequently a uniform distribution on the parameter set is used. Such a choice of randomness on the parameter set may well lead to exotic random functions from G. Kolesnikova et al. (1987) proposed to construct the probability distribution by defining uniformity properties on the set of realizations of random functions from G; see also Zhigljavsky (1985).
It should also be noted that global random search theory can be regarded as a part of the theory of the Monte Carlo method, if the latter concept is interpreted in a broad sense. The following basic kinds of experiments are distinguished (see Ermakov and Zhigljavsky (1987)): regression, factorial, extremal, simulation (i.e. Monte Carlo), and discrimination (including screening) for hypothesis testing. In the above subsections, possible applications of procedures for the design of regression, extremal and simulation experiments have been discussed. The theory of screening experiments is also important for global search theory: it consists in determining the basic potential for the construction of algorithms, in the construction itself, and in separating from many factors those several ones that define the relation at issue. Some applications of screening experiment theory in global search were investigated by Szltenis (1989), see also Section 2.3.2.
4.2.1 Statement of the problem and a survey of the approaches for its solution
At each step of most global random search algorithms, there is a sample of points from subsets Z ⊂ X and the values of the objective function f at these points; the distributions for subsequent points are constructed after drawing certain statistical inferences concerning the objective function behaviour. The parameter

    M_Z = sup_{x∈Z} f(x)

and the behaviour of the c.d.f. (4.1.2) in the vicinity of M_Z (i.e. for t such that F_Z(t) ≈ 1), which is unambiguously related to the behaviour of the objective function near the point

    x*_Z = arg max_{x∈Z} f(x),

are of primary importance in making a decision about the prospectiveness of Z ⊂ X for further search.
Since for all sets Z and at various iterations statistical inferences are drawn in a similar manner, they will be drawn only for M = M_X = max f and F = F_X, through an independent sample Ξ = {x_1,...,x_N} from a distribution P on X satisfying the condition P(B(ε)) > 0 for all ε > 0. It will also be assumed that X ⊂ R^n and that f is evaluated without noise. The elements x_1,...,x_N of the independent sample Ξ are mutually independent random vectors (a way of their generation can either consist of the direct sampling of a distribution or include iterations of a local ascent starting at random points, see Section 3.1.6). Some generalizations to the cases of random noise and of dependence of the elements of Ξ can be found in Sections 4.2.8 and 4.4.
The statistical inferences below relate to the sample

    Y = {y_1, ..., y_N},                                       (4.2.1)

where y_i = f(x_i) for i = 1,...,N are independent realizations of the random variable η with c.d.f.

    F(t) = ∫_X 1_{[f(x)<t]} P(dx),                             (4.2.2)

and to the order statistics η_1 ≤ η_2 ≤ ... ≤ η_N corresponding to the sample (4.2.1). The parameter M = max f is the upper bound of the random variable η (M = vrai sup η), i.e.

    Pr{η ≤ M - ε} < 1

holds for any ε > 0. Now, the problem of drawing statistical inferences for M = max f is stated as follows: having a sample (4.2.1) of independent realizations of the random variable η with c.d.f. (4.2.2), statistical inferences for the upper bound M = vrai sup η are to be made. The statistical inference described below can both provide supplementary information on the objective function at each iteration of numerous global random search algorithms (in particular, in order to construct suitable stopping rules) and support the construction of various branch and probability bound methods (see later Section 4.3).
For convenience, the main conditions to be applied in Section 4.2 are collected below.

(a) The function V(v) = 1 - F(M - 1/v) is regularly varying at infinity with some exponent (-α), 0 < α < ∞, i.e.

    lim_{v→∞} [V(tv)/V(v)] = t^{-α}                            (4.2.4)

for each t > 0.

(b) For some c_0 > 0,

    F(t) = 1 - c_0 (M - t)^α + o((M - t)^α),    t ↑ M.         (4.2.5)

(f) The representation

    |f(x) - f(x*)| = w(||x - x*||) H(x - x*) + o(||x - x*||^β),    ||x - x*|| → 0,    (4.2.7)

is valid, where H is a positive homogeneous function on R^n\{0} of order β > 0 (for H the relation H(λz) = λ^β H(z) holds for all λ > 0, z ∈ R^n) and w is a positive, continuous, one-dimensional function.

(g) There exists a function U: (0,∞) → (0,∞) such that for any x ∈ R^n, x ≠ 0, the limit (4.2.9) is valid.

(i) There exist positive numbers ε_2, β, c_3 and c_4 such that for all x ∈ B(ε_2) the inequality

    c_3 ||x - x*||^β ≤ M - f(x) ≤ c_4 ||x - x*||^β             (4.2.10)

holds.
If the minimization problem for f is considered, then one has to deal with the lower bound vrai inf η = min f of the random variable η, make the evident substitution in the definitions of x* and M, and change the conditions (a) and (b) into

(a′) the function U(u) = F(M - 1/u), where u < 0 and M = vrai inf η > -∞, is regularly varying at infinity with some exponent (-α), i.e.

    lim_{u→-∞} U(tu)/U(u) = t^{-α}.
Several basic approaches to estimating the optimal value M = max f will be outlined below. An approach involving the construction of a parametric (e.g. quadratic in x) regression/approximation model of f(x) in the vicinity of a global maximizer is often used in the case when evaluations of f are subject to a random noise and it is likely that the vicinity of x* has been reached, where the objective function behaves well (for example, f is locally unimodal and twice continuously differentiable); see later Section 8.1.
Another approach (see Mockus (1967), Hartley and Pfaffenberger (1971), Chichinadze (1967, 1969), Biase and Frontini (1978)) is based on the assumption that the c.d.f. (4.2.2) is determined (identified) up to a certain number of parameters that are estimated via (4.2.1) by means of standard mathematical statistics tools. Essentially, the results of Betro (1983, 1984) (described later in Section 4.6) can also be assigned to this general approach. It has three main drawbacks:
(i) for many commonly used classes of objective functions, the adequacy of the parametric models is not valid, and constructive descriptions of functional classes for which these models have acceptable accuracy do not yet exist;
(ii) the presence of redundant parameters decreases the estimation accuracy of M and increases the computational effort; and
(iii) the construction of a confidence interval for M and the testing of statistical hypotheses about M are frequently impossible, while these are of primary importance in optimization problems (an exception being the method of Hartley and Pfaffenberger (1971), designed especially for confidence interval construction).
Certainly, a non-parametric approach for estimating the c.d.f. (4.2.2) together with M can be used, but its high efficiency can hardly be expected. Generally, the semi-parametric approach described below is more efficient than both approaches mentioned. Another of its advantages is that it is theoretically justified (if the sample size N is large enough). This semi-parametric approach is based on the following classical result of the asymptotic theory of extreme order statistics, see e.g. Galambos (1978).
Theorem 4.2.1. Let M = vrai sup η < ∞, where η is a random variable with c.d.f. F(t), and let the condition (a) of Section 4.2.1 be fulfilled. Then

    lim_{N→∞} F^N(M + x_N z) = Φ_α(z),                         (4.2.11)

where {x_N} is a suitable normalizing sequence (under the condition (b) one can take x_N = (c_0 N)^{-1/α}) and

    Φ_α(z) = exp{-(-z)^α} for z < 0,    Φ_α(z) = 1 for z ≥ 0.  (4.2.14)

The distribution having the c.d.f. (4.2.14) will be called an extreme value distribution and α its tail index. The c.d.f. Φ_α(z), including the limit case Φ_∞(z) = exp{-e^{-z}}, is the unique nondegenerate limit of c.d.f. sequences for (η_N - a_N)/b_N in the case M < ∞, where {a_N} and {b_N} are arbitrary numerical sequences. The condition (a) is a regularity condition for the c.d.f. F in the vicinity of M: only some exotic functions fail to satisfy it, see Gumbel (1958), Galambos (1978). In particular, it is fulfilled if the natural condition (b) is met.

The asymptotic representation (4.2.11) means that the sequence of random variables (η_N - M)/x_N, where η_N is the maximal order statistic of the sample (4.2.1), converges in distribution to the random variable with c.d.f. (4.2.14).
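This convergence can be checked numerically for a concrete c.d.f. satisfying condition (b). The sketch below takes F(t) = 1 - (M - t)^α with M = 1, c_0 = 1, α = 2 (an assumption of the example), uses x_N = (c_0 N)^{-1/α}, and compares the empirical c.d.f. of (η_N - M)/x_N with Φ_α at one point:

```python
import math
import numpy as np

rng = np.random.default_rng(4)

M, alpha, c0 = 1.0, 2.0, 1.0
N, reps = 2000, 4000

# inverse transform for F(t) = 1 - c0*(M - t)**alpha on [0, 1]
u = rng.random((reps, N))
eta = M - ((1.0 - u) / c0) ** (1.0 / alpha)
eta_N = eta.max(axis=1)                    # maximal order statistic, per run

xN = (c0 * N) ** (-1.0 / alpha)            # normalizing sequence
z = (eta_N - M) / xN                       # ~ extreme value law Phi_alpha

# Phi_alpha(-0.5) = exp(-0.25) ~ 0.7788
emp = float(np.mean(z <= -0.5))
```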
As shown below, the convergence rate of F(t) to 1 for t ↑ M is mainly defined by the tail index α of the extreme value distribution. The main difficulty in applying the results of extreme order statistics theory is that α may not be known in practice. For this reason, e.g. Clough (1969) suggested taking α = 1, which is, of course, generally false. Dannenbring (1977), Makarov and Radashevich (1981) and some others suggested, following Gumbel's (1958) recommendation, to estimate jointly α, M, and a_N by taking a sample of size N of independent maximal order statistics corresponding to samples of the r.v. η. This approach requires of the order of N² evaluations of f and has the drawbacks (ii) and (iii) detailed above.

The approach below also relies on the theory of extreme order statistics, but is free of the drawbacks mentioned. It is based on some recent advances of mathematical statistics described later in Chapter 7, enabling a prior determination of the value of the tail index α for a wide range of objective function classes. The present approach was developed by Zhigljavsky (1981, 1985, 1987), independently of other authors (cited below) who obtained similar results. The main similarity is between Theorem 4.2.2 and the outstanding result of de Haan (1981), which will be properly discussed.
4.2.3 Statistical inference for M, when the value of the tail index α is known

Let us cite from Chapter 7 some results concerning statistical inference for M: the corresponding proofs, references, comments, as well as additional details, will be given in Chapter 7. We shall suppose below that the condition (a) defined in Section 4.2.1 is satisfied and that the value of the tail index α ≥ 1 of the extreme value distribution is known. (This is a realistic assumption for many global random search problems, as will be seen in Section 4.2.6.)
First, let us consider some estimators of M. Linear estimators have the form

    M̂_{N,k} = Σ_{i=0}^{k} a_i η_{N-i},                         (4.2.15)

where the coefficient vector a = (a_0, ..., a_k)′ satisfies the normalization condition

    a′λ = Σ_{i=0}^{k} a_i = 1                                  (4.2.16)

(λ being the vector of ones), and the asymptotically optimal choice a* is characterized by (4.2.17) and (4.2.18) for N → ∞. The explicit form for the components a_i* (i = 0,1,...,k) of the vector a* involves, for α ≠ 2, the constant

    c = αΓ(k+2) [ (α-2)Γ(2/α + k + 1) - (α-2)Γ(2/α + 1) ]^{-1};

the resulting estimator is asymptotically Gaussian for N → ∞ (4.2.19), and the constant c_0 of the condition (b) can be estimated via (4.2.20).
Some other linear estimators are also asymptotically Gaussian and asymptotically optimal (an analogue of (4.2.19) is valid for them): for instance, the estimator (4.2.15) with the coefficients (4.2.21), or the estimator M̂ defined as the solution of the equation

    (α - 1) Σ_{j=0}^{k-1} β_j(M̂) = k + 1                       (4.2.22)

under the condition M̂ ≥ η_N, where β_j(M) is determined by (4.2.23). The latter estimator is asymptotically Gaussian for N → ∞ (4.2.24), where c_0 can be estimated by (4.2.20). This approach leads to less precise confidence intervals than the preceding one, but is applicable for any k ≥ 1.
The one-sided confidence interval

    [η_N, η_N + r_{k,1-γ}(η_N - η_{N-k})],                     (4.2.25)

where

    r_{k,1-γ} = [ (1 - γ^{1/k})^{-1/α} - 1 ]^{-1},

is usually narrower than the one-sided interval based on (4.2.24). It should be noted that the one-sided confidence intervals for M, with η_N as a natural lower bound, are of more interest for global random search theory than the usual two-sided ones.
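A short numerical sketch of both a point estimate and a one-sided bound of this kind follows. It assumes a synthetic c.d.f. F(t) = 1 - (1-t)² (so M = 1, α = 2), the interval [η_N, η_N + r_{k,1-γ}(η_N - η_{N-k})] with r_{k,1-γ} = [(1 - γ^{1/k})^{-1/α} - 1]^{-1}, and a deliberately crude quantile-matching point estimator that stands in for (but is not) the optimal coefficients a* of the text:

```python
import numpy as np

rng = np.random.default_rng(5)

def upper_bound(y, k, alpha, gamma):
    """One-sided confidence bound for M = vrai sup eta with known tail
    index alpha, built from the k+1 upper order statistics."""
    s = np.sort(y)
    eta_N, eta_Nk = s[-1], s[-1 - k]
    r = 1.0 / ((1.0 - gamma ** (1.0 / k)) ** (-1.0 / alpha) - 1.0)
    return eta_N + r * (eta_N - eta_Nk)

def point_estimate(y, k, alpha):
    """Crude quantile-matching estimate of M (not the optimal a*):
    under F(t) ~ 1 - c0*(M-t)**alpha, M - eta_{N-i} ~ ((i+1)/(c0*N))**(1/alpha)."""
    s = np.sort(y)
    eta_N, eta_Nk = s[-1], s[-1 - k]
    return eta_N + (eta_N - eta_Nk) / ((k + 1) ** (1.0 / alpha) - 1.0)

# synthetic sample: F(t) = 1 - (1 - t)^2 on [0, 1], so M = 1 and alpha = 2
covered = 0
for _ in range(400):
    y = 1.0 - np.sqrt(1.0 - rng.random(2000))
    covered += upper_bound(y, k=10, alpha=2.0, gamma=0.1) >= 1.0
coverage = covered / 400.0                 # should be close to 1 - gamma = 0.9

M_hat = point_estimate(1.0 - np.sqrt(1.0 - rng.random(50_000)), k=10, alpha=2.0)
```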
Statistical hypothesis testing procedures for M are of great importance in the present context, too. The standard situation here is as follows. Let there be a record value M_0 and an independent sample (4.2.1) of values of η = f(ξ), where ξ is a random vector distributed on a subset Z of X. Suppose also that η_N ≤ M_0, and it is to be decided whether the function f can have values in Z greater than M_0 (i.e. whether f can achieve its global maximum in Z). In other words, one has to test the statistical hypothesis H_0: M_Z > M_0 against the alternative H_1: M_Z ≤ M_0.

For testing the hypothesis H_0, one can use approaches similar to those stated for confidence interval construction. According to the approach leading to (4.2.25), the rejection region for H_0 is

    W = { Y: η_N + r_{k,1-γ}(η_N - η_{N-k}) < M_0 },           (4.2.26)

where r_{k,1-γ} is given in (4.2.25). In this test, the power function
    β_N(M,γ) = Pr{Y ∈ W}

admits, for N → ∞, an asymptotic representation as an integral of the form ∫_0^∞ t^k e^{-t} [·] dt, whose integrand depends on r_{k,1-γ}, the tail index α, and the difference M_0 - M.
For the minimization problem, the corresponding formulas are

    M̂_{N,k} = Σ_{i=0}^{k} a_i η_{i+1}                          (4.2.27)

and the analogues (4.2.28), (4.2.29), together with the minimization counterparts of the confidence interval and the rejection region, replacing (4.2.15), (4.2.20), (4.2.23), (4.2.25) and (4.2.26), respectively; the other formulas of the subsection given above are unchanged.
We turn now to another approach to constructing confidence intervals for M, which may be termed the improved Hartley-Pfaffenberger method. Let us start by quoting an auxiliary result from Hartley and Pfaffenberger (1971).

It is well known that t_i = F(η_i), i = 1,...,N, form the order statistics corresponding to an independent sample from the uniform distribution on [0,1]; their means and covariances are E t_i = μ_i, cov(t_i, t_j) = v_ij (for i,j = 1,...,N), where

    μ_i = i/(N+1),    v_ij = μ_i(1 - μ_j)/(N+2)    for i ≤ j.

The covariance matrix V of the vector t of the k upper order statistics t_{N-k+1},...,t_N is symmetric, of order k×k, with the corresponding elements v_ij. According to Hartley and Pfaffenberger (1971), the quadratic form

    s²(k,N) = (t - μ)′ V^{-1} (t - μ)

can be written as

    s²(k,N) = (N+1)(N+2) [ t²_{N-k+1}/(N-k+1) + Σ_{i=N-k+2}^{N+1} (t_i - t_{i-1})² ] - (N+2),

where t_{N+1} = 1.
The dependence of the critical value s_γ on N is rather mild: one may approximately take, for k = 5, 10, 15, 20, the values s_{0.05} = 15, 25, 33, 40 and s_{0.01} = 28, 40, 50, 55.
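The closed form of the quadratic form can be checked directly against its definition, building the covariance matrix from v_ij = μ_i(1 - μ_j)/(N+2), i ≤ j (a quick numerical consistency sketch; the particular N, k and random vector t are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
N, k = 40, 8

# exact means and covariances of the top-k uniform order statistics
i = np.arange(N - k + 1, N + 1)
mu = i / (N + 1.0)
V = np.minimum.outer(mu, mu) * (1.0 - np.maximum.outer(mu, mu)) / (N + 2.0)

t = np.sort(rng.random(N))[-k:]            # observed t_{N-k+1}, ..., t_N

# direct quadratic form s^2(k, N) = (t - mu)' V^{-1} (t - mu)
s2_direct = float((t - mu) @ np.linalg.solve(V, t - mu))

# closed form with the convention t_{N+1} = 1
d = np.diff(np.append(t, 1.0))             # t_i - t_{i-1}, i = N-k+2, ..., N+1
s2_closed = (N + 1.0) * (N + 2.0) * (t[0] ** 2 / (N - k + 1)
                                     + float((d ** 2).sum())) - (N + 2.0)
```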
Hartley and Pfaffenberger (1971) used the unreasonable supposition that the c.d.f. F(t) is represented by a Taylor expansion at t = M, rather than a condition of the type (a). We shall suppose instead that (b) is fulfilled and that k is such that the extreme order statistics η_N, ..., η_{N-k+1} fall into the vicinity of M, where the approximation

    t_{N-i} = F(η_{N-i}) ≈ 1 - c_0 (M - η_{N-i})^α    for i = 0, 1, ..., k-1

is valid; we substitute these values into the expression for s²(k,N), which then depends on c_0 and M as well as on k and N. Denoting s²(k,N) = s²(k,N,M,c_0) and

    s̃²(k,N,M) = min_{c_0>0} s²(k,N,M,c_0),

we come to the upper confidence bound M̂_{1-γ} for M of level 1-γ, defined as the solution of the equation s̃²(k,N,M_{1-γ}) = s_γ under the condition M_{1-γ} ≥ η_N. The corresponding confidence interval for M is [η_N, M̂_{1-γ}].
Since s²(k,N,M,c_0), as a function of c_0, is a polynomial of degree two, the expression for s̃²(k,N,M) is easily derived:

    s̃²(k,N,M) = s²(k,N,M,c*),

where c* is the minimizing value of c_0. Nothing is known, however, about the accuracy of the confidence intervals constructed by applying the improved Hartley-Pfaffenberger approach.
4.2.4 Statistical inference, when the value of the tail index α is unknown

When α is unknown, an estimator α̂ of the tail index given by (4.2.30), in which m < k enters as an auxiliary parameter, can be used. If k → ∞, m → ∞, k/N → 0 (for N → ∞), then α̂ is consistent and asymptotically unbiased; for the practically advantageous case k = 5m, an explicit expression for its asymptotic variance holds. A further estimator, in which α̂ is determined by (4.2.30) and ψ(k) = Γ′(k)/Γ(k) is the ψ-function, is more precise than α̂.
Define an auxiliary function by (4.2.32) and (4.2.33) (its form differs for 0 ≤ u ≤ 1 and for u > 1). An estimator M̂ of M can then be defined as the solution of the equation

    [ Σ_{j=0}^{k-1} log(1 + β_j(M̂)) ]^{-1} - [ Σ_{j=0}^{k-1} β_j(M̂) ]^{-1} = 1/(k+1),

where β_j(M) is determined by (4.2.23). Under some auxiliary conditions (including α > 2, N → ∞, k → ∞, k/N → 0), this estimator is asymptotically Gaussian with mean M and an explicitly computable variance. For large values of N and k, this result may be used for constructing confidence intervals and statistical testing procedures for M.
If k cannot be large (in the case of moderate values of N), then another way of constructing confidence intervals and hypothesis tests for M is more attractive. Its essence is in the result of de Haan (1981): if k → ∞, k/N → 0, N → ∞, then the test statistic (4.2.34) asymptotically follows the standard exponential distribution with density e^{-t}, t > 0.
If the minimization problem for f is stated, then the formulas (4.2.30) and (4.2.34) are to be transformed into (4.2.35) and

    (log k) log[(η_2 - M)/(η_1 - M)] / log[(η_k - η_3)/(η_3 - η_2)],

respectively.
Estimators of the c.d.f. F(t) defined by (4.2.2) can be widely used in global random search algorithms for the prediction of further search and for the construction of stopping rules or rules for switching to local search (if such a switch is designed). Again, the behaviour of F(t) in the vicinity of M = vrai sup η (i.e. for t such that F(t) ≈ 1) is of most interest.

The ways of estimating F(t) for t ≈ M are based on the asymptotic representation (4.2.11), which follows from supposition (a) and can be rewritten in the form

    F(t) ≈ Φ_α^{1/N}((t - M)/x_N).                             (4.2.36)

This asymptotic representation is valid for all t < M, but the closer t is to M, the closer the right- and left-hand sides of (4.2.36) are to each other.

The simplest way of using (4.2.36) for estimating F(t) consists of replacing M, x_N and α by their estimators. Thus, if one uses (4.2.15), (4.2.30) and M̂_{N,k} - η_N as an estimator for x_N, one arrives at the estimator

    F̂(t) = Φ_α̂^{1/N}((t - M̂_{N,k})/(M̂_{N,k} - η_N)).          (4.2.37)
For N → ∞ the estimator (4.2.37) is consistent, which implies (4.2.39); one may substitute α̂ for α in the latter if α is unknown. Combining (4.2.36) and (4.2.39), one obtains the approximation (4.2.40) for t ↑ M, as well as (4.2.41) for t > f* = min f, N → ∞.
Sometimes the value of the tail index α of the extreme value distribution can be determined owing to the following obvious result: if a c.d.f. F(t) is sufficiently smooth in a vicinity of M = vrai sup η < ∞ and there exists an integer β such that F^{(i)}(M) = 0 for 0 < i < β and 0 < |F^{(β)}(M)| < ∞, then (a) is fulfilled and α = β; see Gumbel (1958). The above sufficient condition for (a) is a particular case of (b) for integer β. If one applies the concepts of fractional differentiation (see Ariyawansa and Templeton (1983) for references), then the integrality requirement on β can be omitted, and the above condition coincides with (b). Note in passing that the condition (b) is somewhat stricter than (a), because the latter permits c_0 from (4.2.5) to be a slowly varying function.

The distinguishing feature of using the above-described statistical procedures in global random search lies in the fact that the c.d.f. F(t) has the form (4.2.2). Using this fact and prior knowledge about the behaviour of the objective function in the vicinity of a global maximizer, the tail index α of the extreme value distribution can often be determined exactly, as demonstrated below. The basic result in this direction is as follows.
Theorem 4.2.2. Let the conditions (c) through (f) (see Section 4.2.1) be satisfied. Then the condition (a) holds and α = n/β.

Proof. In terms of the notation used, the condition (a) is represented as follows: the limit of P(A(tε))/P(A(ε)) for ε ↓ 0 exists for all t > 0 and equals t^α. It is well known (see de Haan (1970)) that it suffices to require the existence of the limit only for t ∈ [0.5, 2]. Put z = x - x*. The assumptions (c), (d) and (f), together with the continuity and positivity of w, imply g_ε/ε → 0 for ε → 0. Further, for each t ∈ [0.5, 2] and ε > 0, two-sided bounds on P(A(tε)) hold, and analogous bounds are valid in the opposite direction.
The condition (f) characterizes the behaviour of the objective function near x*. There are two important special cases of (f). Assume that all the components of ∇f(x*) (the gradient of f at x*) are finite and non-zero, which usually happens if the extremum is attained on the boundary. Then H(z) = z′∇f(x*), w = 1, and therefore β = 1 and α = n. If it is assumed that f is twice continuously differentiable in some vicinity of x*, ∇f(x*) = 0, and the Hessian ∇²f(x*) is nondegenerate, then w = 1, H(z) = -z′[∇²f(x*)]z, β = 2, and α = n/2. These two special cases (under somewhat stricter conditions on P and x*) were known mainly due to de Haan (1981), who deduced them (as well as (a)) from the condition (g) rather than (f). The former case was investigated also by Patel and Smith (1983) for the problem of concave minimization under linear constraints.
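The interior special case is easy to verify by direct Monte Carlo. For f(x) = 1 - ||x||² on X = [-1,1]² with uniform P (so n = 2, β = 2), Theorem 4.2.2 gives α = 1, i.e. P{f(x) > M - ε} decays linearly in ε as ε → 0 (a sketch; the objective and sample sizes are assumptions of the example):

```python
import numpy as np

rng = np.random.default_rng(7)

# f(x) = 1 - ||x||^2 on X = [-1, 1]^2, P uniform: n = 2, beta = 2,
# so the predicted tail index is alpha = n/beta = 1
x = rng.uniform(-1.0, 1.0, size=(2_000_000, 2))
fx = 1.0 - (x ** 2).sum(axis=1)
M, eps = 1.0, 0.01

p1 = float(np.mean(fx > M - eps))          # P(A(eps))    ~ (pi/4) * eps
p2 = float(np.mean(fx > M - 2 * eps))      # P(A(2*eps))  ~ (pi/4) * 2 * eps
alpha_hat = np.log(p2 / p1) / np.log(2.0)  # empirical tail index
```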
In cases where it is not evident in advance that (4.2.7) or (4.2.8) holds, but the growth rate of f near x* is known, the following assertion may be useful.

Theorem 4.2.3. Let the conditions (a), (c), (d), (h) and (i) of Section 4.2.1 be met. Then α = n/β.

Proof. It follows from (c) that ε_5 > 0, and from (h) and (i) that for all u, 0 < u ≤ ε_5, and t ∈ [0.5, 2], two-sided bounds on P(A(tu)) hold; hence a two-sided bound (4.2.42) on the ratio P(A(tu))/P(A(u)) follows. By virtue of (a), the limit of P(A(tu))/P(A(u)) exists for u ↓ 0 and any t ∈ [0.5, 2], and it equals t^α. It follows from (4.2.42) that any value of α different from α = n/β is inadmissible. The proof is complete.

The conditions of Theorem 4.2.3 are slightly weaker than those of Theorem 4.2.2, except for the requirement of meeting (a); but Theorem 4.2.2 not only infers the expression for the tail index α, it also substantiates the fact that convergence to the extreme value distribution takes place.
The following statement is aimed at relaxing the uniqueness requirement for the global maximizer.

Theorem 4.2.4. Let P be equivalent to the Lebesgue measure on X and let the function f attain its global maximum at a finite number of points x_i* (i = 1, 2, ..., ℓ), in whose vicinities the tail indexes α_i can be estimated or determined. Then the condition (a) for the c.d.f. F(t) is met and α = min{α_1, ..., α_ℓ}.

The assertion follows from the fact that under our assumptions f is represented as

    f = Σ_{i=1}^{ℓ} f_i,

where each f_i (i = 1,...,ℓ) is a measurable function attaining the global maximum M at the unique point x_i*, as well as from the two following lemmas.
Lemma 4.2.1. If X = ∪_{i=1}^{ℓ} X_i, where the sets X_i are disjoint and p_i = P(X_i) > 0, then

    F(t) = Σ_{i=1}^{ℓ} p_i F_i(t),

where

    Σ_{i=1}^{ℓ} p_i = 1,    P_i(A) = P(A ∩ X_i)/p_i    for A ∈ ℬ,

and F_i is the c.d.f. corresponding to f and P_i. Indeed,

    F(t) = ∫_X 1_{[f(x)<t]} P(dx) = Σ_{i=1}^{ℓ} ∫_{X_i} 1_{[f(x)<t]} P(dx) = Σ_{i=1}^{ℓ} p_i F_i(t).
Lemma 4.2.2. If the condition (a) with parameters α = α_i is fulfilled for the c.d.f. F_i(t), i = 1, ..., ℓ, then it is also met for the c.d.f.

    F(t) = Σ_{i=1}^{ℓ} p_i F_i(t)    (where p_i > 0 for i = 1, ..., ℓ, Σ_{i=1}^{ℓ} p_i = 1)

with the value α_m = min{α_1, ..., α_ℓ} for the tail index.

Proof. The condition (a) for F_i means that the functions

    V_i(v) = 1 - F_i(M - 1/v)    (v > 0, i = 1, ..., ℓ)

can be written as

    V_i(v) = L_i(v) v^{-α_i},

where the L_i(v) are slowly varying functions, i.e. such functions that the limit of L_i(tv)/L_i(v) exists and equals 1 for any t > 0 and v → ∞.

It suffices to demonstrate that for v → ∞ the ratio

    A(t,v) = t^{α_m} (1 - F(M - 1/(tv))) / (1 - F(M - 1/v))

tends to 1. One has

    A(t,v) = t^{α_m} Σ_{i=1}^{ℓ} p_i V_i(tv) / Σ_{j=1}^{ℓ} p_j V_j(v) = Σ_{i=1}^{ℓ} p_i B_i(t,v),

where

    B_i(t,v) = t^{α_m} V_i(tv) / Σ_{j=1}^{ℓ} p_j V_j(v).

If α_i ≠ α_m, then B_i(t,v) → 0 (v → ∞) by virtue of the properties of slowly varying functions (see de Haan (1970)). If α_i = α_m and α_j > α_i for j ≠ i, then B_i(t,v) → 1/p_i (v → ∞). The existence of the limits in these cases is obvious. It remains to consider the case where several α_i's are simultaneously equal to α_m.
We shall prove the following: if the c.d.f.'s F_i(t), i = 1, ..., r meet the condition (a) with a common index α, then the c.d.f.

    F_0(t) = Σ_{j=1}^{r} q_j F_j(t)   (where q_j > 0 for j = 1, ..., r, Σ_{j=1}^{r} q_j = 1)

also meets (a) with the same index α.
Similarly
Proposition 4.2.1. Assume that the set of global maximizers X* = {arg max f} is a continuously differentiable m-dimensional manifold (0 ≤ m ≤ n−1), f is continuous in the vicinity of the set X*, and the conditions (d), (e) and (f) are satisfied for all x* ∈ X*. Then (a) is satisfied and α = (n−m)/β.
The proof is similar to that of Theorem 4.2.2.
From Proposition 4.2.1 and Lemma 4.2.2 the next assertion directly follows: if the set of global maximizers is the union

    X* = ∪_{j=1}^{ℓ} X_j*

of manifolds X_j* of dimensions m_j, then

    α = min_{1≤j≤ℓ} (n − m_j)/β_j.

Assume now that in a vicinity of the unique global maximizer x*

    M − f(x) = Σ_{j=1}^{ℓ} H_j(z^{(j)})

where the H_j are homogeneous functions of orders β_j, z = (z^{(1)}, ..., z^{(ℓ)}) ∈ Rⁿ, the z^{(j)} are disjoint groups of n_j variables of the vector z, and n_1 + ... + n_ℓ = n. Then (a) is satisfied and

    α = Σ_{j=1}^{ℓ} n_j/β_j.
For the simplest case n = 2, ℓ = 2, the proof differs from that of Theorem 4.2.2 in the following: for u → 0 there holds

    1 − F(M − u) ≈ c₁ u^{1/β₁+1/β₂} ∬_{x+y≤1} D(x,y) dx dy

where D stands for the Jacobian corresponding to the change of variables x = H₁(z^{(1)}), y = H₂(z^{(2)}). For the general case the proof is analogous.
The case ℓ = 2, β₁ = 1, β₂ = 2 typifies situations where the above result proves useful: this case corresponds to non-zero first derivatives of f at x* with respect to the variables z^{(1)}, and to zero first derivatives together with a non-degenerate Hessian matrix with respect to the variables z^{(2)}.
The following generalization of the situation in which Theorem 4.2.2 and similar assertions can be used is based on the possibility of exactly determining the tail index α in the case where the objective function values f(x) are evaluated with a random noise whose distribution does not depend on the location of x ∈ X and is concentrated on a finite interval. The generalization relies upon the following assertion, which can readily be proved.
Proposition 4.2.4. Let the condition (a) be satisfied for the distributions F₁, F₂ and F₁*F₂, the tail index of the extreme value distribution for F_i being α_i (i = 1, 2). Then for the convolution F₁*F₂ the tail index is α = max{α₁, α₂}.
It follows from Proposition 4.2.4 that if the tail index corresponding to the distribution of the random variable f(ξ) is α₁ and that of the noise ζ distribution is α₂, then the sum max f + vrai sup ζ is the upper bound of the random variable y = f(ξ) + ζ, and for y only max{α₁, α₂} can be the tail index.
A further generalization of the situation to which the above apparatus is applicable is possible if one abandons the independence assumption for the realizations x_i of the random vector ξ having distribution P(dx). To this end, it should be noted that the apparatus of statistical inference for the bounds of random variables is based on the fact that the distributions of the random variables (η_N − M)/(M − θ_N) converge to the extreme value distribution for N → ∞. This fact was shown (see Resnick (1987), Lindgren and Rootzen (1987)) to be true not only in the case of independent random variables y₁, y₂, ..., but to be generalizable to cases where this sequence is a homogeneous Markov chain exponentially converging to a stationary distribution, a strictly stationary m-dependent sequence, or a sequence of symmetrically dependent random variables. The apparatus described in this section is thus applicable to the majority of algorithms presented in Part 2. Section 4.4 will investigate another case of a dependent sample Y = {y₁, ..., y_N} for which the above apparatus is generalizable.
As a consequence of the results of the present section, we shall obtain a result on the exponential complexity of Algorithm 3.1.1. The mean length of the one-sided confidence interval (4.2.25) for M = max f of fixed level 1−γ is taken as the measure of algorithm accuracy; we shall study the growth rate of the number N of evaluations of f required for reaching a given accuracy as n → ∞.
Theorem 4.2.5. Let assumption (b) and the conditions of Theorem 4.2.2 or 4.2.3 be fulfilled. Then, in order to make the asymptotic mean length of the confidence interval (4.2.25) equal to ε, the number N of objective function evaluations in the algorithm of uniform random sampling of points in X has to grow at the rate N ~ c₀cⁿ (n → ∞), where the parameter c₀ has the same sense as in (b) and

    c = [(Ψ(k+1) − Ψ(1))/(−ε log(1 − γ^{1/k}))]^{1/β},   Ψ(u) = Γ′(u)/Γ(u)

is the psi-function. The theorem's assertion follows from
Lemma 4.2.3. Let the conditions of Theorem 4.2.5 be fulfilled. Then the mean length of the one-sided confidence interval (4.2.25) is asymptotically equal (for N → ∞, n → ∞) to the quantity indicated in Theorem 4.2.5.
Proof. It follows from Lemma 7.1.2 (given later) that the mean length of the confidence interval (4.2.25) for N → ∞ equals the product Δ₁Δ₂Δ₃ of three factors. If the conditions of Theorem 4.2.2 or 4.2.3 are satisfied, then α = n/β and the corresponding limit relations hold for n → ∞. It follows from the condition (b) that the representation (4.2.38) applies with α = n/β, which gives the asymptotics of the remaining factor for N → ∞. Combining the limit relations for Δ₁, Δ₂ and Δ₃, one obtains the desired result; this completes the proof of the lemma (and hence also that of Theorem 4.2.5).
4.3 Branch and probability bound methods
This section describes a class of global random search algorithms applying the mathematical apparatus developed in Chapter 7 and in the preceding section. These algorithms are closely related, in terms of their philosophy, to the branch and bound methods reviewed in Section 2.3, and the key concept in their description is that of prospectiveness.
A sub-additive set function φ: B → R, resulting from processing the outcomes of previous evaluations of the objective function and reflecting the possibility of locating the global optimizer in subsets, is referred to as a prospectiveness criterion. If for two sets Z₁, Z₂ ∈ B the inequality φ(Z₁) ≥ φ(Z₂) holds, then the location of the global optimizer in Z₁ is at least as probable as in Z₂, according to the prospectiveness criterion φ.
A number of set functions φ (for Z ∈ B) can be used as prospectiveness criteria; examples are:
a) φ(Z) is an estimate of the maximum

    M_Z = max_{x∈Z} f(x);

b) φ(Z) is an estimate of the integral ∫_Z f(x) ν(dx);
c) φ(Z) is an estimate of min_{x∈Z} f(x).
Deterministic upper bounds for M_Z, valid for certain (e.g. Lipschitz-type) functional classes, correspond to the standard branch and bound approach treated earlier in Section 2.3.4.
Branch and bound methods, used to advantage in various extremal problems, consist in rejecting those subsets of X that cannot contain an optimal solution and searching only among the remaining subsets, regarded as promising. Branch and bound methods may be summarized as the successive implementation, at each iteration, of the following three stages:
i) branching (decomposition) of one or several sets into a tree of subsets and evaluation of the objective function at points from the subsets;
ii) estimation of functionals that characterize the objective function over the obtained subsets (evaluation of subset prospectiveness criteria); and
iii) selection of the promising subsets for further branching.
In standard versions of branch and bound methods, deterministic upper bounds of the maximum of f on subsets are used as subset prospectiveness criteria. In doing so, all subsets Z are rejected whose upper bounds for M_Z do not exceed the current optimum estimate. Prospectiveness criteria in these methods can thus take only the values 1 or 0, which mean that branching of a subset should go on or be stopped, respectively.
In the following, consideration will be given to non-standard variants of branch and bound methods that will be referred to as branch and probability bound methods. Their distinctive feature is that the maximum estimates on the subsets are probabilistic (i.e. these estimates are valid with a high probability) rather than deterministic.
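The three stages just listed, combined with a probabilistic rejection rule, can be sketched in a few lines. The following Python skeleton is illustrative only: the prospectiveness test (keep a subset if its best observed value lies within a fixed tolerance of the record) is a crude placeholder for the probabilistic bounds discussed below, and the function name and parameters are assumptions, not the book's notation.

```python
import random

def branch_probability_bound(f, lo, hi, iters=15, n_pts=40, keep=4, seed=1):
    """Illustrative skeleton of a branch and probability bound iteration
    on [lo, hi]: (i) sample f on each current subset, (ii) score subsets
    by a prospectiveness criterion, (iii) reject unpromising subsets and
    branch the survivors.  The criterion here (best observed value with a
    fixed tolerance) is only a placeholder for a probabilistic bound."""
    rng = random.Random(seed)
    subsets = [(lo, hi)]
    record = float("-inf")
    for _ in range(iters):
        scored = []
        for a, b in subsets:
            best = max(f(rng.uniform(a, b)) for _ in range(n_pts))  # stage (i)
            record = max(record, best)
            scored.append((best, a, b))
        scored.sort(reverse=True)                # stage (ii): rank the subsets
        subsets = []
        for best, a, b in scored[:keep]:
            if best >= record - 0.1:             # stage (iii): probabilistic
                mid = (a + b) / 2                # rejection replaced by a
                subsets += [(a, mid), (mid, b)]  # crude threshold test here
        if not subsets:
            break
    return record
```

The cap `keep` prevents the tree of subsets from growing exponentially; in the methods of this section that role is played by the statistical rejection of unpromising sets.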
The methods under consideration are distinguished by (i) the organization of set branching, (ii) the kinds of prospectiveness criteria, and (iii) the rules for rejecting unpromising subsets.
Set branching depends on the structure of X and on the available software and computer resources. If X is a hyperrectangle, then it is natural to choose the same form for the branching sets Z_kj, where k is the iteration count and j is a set index. In the general case, simplicial sets, spheres, hyperrectangles and, sometimes, ellipsoids can naturally be chosen as the Z_kj. Two conditions are imposed on the choice of the Z_kj: their union should cover the domain of search, and the number of points where f is evaluated should be, in each set, sufficiently large for drawing proper statistical inference. There is no need for the sets Z_kj to be mutually disjoint for any fixed k.
Branching/decomposition of the search domain can be carried out either a priori (i.e. independently of the values of f) or a posteriori. Numerical experience indicates that the second approach provides more economical algorithms. For example, the following branching technique has proved to be advantageous. At each iteration k, first select in the search domain X_k a subdomain Z_k1 with centre at the current optimum estimate. The point corresponding to the record of f over X_k \ Z_k1 is the centre of the subdomain Z_k2.
Similar subdomains Z_kj (j = 1, ..., L) are isolated until either X_k is covered, or the hypothesis that the global maximum can occur in the residual set

    X_k \ ∪_{j=1}^{L} Z_kj

is rejected (the hypothesis can be verified by the procedure described in Section 4.2.3). In this way, the search domain X_{k+1} of the (k+1)-th iteration is determined by the union

    Z^{(k)} = ∪_{j=1}^{L} Z_kj     (4.3.1)
where η_m and η_{m−i} are the respective elements of the order statistics corresponding to the sample {y_λ = f(ξ_λ), λ = 1, ..., m}, i is much smaller than m, and the ξ_λ are independent realizations of a random vector on X that fall into Z_kj. The value P_kj can be treated in two ways: in the asymptotically worst case (for i = const, m → ∞) it is greater than or equal to the probability that the global maximum lies in Z_kj, provided that the hypothesis is true and that the hypothesis testing procedure is that of Section 7.3. In order to obtain (4.3.1) from (4.2.25) and (4.2.26), in which m, i, and M_k* are substituted for N, k, and M₀, it suffices to solve the resulting inequality with respect to 1−γ.
In the algorithm below, the number of points at the k-th iteration in a promising subset Z_kj is assumed to be (in the probabilistic sense) proportional to the value of the prospectiveness criterion φ_k(Z_kj); further branching is performed over those sets Z_kj whose values (4.3.1) are not less than given numbers δ_k.
Algorithm 4.3.1.
1. Set k = 1, X₁ = X, M₀* = −∞. Choose a distribution P₁.
2. Sampling N_k times from the distribution P_k, obtain a random sample

    Ξ_k = {x₁^{(k)}, ..., x_{N_k}^{(k)}}.

3.–4. Evaluate f at the sampled points, update the record M_k*, and construct subsets Z_kj of X_k, where the Z_kj are measurable subsets of X having a sufficient number of points from Ξ_k for drawing statistical inference.
5. For each subset Z_kj, compute (not necessarily through (4.3.1)) the values of the prospectiveness criterion φ_k(Z_kj).
6. Reject from the search domain X_k those subsets Z_kj for which φ_k(Z_kj) ≤ δ_k; let L_k be the number of remaining subsets Z_kj.
7. Put

    P_{k+1}(dx) = Σ_{j=1}^{L_k} p_j^{(k)} Q_kj(dx)     (4.3.2)

where p_j^{(k)} = 1/L_k and the Q_kj are the uniform distributions on the sets Z_kj (j = 1, ..., L_k). If φ_k is a nonnegative criterion, then we may also take

    p_j^{(k)} = φ_k(Z_kj) / Σ_{i=1}^{L_k} φ_k(Z_ki).
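Sampling from a mixture of the form (4.3.2) is straightforward: first draw a subset index j with probability p_j^{(k)}, then draw a point uniformly inside Z_kj. A minimal sketch for axis-aligned boxes (the function and parameter names are illustrative assumptions):

```python
import random

def sample_mixture(boxes, n, weights=None, seed=0):
    """Draw n points from P(dx) = sum_j p_j Q_j(dx), where Q_j is the
    uniform distribution on boxes[j] = (lower_corner, upper_corner)."""
    rng = random.Random(seed)
    if weights is None:
        weights = [1.0 / len(boxes)] * len(boxes)  # p_j = 1/L_k as in (4.3.2)
    pts = []
    for _ in range(n):
        lo, hi = rng.choices(boxes, weights=weights, k=1)[0]  # pick a subset
        pts.append([rng.uniform(a, b) for a, b in zip(lo, hi)])
    return pts
```

Passing the normalized criterion values φ_k(Z_kj) as `weights` gives the second, proportional variant of step 7.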
A rejection rule of this kind admits an upper bound γ_kj for the probability of missing the global maximum in the rejected sets Z_kj as computed via (4.3.1). The total probability of missing the global maximum point, as determined by (4.3.1), is then at most max γ_kj. Indeed, let us take the set Z_kj₀ = Z_kj which contains the point corresponding to the statistic η_m of the set

    Z = ∪_{k,j} Z_kj.

Then η_{m−i} for Z is not less than η_{m−i} for Z_kj₀, and the corresponding record (it plays the role of M_k* for Z) is not less than M_k*. But P_kj defined via (4.3.1) is an increasing function of both η_{m−i} and M_k* for fixed α, i, and η_m.
The philosophy of constructing branch and probability bound methods is closely related to that of algorithms based on objective function estimation. However, in the methods under consideration only the functionals M_{Z_kj} are estimated, rather than the function f itself. If one assumes that in Algorithm 4.3.1 δ_k = −∞ for all k = 1, 2, ..., then one arrives at a variant of adaptive random search in which more promising domains are looked through more thoroughly. Algorithm 4.3.1 is then a special case of Algorithm 5.2.1, where f_k(x) for all points of Z_kj is equal to φ_k(Z_kj) (if the subsets Z_kj are disjoint for fixed k).
If the evaluations of f are costly, then one should be extremely cautious in planning computations. After completing a certain number of evaluations of f, one should increase the amount of auxiliary computations with the aim of extracting and exploiting as much information about the objective function as possible. To extract this information one should: compute the probabilities (4.3.1); construct various estimates of M_{Z_kj}; check the hypothesis about the value of the tail index α of the extreme value distribution, which can indicate whether a vicinity of the global maximizer has been reached; estimate the c.d.f. F(t) for values of t close to M (this will enable one to draw a conclusion about the approximate effort required by the remaining computations); and, in addition, one can estimate f (and related functions of interest) in order to recheck and update the information. Decisions about the prospectiveness of subsets should be made applying suitable statistical procedures. Naturally, these procedures can be taken into consideration only if the algorithms are implemented in an interactive fashion.
The major part of the assertions made in this section have precise meaning only if, for the corresponding c.d.f. F(t), the condition (a) of Section 4.2 is met and the parameter α of the extreme value distribution is known. As for the condition (a), practice shows that it may always be regarded as met if the problem at hand is not too specific. In principle, statistical inference about α can be made via the procedure of Section 4.2.4, carried out successively as points are accumulated. However, it is advisable to use the results of Section 4.2.6 if possible, since the accuracy of procedures for statistical inference about α is high only for large sample sizes. According to these results, for the case when X ⊂ Rⁿ and the objective function f is twice continuously differentiable and approximated by a non-degenerate quadratic form in the vicinity of a global maximizer x*, one can always set α = n/2. In doing so one may be confident that statistical inferences are asymptotically true for subsets Z ⊂ X containing x* (together with subsets Z containing maximizers as interior points). The prospectiveness of other subsets Z may be lower than if their true values of α were used, but this will not crucially affect the methods in question.
Let us finally describe a variant of the branch and probability bound method that is convenient for implementation, uses most of the above recommendations, and has proved to be efficient for a wide range of practical problems.
Algorithm 4.3.2.
1. Set k = 1, X₁ = X, M₀* = −∞.
2. Sample N times the uniform distribution on the search domain X_k, obtaining Ξ_k = {x₁^{(k)}, ..., x_N^{(k)}}.
3. Evaluate f(x_j^{(k)}) for j = 1, ..., N and update the record value M_k*. Check the stopping criterion (closeness of M_k* and the optimal linear estimator (4.2.15) for M with α = n/2).
4. Set Y_{k,0} = X_k, j = 1.
5. Set Y_{k,j} = Y_{k,j−1} \ Z_kj, where Z_kj is a cube (or a ball) of volume ρμ_n(X_k) centred at the point having the maximal value of f among those points of Y_{k,j−1} at which the objective function has been evaluated.
6. If the number m of points in Y_{k,j} with known objective function values is insufficient for drawing statistical inference (i.e. m ≤ m₀), then set X_{k+1} = X_k and go to Step 9. If m > m₀, then take
the points of

    Y_{k,j} ∩ ∪_{t=1}^{k} Ξ_t

as the sample for drawing the statistical inference.
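A compressed sketch of the loop of Algorithm 4.3.2 on the unit cube may help fix ideas; the statistical test of Step 6 is replaced here by a simple point-count check, the names `rho` and `m0` stand for ρ and m₀ above, and everything else is an illustrative assumption rather than the book's exact procedure:

```python
import random

def branch_prob_bound_cubes(f, n, rho=0.1, m0=20, N=200, iters=10, seed=2):
    """Sketch of Algorithm 4.3.2 on [0,1]^n: sample uniformly (Step 2),
    evaluate f and update the record (Step 3), peel off a cube of volume
    rho around the record point (Steps 4-5), and stop peeling when fewer
    than m0 evaluated points remain in the residual set (Step 6)."""
    rng = random.Random(seed)
    half = 0.5 * rho ** (1.0 / n)        # half-side of a cube of volume rho
    record, record_pt = float("-inf"), [0.5] * n
    for _ in range(iters):
        pts = [[rng.random() for _ in range(n)] for _ in range(N)]   # Step 2
        vals = [(f(x), x) for x in pts]                              # Step 3
        best, best_pt = max(vals)
        if best > record:
            record, record_pt = best, best_pt
        residual = [v for v, x in vals                               # Steps 4-6
                    if max(abs(c - r) for c, r in zip(x, record_pt)) > half]
        if len(residual) <= m0:
            break
    return record, record_pt
```

In the actual algorithm the residual points feed the statistical procedures of Section 4.2 instead of the crude count used here.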
Let P be a probability measure on (X, B) that is absolutely continuous with respect to the Lebesgue measure μ_n on X (in the important particular case P(dx) = μ_n(dx)/μ_n(X) it is the uniform distribution on X). Define Ω = X^N, Ξ = (x₁, ..., x_N), where x_i ∈ X for i = 1, ..., N. Assume that Ξ is a random vector on Ω, denote its distribution by Q(dΞ), and suppose that N = mℓ, where m and ℓ are integers.
Divide X into m disjoint subsets X₁, ..., X_m of the same P-measure:

    X = ∪_{j=1}^{m} X_j,   X_j ∈ B,   P(X_j) = 1/m.

The stratified sample of size N then corresponds to the distribution

    Q(dΞ) = Π_{j=1}^{m} P_j(dx_{ℓj−ℓ+1}) P_j(dx_{ℓj−ℓ+2}) ... P_j(dx_{ℓj})

where P_j(A) = mP(A ∩ X_j). In the particular case m = 1, the stratified sample is independent.
Sometimes it is convenient in practice to organize a stratified sample so that the arrangement of the elements of Ξ is random. Under sequential sampling of the random points x_i ∈ Ξ, this is attained by means of a uniform random choice of the distribution to be sampled among those of the distributions P₁, ..., P_m that have been sampled fewer than ℓ times.
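For X = [0,1] with m equal strata, this randomly arranged stratified sample might be sketched as follows (the function name and parameters are illustrative):

```python
import random

def stratified_sample(m, ell, seed=0):
    """Return m*ell points on [0,1]: exactly ell uniform points fall in
    each stratum [j/m, (j+1)/m), and at every step the stratum to sample
    is chosen uniformly among those sampled fewer than ell times."""
    rng = random.Random(seed)
    counts = [0] * m
    pts = []
    for _ in range(m * ell):
        j = rng.choice([i for i in range(m) if counts[i] < ell])
        counts[j] += 1
        pts.append((j + rng.random()) / m)   # uniform point inside stratum j
    return pts
```

Setting m = 1 recovers ordinary independent sampling, in agreement with the remark above.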
We shall describe an economical way of organizing a stratified sample when the feasible region X is the cube X = [0,1]ⁿ and the stratification consists of dividing X into hyperrectangles X_j.
Let us represent the number m as a product m = m₁...m_n, where m_i is the number of intervals into which the cube X is divided along the i-th coordinate. Suppose that m_i = d^{p_i}, where d, p₁, ..., p_n are integers. In this case the intervals have length 1/m_i, and it is convenient to put them into correspondence with the ordered collections of p_i digits of the d-adic representation (i.e. of the digits 0, 1, ..., d−1): this is connected with the fact that all points of each interval have the same first p_i digits. A random point from an interval is easily obtained by appending to the given collection of p_i digits, on the right-hand side, further digits corresponding to the d-adic representation of a random number.
Hyperrectangles X_j of volume 1/m correspond to ordered collections θ_j of p = p₁ + ... + p_n d-adic digits: each such collection θ = (u₁, ..., u_p) is naturally identified with the number θ = u₁d^{p−1} + u₂d^{p−2} + ... + u_p. Let us show how multiplicative random number generators should be used for obtaining pseudorandom numbers so that a stratified sample would possess the above-mentioned property of random arrangement of the X_j.
If m = 2^p, i.e. d = 2, then to get the numbers θ_j we may take e.g. the generator

    θ_j = Aθ_{j−1} (mod 2^{p+2})

where A ≡ 5 (mod 8), θ₀ ≡ 1 (mod 4), and in each number θ_j only its first p binary digits are to be used. As follows from Ermakov (1975), p. 405, this generator has period length m. By means of this generator the sets X₁, ..., X_m are chosen pseudorandomly until all of them have been chosen. If ℓ > 1 and it is required to choose the sets X₁, ..., X_m pseudorandomly again, then the same generator can be used with a new initial value θ₀.
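The modulus 2^{p+2} in the following sketch is an assumption consistent with the stated period length m = 2^p (a multiplicative generator modulo 2^{p+2} with A ≡ 5 (mod 8) and θ₀ ≡ 1 (mod 4) attains period 2^p). Taking the leading p bits of each member of the sequence then enumerates the strata in pseudorandom order:

```python
def strata_order_pow2(p, a=5, theta0=1):
    """Enumerate the 2**p strata in pseudorandom order using the
    multiplicative generator theta_j = a*theta_{j-1} mod 2**(p+2), with
    a = 5 (mod 8) and theta0 = 1 (mod 4); only the first p binary digits
    of each theta_j are used as the stratum index."""
    mod = 1 << (p + 2)
    order, theta = [], theta0
    for _ in range(1 << p):
        theta = (a * theta) % mod
        order.append(theta >> 2)    # keep the leading p bits
    return order
```

For p = 4 the call `strata_order_pow2(4)` visits each of the 16 strata exactly once before the sequence repeats.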
If m = 10^p, i.e. d = 10, then in order to obtain the numbers θ_j we may proceed as follows. Take a multiplicative generator v_j = Av_{j−1} (mod K), where A is a primitive root with respect to the modulus K, K ≥ m, and K is the prime number nearest to m; one can take v₀ = K−1. This generator has period K−1, i.e. it pseudorandomly gives K−1 different numbers from the collection {1, 2, ..., K−1}. Those numbers v_j which exceed m should be omitted, and for the remaining ones we set θ_j = v_j − 1. The number of omitted values equals K−m−1 and does not exceed m−1 (as follows from the so-called Bertrand postulate).
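A sketch of this prime-modulus variant follows; for illustration K and a primitive root A are passed in explicitly rather than searched for:

```python
def strata_order_prime(m, K, A, v0=None):
    """Enumerate the strata 0..m-1 in pseudorandom order via the
    multiplicative generator v_j = A*v_{j-1} (mod K), with K prime and
    A a primitive root mod K; values v_j > m are omitted and the
    remaining ones give the stratum indices theta_j = v_j - 1."""
    v = K - 1 if v0 is None else v0
    order = []
    for _ in range(K - 1):          # full period of the generator
        v = (A * v) % K
        if v <= m:
            order.append(v - 1)
    return order
```

With m = 10, K = 11 and the primitive root A = 2 no values are omitted; with K = 13 exactly K − m − 1 = 2 values are skipped, in accordance with the count stated above.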
4.4.2 Statistical inference for the maximum of a function based on its values at the points of a stratified sample
Let X be divided into m disjoint subsets X_jm (j = 1, ..., m) of equal volumes μ_n(X_jm) = μ_n(X)/m, and let there be given ℓ uniformly distributed points x_{j1}, ..., x_{jℓ} in each subset X_jm. Then the sample

    Ξ_{m,ℓ} = {x_{js}: j = 1, ..., m, s = 1, ..., ℓ}     (4.4.1)

is stratified. Below, F*_m(t) denotes the c.d.f. (4.4.2) of the objective function values over the subset containing the global maximizer, while F(t) denotes the c.d.f. (4.4.3), which is the c.d.f. (4.2.2) corresponding to the uniform distribution P₀ on X; θ_{m,ℓ} is the (1−1/ℓ)-quantile of F*_m(t), determined by the condition F*_m(θ_{m,ℓ}) = 1 − 1/ℓ; and θ_{mℓ} is the (1−1/(mℓ))-quantile of the c.d.f. F(t), determined by F(θ_{mℓ}) = 1 − 1/(mℓ).
From the theoretical point of view, the case when the number m of strata of the set X tends to infinity while the number ℓ of points in each subset X_jm stays constant is of most interest, and will be considered below.
The main asymptotic properties of the order statistics η_{m,ℓ−i}, i = 0, 1, ..., ℓ−1, are presented in the following.
Theorem 4.4.1. Let a functional set F consist of continuous functions f given on X such that the condition (a) from Section 4.2 for the c.d.f. (4.4.3) holds and the unique global maximizer x* has a certain distribution R on (X, B) equivalent to the Lebesgue measure μ_n on (X, B). Then for m → ∞, ℓ = const, with R-probability one, the following statements are valid:
a) the limit distribution of the sequence of random variables

    (η_{m,ℓ} − M)/(M − θ_{mℓ})     (4.4.4)

has the c.d.f.

    Φ_{α,ℓ}(u) = (max{0, 1 − (−u)^α/ℓ})^ℓ   for u < 0,     (4.4.5)
    Φ_{α,ℓ}(u) = 1   for u ≥ 0;

b)

    F*_m^{−1}(v) − M ~ −(M − θ_{mℓ})[ℓ(1−v)]^{1/α};     (4.4.6)

c)

    M − Eη_{m,ℓ−i} ~ Ψ(ℓ, 1/α)(M − θ_{mℓ}) b_i     (4.4.7)

where the b_i are defined by (4.4.8) and the function Ψ by (4.4.9);
d)

    E(η_{m,ℓ−i} − M)(η_{m,ℓ−j} − M) ~ Ψ(ℓ, 2/α)(M − θ_{mℓ})² λ_ij     (4.4.10)

where the λ_ij are defined by (4.4.11).
Proof. It follows from the conditions listed above that with R-probability one, x* does not fall on the boundary of the sets X_jm for any j ≤ m, m = 1, 2, ...; therefore we shall deal only with such events during the proof.
Consider the ℓ greatest values among

    {y_i = f(x_i), x_i ∈ Ξ_{m,ℓ}}.

Since f is continuous and its global maximum is attained at the unique point x*, for m → ∞ the points x_{(N)}, ..., x_{(N−ℓ+1)} belong only to the set X*_m containing x*. Therefore, under the given conditions, the c.d.f.'s (4.4.2) and (4.4.3) are asymptotically connected by the relationship

    F*_m(t) = mF(t) − m + 1.

Hence, and from the definitions of the quantities θ_{m,ℓ} and θ_{mℓ}, there follows

    θ_{m,ℓ} = θ_{mℓ}.     (4.4.13)
Indeed, for u < 0,

    P{(η_{m,ℓ} − M)/(M − θ_{mℓ}) < u} = [F*_m(M + u(M − θ_{mℓ}))]^ℓ
        ~ [m exp{−(−u)^α/(ℓm)} − m + 1]^ℓ → (1 − (−u)^α/ℓ)^ℓ,   m → ∞.

The asymptotic relation obtained is equivalent to the statement a) of Theorem 4.4.1; it may also be rewritten in the form (4.4.5).
The density of the order statistic η_{m,ℓ−i} equals

    p_{m,ℓ−i}(t) = ℓ C^i_{ℓ−1} F*_m^{ℓ−i−1}(t)(1 − F*_m(t))^i F*_m′(t).

Thus

    Eη_{m,ℓ−i} = ℓ C^i_{ℓ−1} ∫ t F*_m^{ℓ−i−1}(t)(1 − F*_m(t))^i dF*_m(t)

and, after the change of variable u = F*_m(t),

    Eη_{m,ℓ−i} = ℓ C^i_{ℓ−1} ∫₀¹ F*_m^{−1}(u) u^{ℓ−i−1}(1 − u)^i du.

Substituting (4.4.6), we obtain

    Eη_{m,ℓ−i} ~ ℓ C^i_{ℓ−1} [M ∫₀¹ u^{ℓ−i−1}(1 − u)^i du − (M − θ_{mℓ}) ℓ^{1/α} ∫₀¹ u^{ℓ−i−1}(1 − u)^{i+1/α} du].
The statement c) has been proved. Finally, statement d) is proved in analogy with Lemma 7.1.3. The joint distribution density of the random variables η_{m,ℓ−i} and η_{m,ℓ−j} for i ≥ j equals

    p_{m,ℓ−i,ℓ−j}(t,s) = A F*_m^{ℓ−i−1}(t) F*_m′(t) [F*_m(s) − F*_m(t)]^{i−j−1} F*_m′(s)(1 − F*_m(s))^j,   t ≤ s,

where A is the corresponding normalizing constant. This way,

    E(η_{m,ℓ−i} − M)(η_{m,ℓ−j} − M) = ∫ ds ∫_{−∞}^{s} (t − M)(s − M) p_{m,ℓ−i,ℓ−j}(t,s) dt = I     (4.4.15)

where I is evaluated by substituting (4.4.6). Changing the integration order in the integral I₁ and using the beta-function property

    ∫₀^a t^{m−1}(a − t)^{k−1} dt = a^{m+k−1} B(m,k),   a, m, k > 0,

we obtain the required expression. Substituting A and the expression obtained for I₁ into (4.4.15), we get (4.4.10). The theorem is proved completely.
It follows from the theorem that the objective function records corresponding to stratified samples are asymptotically subject to probabilistic laws similar to those which hold in the case of independent samples. It is easily seen that for ℓ → ∞ the mentioned asymptotic laws coincide. In particular, Φ_{α,∞}(u) = Φ_α(u), where Φ_α and Φ_{α,ℓ} are determined by (4.2.14) and (4.4.5), respectively. This implies that the limit distributions of record values for independent and stratified samples coincide for m → ∞, ℓ → ∞.
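Taking (4.4.5) in the reconstructed form Φ_{α,ℓ}(u) = (1 − (−u)^α/ℓ)^ℓ for u < 0 (an assumption of this sketch), the coincidence for ℓ → ∞ is just the elementary limit (1 − x/ℓ)^ℓ → e^{−x}; a quick numeric check:

```python
import math

def phi_alpha_ell(u, alpha, ell):
    """Reconstructed stratified-sampling limit c.d.f. (4.4.5):
    (1 - (-u)**alpha/ell)**ell for u < 0, and 1 for u >= 0."""
    if u >= 0:
        return 1.0
    return max(0.0, 1.0 - (-u) ** alpha / ell) ** ell

def phi_alpha(u, alpha):
    """Extreme value c.d.f. (4.2.14): exp(-(-u)**alpha) for u < 0."""
    return 1.0 if u >= 0 else math.exp(-((-u) ** alpha))
```

For any fixed u < 0 and α > 0 the two functions agree to within O(1/ℓ), while for finite ℓ the stratified c.d.f. lies strictly below the independent one, since 1 − x < e^{−x} for x > 0.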
The conditions of Theorem 4.4.1 are slightly stricter than the condition (a) of Section 4.2: under condition (a) alone, the record values of an independent sample are subject to analogous asymptotic relations. The additional condition requires the existence of a probability distribution R for x* which is equivalent to the Lebesgue measure on X. This distribution can be regarded as a prior distribution for the maximizer, and thus the set F of objective functions is regarded as being stochastic. The requirement is necessary in order to ensure that only such events be considered in which x* does not hit the boundary of any set X_jm. In practice the number m is finite; thus the requirement is not very restrictive, the more so as the distribution R does not occur in the formulas.
The availability condition of a prior distribution R for x* may be replaced by a more explicit, but more restrictive one. In general, such a condition can be satisfied only with some error. It is satisfied with high accuracy in the case where Ξ is a Πτ-grid (see Section 2.2.1). Although the hitting of elements of a Πτ-grid into the sets X_jm cannot be considered as uniformly distributed in the probabilistic sense, a stratified sample pattern is a good approximation of a Πτ-grid.
To estimate the parameter M, we shall use linear estimators of the type

    M_{m,k} = Σ_{i=0}^{k} a_i η_{m,ℓ−i}     (4.4.16)

where k ≤ ℓ−1 and a₀, a₁, ..., a_k are some real numbers defining the estimator.
From (4.4.7) it follows that, under the conditions of Theorem 4.4.1, the asymptotic relation (4.4.17) for the bias of (4.4.16) is valid, where

    λ = (1, 1, ..., 1)′,

the values b_i are defined by (4.4.8) and the function Ψ by (4.4.9). Now it follows from (4.4.10) that if the mentioned conditions are met, then the asymptotic relation (4.4.18) for the mean square error holds, where Λ is a symmetric matrix of order (k+1)×(k+1) with elements λ_ij defined by formula (4.4.11) for 0 ≤ j ≤ i ≤ k.
Since f is continuous, the c.d.f.'s F(t) and F*_m(t) are continuous, and therefore θ_{mℓ} → M, M − θ_{mℓ} → 0 for m → ∞ and each integer ℓ. Applying now the Chebyshev inequality and (4.4.18), we can conclude that under the conditions of Theorem 4.4.1 the estimators (4.4.16) converge to M a′λ for m → ∞. Hence, the sequence of estimators (4.4.16) is consistent if and only if the relation

    a′λ = Σ_{i=0}^{k} a_i = 1     (4.4.19)

holds. Under (4.4.19), (4.4.7) yields

    EM_{m,k} − M ~ −(M − θ_{mℓ}) Ψ(ℓ, 1/α) a′b,     (4.4.20)

where b = (b₀, ..., b_k)′. (The condition

    a′b = Σ_{i=0}^{k} a_i Γ(1/α + i + 1)/Γ(i + 1) = 0     (4.4.21)

is called the unbiasedness requirement.)
From (4.4.20) it follows that a natural criterion for the optimal selection of the parameters a is the quantity a′Λa, which is to be minimized over the set of vectors a satisfying either the restriction (4.4.19) alone or (4.4.19) together with (4.4.21). These optimization problems are similar to those which will be treated in Section 7.1 for the case of independent sampling. For instance, the vector a = a* determining the optimal consistent linear estimator is determined by (4.2.18).
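The minimizer of a′Λa under the single constraint a′λ = 1 has the standard closed form a* = Λ⁻¹λ/(λ′Λ⁻¹λ) (a Lagrange multiplier computation; whether this coincides with the book's (4.2.18) is not verified here). A self-contained numerical sketch:

```python
def optimal_consistent_weights(lam):
    """Minimize a' Lam a subject to sum(a) = 1 for a symmetric positive
    definite matrix lam (list of rows): a* = Lam^{-1} 1 / (1' Lam^{-1} 1).
    The system Lam w = 1 is solved by Gaussian elimination with pivoting."""
    n = len(lam)
    aug = [row[:] + [1.0] for row in lam]             # augmented [Lam | 1]
    for i in range(n):
        piv = max(range(i, n), key=lambda r: abs(aug[r][i]))
        aug[i], aug[piv] = aug[piv], aug[i]
        for r in range(i + 1, n):
            m = aug[r][i] / aug[i][i]
            for c in range(i, n + 1):
                aug[r][c] -= m * aug[i][c]
    w = [0.0] * n
    for i in range(n - 1, -1, -1):                    # back substitution
        w[i] = (aug[i][n]
                - sum(aug[i][c] * w[c] for c in range(i + 1, n))) / aug[i][i]
    s = sum(w)
    return [wi / s for wi in w]
```

For the matrix [[2, 1], [1, 3]] the optimal consistent weights are (2/3, 1/3), which indeed sum to one.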
Confidence intervals for M can be constructed from consistent estimates of the type (4.4.16) by applying the asymptotic inequality which follows, for m → ∞, from (4.4.20) and the Chebyshev inequality. Another way of constructing confidence intervals for M is based on the following statement.
Lemma 4.4.1. Let the conditions of Theorem 4.4.1 be fulfilled. Then the sequence of statistics

    (M − η_{m,ℓ})/(η_{m,ℓ} − η_{m,ℓ−1})

converges in distribution, as m → ∞, to the law with c.d.f. (u/(u+1))^α, u > 0.
Proof. For the order statistics η_{m,ℓ−i}, i = 0, 1, ..., ℓ−1, an analogue of the Rényi representation is valid, in which ξ₀, ..., ξ_i are independent random variables distributed exponentially with the density e^{−u}, u > 0. Combining this with (4.4.6), we have

    M − η_{m,ℓ−i} ~ (M − θ_{mℓ})[ℓ(1 − exp{−Σ_{j=0}^{i} ξ_j/(ℓ−j)})]^{1/α},   m → ∞.     (4.4.22)
Taking into account that w = ((u+1)/u)^α > 1 and applying the asymptotic equation (4.4.22) for i = 0, 1, after a change of variables we obtain

    P{(M − η_{m,ℓ})/(η_{m,ℓ} − η_{m,ℓ−1}) < u}
        → ℓ(ℓ−1) ∫_{1−1/w}^{1} (∫₀^{wx+1−w} y^{ℓ−2} dy) dx
        = ℓ ∫_{1−1/w}^{1} (wx+1−w)^{ℓ−1} dx = 1/w = (u/(u+1))^α.

This proves the lemma and yields the one-sided confidence interval for M of asymptotic level 1−γ:

    [η_{m,ℓ}, η_{m,ℓ} + (η_{m,ℓ} − η_{m,ℓ−1})/((1−γ)^{−1/α} − 1)].
It follows from (4.4.7) that the mean length of this confidence interval asymptotically equals, for m → ∞, the quantity (4.4.23). Note that this quantity is 1/Ψ(ℓ, 1/α) times smaller than the asymptotic mean length of the analogous interval constructed from an independent sample. It also follows from Lemma 4.4.1 that the statistical hypothesis testing procedures for M using the two maximal order statistics coincide for stratified and independent samples. (We remark that an attempt to generalize Lemma 4.4.1 to the case k > 1 failed; therefore the corresponding confidence intervals and statistical hypothesis tests for k > 1 were not constructed for the case of stratified sampling.)
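With α known, the interval of Lemma 4.4.1 is immediate to compute from the two largest sample values; a sketch, based on the formula as reconstructed above:

```python
def confidence_interval_M(eta_top, eta_second, alpha, gamma):
    """One-sided interval of asymptotic level 1-gamma for M = max f based
    on the two largest order statistics eta_top >= eta_second:
    [eta_top, eta_top + (eta_top - eta_second)/((1-gamma)**(-1/alpha) - 1)]."""
    c = (1.0 - gamma) ** (-1.0 / alpha) - 1.0
    return eta_top, eta_top + (eta_top - eta_second) / c
```

Shrinking γ widens the interval, as it should, and a smaller gap between the two records narrows it.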
We shall first show that stratified sampling dominates independent sampling when the quality criterion is the record value (current optimum estimate) of the objective function.
Let F = C(X) be the set of continuous functions on X and P be a probability measure on (X, B) absolutely continuous with respect to the Lebesgue measure. Let N = mℓ, where m and ℓ are natural numbers, and let a vector Ξ be chosen in Ω at random in accordance with a distribution Q(dΞ). We shall call the ordered pair π = (K[f], Q) a global random search procedure for optimizing f ∈ F on X.
Theorem 4.4.2. The procedure π_m for m > 1 dominates the procedure π₁ in C(X), i.e. the inequality (4.4.25) holds for all f ∈ C(X), t ∈ (−∞, ∞), and there exists a function g ∈ C(X) for which it is strict.
Proof. Denote by A_t = f^{−1}{(−∞, t)} the inverse image of the set (−∞, t) under the mapping f and let β_i = mP(A_t ∩ X_i). The inequality (4.4.25) now follows from the classical inequality between the arithmetic and geometric means,

    (1/m) Σ_{i=1}^{m} β_i ≥ (Π_{i=1}^{m} β_i)^{1/m}.     (4.4.26)
We know that (4.4.26) is valid for arbitrary nonnegative numbers β₁, ..., β_m and becomes an equality only in the case β₁ = ... = β_m; we shall show that there exists a function f ∈ C(X) for which not all of β₁, ..., β_m are equal (consequently, the inequalities (4.4.25) and (4.4.26) are strict). Choose a function f that is not equal to a constant but identically equals min f on the set X₁. For such a function f and each t ∈ (min f, max f) one has

    P(A_t) = Σ_{i=1}^{m} P(A_t ∩ X_i) = (1/m) Σ_{i=1}^{m} β_i < 1,

while β₁ = mP(X₁) = 1. Therefore some of the quantities β_i differ, and the inequality (4.4.25) is strict for each t ∈ (min f, max f). The proof is completed.
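The AM-GM step can be checked numerically. For f(x) = x on [0,1] with strata X_i = [(i−1)/m, i/m), one has β_i = m·P(A_t ∩ X_i), and the probability that all N = mℓ sampled values stay below t is (Πβ_i)^ℓ under stratification versus P(A_t)^N = ((Σβ_i/m)^m)^ℓ under independent sampling. The following function (name and setup are illustrative) computes both:

```python
def record_below_t_probs(t, m, ell):
    """For f(x) = x on [0,1] and strata [(i-1)/m, i/m): return the pair of
    probabilities that every one of the N = m*ell sampled values is < t,
    under stratified and under independent sampling respectively."""
    betas = [m * max(0.0, min(t, (i + 1.0) / m) - i / m) for i in range(m)]
    p_strat = 1.0
    for b in betas:
        p_strat *= b ** ell          # (prod beta_i)^ell
    p_indep = t ** (m * ell)         # P(A_t)^N with P(A_t) = t
    return p_strat, p_indep
```

For any t strictly between 0 and 1 the β_i are not all equal, so the stratified probability of a poor record is strictly smaller, in accordance with the theorem.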
It should be noted that the above result does not use asymptotic extreme value theory and is rather general, i.e. valid under nonrestrictive assumptions. Moreover, it follows from the proof of the theorem that the set of continuous functions may be replaced by other, narrower classes of functions.
The following results on the domination of stratified over independent sampling are based on Theorem 4.4.1 together with the next statement.
Lemma 4.4.2. The function Ψ(ℓ, v) determined by formula (4.4.9) is, for each v > 0, strictly increasing in ℓ for ℓ ≥ 1; moreover Ψ(1, v) < 1 and

    lim_{ℓ→∞} Ψ(ℓ, v) = 1.
Proof. We have

    lim_{ℓ→∞} Ψ(ℓ, v) = lim_{ℓ→∞} ℓ^v Γ(ℓ+1)/Γ(ℓ+1+v) = 1

by the asymptotics of the gamma function. Let us finally show that the function Ψ(ℓ, v) is strictly increasing in ℓ, i.e. that for all v > 0, ℓ = 1, 2, ... the inequality Ψ(ℓ+1, v) > Ψ(ℓ, v) holds. Indeed,

    Ψ(ℓ+1, v)/Ψ(ℓ, v) = (1 + 1/ℓ)^v / (1 + v/(ℓ+1)),

    ψ(v) = log [Ψ(ℓ+1, v)/Ψ(ℓ, v)] = v log(1 + 1/ℓ) − log(1 + v/(ℓ+1)),

and ψ′(v) = log(1 + 1/ℓ) − 1/(ℓ+1+v) > 0. Therefore the function ψ(v) is monotone increasing for v > 0 and ψ(0) = 0, whence ψ(v) > 0 for each v > 0. The assertion Ψ(ℓ+1, v) > Ψ(ℓ, v) now follows.
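With Ψ(ℓ, v) = ℓ^v Γ(ℓ+1)/Γ(ℓ+1+v), as in the proof above, the claims of Lemma 4.4.2 are easy to confirm numerically via the log-gamma function:

```python
import math

def psi_factor(ell, v):
    """Psi(ell, v) = ell**v * Gamma(ell+1)/Gamma(ell+1+v), evaluated in
    log space for numerical stability at large ell."""
    return math.exp(v * math.log(ell)
                    + math.lgamma(ell + 1) - math.lgamma(ell + 1 + v))
```

The sequence Ψ(1, v), Ψ(2, v), ... increases strictly from a value below 1 towards the limit 1, matching the statement of the lemma.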
Define the vector of record values

    K[f] = (K^{(0)}[f], ..., K^{(k)}[f])

where K^{(i)}[f] = η_{N−i} and η₁ ≤ ... ≤ η_N are the order statistics corresponding to the sample Y = {y_i = f(x_i), x_i ∈ Ξ}.
For comparing global random search strategies we shall choose the vector criterion (4.4.27), where 0 ≤ k ≤ ℓ−1.
Proposition 4.4.1. If the conditions of Theorem 4.4.1 hold, then the procedure π_m dominates the procedure π₁ with R-probability one.
The proof consists in using Lemma 7.1.2 (proved in Chapter 7) and (4.4.7), from which it follows that with R-probability one the asymptotic relation (4.4.28) holds, and in applying Lemma 4.4.2, according to which the right-hand side of (4.4.28) is less than 1.
From Proposition 4.4.1 it follows that when stratified sampling is applied, the records of the objective function are closer to the value M = max f, and thus more accurate statistical inference for M can be constructed from them. As follows from (4.4.28), the best sampling with respect to the criterion (4.4.27) is the stratified sampling with ℓ = 1, i.e. with the maximal degree of stratification. This finding suggests that decreasing randomness, in general, improves the efficiency of a global random search procedure.
Other quality criteria for global random search procedures π are possible: let us give three further such criteria and obtain for them results analogous to Proposition 4.4.1.
As K[f] we now choose a consistent linear estimator of the parameter M, constructed on the basis of the k+1 (0 ≤ k ≤ ℓ−1) maximal order statistics η_N, ..., η_{N−k} with fixed coefficients a₀, ..., a_k (satisfying the condition (4.4.19)), i.e.

    K[f] = Σ_{i=0}^{k} a_i η_{N−i}.     (4.4.29)
(4.4.30)

then, by virtue of Lemma 7.1.5 and (4.4.17), under the conditions of Theorem 4.4.1 and the unbiasedness condition (4.4.19), with R-probability 1, the asymptotic relation

Then, by virtue of (4.2.17) and (4.4.20), under the conditions of Theorem 4.4.1, there holds

m → ∞,

with R-probability one. Due to Lemma 4.4.2, Φ(ℓ, 2/α) < 1 and thus the procedure π_m dominates the procedure π_1 according to this criterion.
Consider, finally, as K[ℓ] the confidence interval of confidence level 1 − γ
for M, and its mean length as Φ_ℓ(π). By virtue of Lemma 7.1.2 and (4.4.23), if the conditions of Theorem 4.4.1 are met, then with R-probability 1 the relation

m → ∞,

is valid. Its consequences are identical to those formulated in Proposition 4.4.1 (with the indicated replacement of the criterion Φ_ℓ(π)).
Some results of this subsection (namely, Theorem 4.4.2 and relation (4.4.31)) were formulated in Ermakov et al. (1988), which also contains the following formulation of the result concerning the admissibility of stratified sampling.
Proposition 4.4.2. Let K[ℓ], Φ_ℓ(π) and F be the same as in Theorem 4.4.2, and let Δ be the set of all probability measures on D = X^N, m = N, ℓ = 1. Then the procedure π_N = (K[ℓ], Q_N) corresponding to stratified sampling with maximal stratification number is admissible for the functional class F in the set Π = {π = (K[ℓ], Q), Q ∈ Δ} of all global random search procedures: in other words, there is no procedure π ∈ Π such that π dominates π_N.
The proposition states that the global random search methods using stratified sampling cannot be improved for all functions f ∈ F simultaneously.
This result is similar to that on the admissibility of Monte Carlo estimates of integrals, see Ermakov (1975). For brevity, we shall not give a complete proof of the above proposition but present only its main ideas.
The proof of Proposition 4.4.2 uses the proof of Theorem 4.4.2 and consists of two stages. The first stage shows that the distribution Q(dΞ) of a procedure π that could dominate π_N has uniform marginal distributions of the components of the vector Ξ, and that this distribution may be chosen symmetric. At the second stage it is deduced that if there exists f ∈ F such that, for a symmetric Q(dΞ) with uniform marginal distributions of the components, the probability Pr{K[ℓ] ≥ t} is strictly larger than that for ℓ = 1, m = N and some t, then there exists g ∈ F such that the reverse strict inequality holds for each t ∈ (min g, max g). For details, see Ermakov et al. (1988).
174 Chapter 4
is the region of attraction of x_i*. The value θ_i will be referred to as the share of the i-th local maximizer x_i* (with respect to the measure P). It is evident that

θ_i > 0 for i = 1, ..., ℓ  and  Σ_{i=1}^{ℓ} θ_i = 1,

and the random vector (N_1, ..., N_ℓ) follows the multinomial distribution
where the multinomial coefficient is

(N; n_1, ..., n_ℓ) = N! / (n_1! ··· n_ℓ!),   Σ_{i=1}^{ℓ} n_i = N,   n_i ≥ 0  (i = 1, ..., ℓ).
The problem is to draw statistical inference on the number ℓ of local maximizers, the parameter vector θ = (θ_1, ..., θ_ℓ), and the number N* of trials that guarantees with a given probability that all local maximizers are found.
A main difficulty is that ℓ is usually not known. If an upper bound for ℓ is known, then standard statistical methods can be applied; the opposite case is more difficult, and the Bayesian approach is applied.
Let U be a known upper bound for ℓ and N ≥ U. Then (N_1/N, ..., N_ℓ/N) is the standard minimum variance unbiased estimator of θ, where N_i/N estimates the share of the i-th local maximizer x_i*. Of course, for any N and ℓ > 1 it may happen that N_i = 0 but θ_i > 0. So the above estimator nondegenerately estimates only those shares θ_i for which N_i > 0.
Let W be the number of the N_i which are strictly positive. Then for given ℓ and θ = (θ_1, ..., θ_ℓ) we have
For instance, the probability that the local searches will lead to a single local maximizer equals

Pr{W = 1 | θ} = Σ_{i=1}^{ℓ} θ_i^N,
furthermore, the probability that all local maximizers will be found equals

Pr{W = ℓ | θ} = Σ_{n_1+···+n_ℓ=N, n_i>0} (N; n_1, ..., n_ℓ) θ_1^{n_1} ··· θ_ℓ^{n_ℓ}.    (4.5.1)
The probability (4.5.1) is small if at least one of the θ_i is small, even for large N. On the other hand, for any ℓ and θ we can find N* such that for a given q ∈ (0,1) we will have Pr{W = ℓ | θ} ≥ q for all N ≥ N*. Finding N* = N*(q, θ) means finding the (minimal) number of points in Ξ such that the probability that each local maximizer is found is at least q.
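The behaviour of W is easy to check by simulation. A small sketch (the values of ℓ, θ and N are illustrative): draw (N_1, ..., N_ℓ) from the multinomial distribution, count the strictly positive components, and compare the estimated Pr{W = ℓ | θ} with the exact value 1 − 2^{1−N} available for ℓ = 2, θ = (1/2, 1/2).

```python
import random

def simulate_all_found(theta, N, reps, seed):
    # Monte Carlo estimate of Pr{W = l}: every region of attraction is hit
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        counts = [0] * len(theta)
        for _ in range(N):            # each trial lands in region i w.p. theta[i]
            u, acc = rng.random(), 0.0
            for i, t in enumerate(theta):
                acc += t
                if u < acc:
                    counts[i] += 1
                    break
        hits += all(c > 0 for c in counts)
    return hits / reps

theta, N = [0.5, 0.5], 6
est = simulate_all_found(theta, N, reps=20000, seed=0)
exact = 1 - 2 ** (1 - N)              # Pr{W = 2 | theta} for l = 2, theta = (1/2, 1/2)
assert abs(est - exact) < 0.01
```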
Set
Hence the problem of finding N*(q, θ) is reduced to that of finding N*(q, θ*), where θ* = (ℓ^{−1}, ..., ℓ^{−1}). The latter is easy to solve for large N, since

Pr{W = ℓ | θ*} = Σ_{i=0}^{ℓ} (−1)^i C_ℓ^i (1 − i/ℓ)^N ≈ exp{−ℓ exp{−N/ℓ}},   N → ∞.

Solving the inequality exp(−ℓ exp(−N/ℓ)) ≥ q with respect to N, we find that N*(q, θ*) ≅ ℓ log(ℓ / log(1/q)). For instance, for q = 0.9 and ℓ = 2, 5, 10, 100 and 1000 the values of N*(q, θ*) are equal to 6, 19, 46, 686 and 9159, respectively. (Of course, N*(q, θ*) is greater than ℓ.)
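The threshold can be computed directly from the asymptotic inequality exp(−ℓ exp(−N/ℓ)) ≥ q. A short sketch (taking the ceiling of the non-integer solution is our rounding choice; for ℓ = 5 it gives 20 rather than the quoted 19, a difference of one rounding unit):

```python
import math

def n_star(q, l):
    # smallest integer N with exp(-l * exp(-N / l)) >= q (asymptotic criterion),
    # i.e. N >= l * log(l / log(1/q))
    return math.ceil(l * math.log(l / math.log(1 / q)))

assert n_star(0.9, 2) == 6
assert n_star(0.9, 10) == 46
assert n_star(0.9, 100) == 686
assert n_star(0.9, 1000) == 9159
```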
Let there be given the prior probabilities α_j (j = 1, 2, ...) of the events that the number ℓ of local maximizers of f equals j, together with conditional prior measures λ_j(dθ_j) for the parameter vector θ_j = (θ_1, ..., θ_j) under the condition ℓ = j. We shall assume that the measures λ_j(dθ_j) are uniform on the simplices
{(θ_1, ..., θ_j): θ_i > 0, i = 1, ..., j, Σ_{i=1}^{j} θ_i = 1}.
Thus, the parameter set Θ, on which the unknown parameter vector θ = (θ_1, ..., θ_ℓ) can take its values, has the form

Θ = ∪_{j=1}^{∞} Θ_j    (4.5.2)
where
Applying a quadratic loss function, the optimal Bayesian estimate for the total P-measure of the domains of attraction of the hidden ℓ − W local maximizers (i.e. of the sum of the θ_i corresponding to the hidden maximizers) is given by
The optimal Bayesian procedure for testing the hypothesis H_0: ℓ = W against the alternative H_1: ℓ > W is constructed in a similar way. According to it, H_0 is accepted if
otherwise H_0 is rejected. Here c_{01} is the loss arising from accepting H_0 when H_1 is valid, and c_{10} is the analogous loss due to accepting the false hypothesis H_1.
The above statistical procedures were numerically investigated in Betrò and Zielinski (1987). In some works (see Zielinski (1981), Boender and Zielinski (1985), Boender and Rinnooy Kan (1987)) the procedures were thoroughly investigated and also modified for the case of equal prior probabilities α_j (i.e. for the case α_j = 1, j = 1, 2, ...). They are not presented here, since the equal prior probabilities assumption contradicts the finiteness of the measure (4.5.2) and is somewhat peculiar in the global optimization context (for instance, according to it, the prior probabilities of f having 2 or 10^10 local maximizers coincide). Instead, we shall present below the following result.
θ* = max_{1≤j≤ℓ} θ_j
hold, where θ* is the share of the global maximizer, and let the prior distribution for θ* be a beta distribution. Then the inequality

holds for the probability of the event that the record value M_N* in the above described random multistart method equals M = max f; here r is the number of sample points x_i ∈ Ξ falling into the region of attraction of the maximizer with function value M_N*.
under the supposition that this c.d.f. belongs to a subclass of the class of cumulative distribution functions neutral to the right, with a view to applications in global random search theory: essentially, the basis of the approach is a specific parametrization of the c.d.f. (4.6.1).
there exist nonnegative independent random variables q_1, ..., q_m such that the random vector (F(t_1), ..., F(t_m)) is distributed as
Some properties of c.d.f.'s neutral to the right are presented below without proofs.
P1. If F(t) is neutral to the right, then the normalized increments

are mutually independent for each collection (4.6.2) such that F(t_{m−1}) < 1.
P2. A wide class of c.d.f.'s can be approximated by c.d.f.'s neutral to the right.
P3. The posterior c.d.f. of a c.d.f. neutral to the right is itself neutral to the right.
P4. Most c.d.f.'s F(t) neutral to the right are such that the posterior c.d.f. F(t | Y), given a sample Y = {y_1, ..., y_N} from F, depends not only on the number N_A of observations that fall into a set A but also on where they fall within or outside A.
P5. A random c.d.f. F(t) is neutral to the right if and only if it has the same probability distribution as 1 − exp{−ξ(t)} for some a.s. nondecreasing, right-continuous random process ξ(t) with independent increments and with
Within the class of c.d.f.'s neutral to the right, apart from rather trivial cases, only the so-called Dirichlet processes do not possess the property P4. For this reason, the posterior distributions of these processes are easy to handle, and they are the more widely used in applications (for instance, in some discrete problems generated by global random search, see Betrò and Vercellis (1986), Betrò and Schoen (1987)). But the Dirichlet processes are too simple and thus unable to approximate the c.d.f. (4.6.1) accurately enough. To this end, another subclass T of the c.d.f.'s neutral to the right will be considered.
A neutral to the right c.d.f. F(t) is an element of T if the corresponding random
process
ν > 0,    (4.6.4)
(4.6.5)
for the moments of a c.d.f. F(t) neutral to the right. For F ∈ T, together with (4.6.5), this gives the expression

E{[1 − F(t)]^m} = [λ/(λ + m)]^{γ(t)},   m = 1, 2, ....    (4.6.6)

In particular,

1 − β(t) = (λ/(λ + 1))^{γ(t)}

from (4.6.6) with m = 1. This yields the representation

γ(t) = log(1 − β(t)) / log(λ/(λ + 1))    (4.6.7)

for γ(t).
In order to see the role of the parameter λ, consider the variance of 1 − F(t), which is represented as

var(1 − F(t)) = (1 − β(t))^{g(λ)} − (1 − β(t))^2    (4.6.8)
Proposition 4.6.1. Let F ∈ T, and let η_1, ..., η_N be the order statistics corresponding to an independent sample Y = {y_1, ..., y_N} from F. Set m_j = N + λ − j, η_0 = −∞.
If the moment generating function of the process (4.6.3) has the form (4.6.5) and γ(t) is continuous at the points y_1, ..., y_N, then the moment generating function corresponding to the posterior distribution equals
(4.6.9)
(4.6.10)
thus (4.6.9) gives the analytical form of the characteristic function of the posterior distribution of the gamma process ξ(t) = −log(1 − F(t)). In this way one can obtain the posterior distribution of F by numerical evaluation of a Fourier integral. Once the posterior distribution of F(t) given the sample is known, testing statistical hypotheses about F can be performed in the framework of Bayesian statistics. We shall now describe how statistical hypotheses about quantiles of F are tested.
It is shown here that in a natural setup the problem of testing a statistical hypothesis about a random c.d.f. quantile is reduced to a single computation of the posterior probability.
Let F(t) be a random c.d.f., t_p the p-th quantile of F, Y = {y_1, ..., y_N} an independent sample from F, and t* a given constant. The problem is to test the hypothesis H_0: t_p ≤ t*, which can be rewritten as H_0: F(t*) ≥ p. Let d(Y) be a decision function assuming two values d_0 and d_1, corresponding to acceptance and rejection of H_0, and let the losses connected with d_0 and d_1 be
where c_0 and c_1 are given positive values. The posterior mean values of L(F, d_i) are
(4.6.11)
otherwise.
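For the tractable Dirichlet-process subclass mentioned in Section 4.6.1, the posterior of F(t*) given the sample is a Beta distribution, and a Bayes rule of the type (4.6.11) reduces to comparing the posterior probability Pr{F(t*) ≥ p | Y} with a threshold. A minimal sketch under these assumptions (the Dirichlet-process posterior, the integer prior-mass choice and the threshold c_0/(c_0 + c_1) are our illustration, not the book's formula (4.6.11)):

```python
from math import comb

def beta_sf(a, b, p):
    # Pr{X >= p} for X ~ Beta(a, b), integer a, b >= 1 (binomial-tail identity)
    n = a + b - 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(a))

def accept_H0(y, t_star, p, prior_mass, beta0, c0=1.0, c1=1.0):
    # Dirichlet-process posterior of F(t*):
    #   Beta(prior_mass*beta0 + #{y_i <= t*}, prior_mass*(1 - beta0) + #{y_i > t*})
    n_le = sum(1 for v in y if v <= t_star)
    a = int(prior_mass * beta0) + n_le
    b = int(prior_mass * (1 - beta0)) + len(y) - n_le
    pi = beta_sf(a, b, p)                # posterior Pr{F(t*) >= p | Y}
    return pi >= c0 / (c0 + c1)          # Bayes rule for the 0-c0 / 0-c1 losses

# sanity check: Beta(1, 1) is uniform, so Pr{X >= p} = 1 - p
assert abs(beta_sf(1, 1, 0.3) - 0.7) < 1e-12
assert accept_H0([0.2, 0.5, 0.9, 0.4, 0.7], t_star=0.8, p=0.6,
                 prior_mass=2, beta0=0.5)
```

The rule accepts H_0 exactly when the posterior risk of acceptance, c_0 Pr{F(t*) < p | Y}, does not exceed the posterior risk of rejection, c_1 Pr{F(t*) ≥ p | Y}.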
It is shown below how the cited result can be applied to controlling the precision of some simple global random search algorithms (like Algorithm 3.1.1).
Let P be the uniform measure on X, Ξ = {x_1, ..., x_N} an independent sample from P, f a bounded measurable function, M = vrai sup f, f* the record value, and let the maximization problem for f be stated in terms of obtaining a point from A(ε) = {x: M − f(x) ≤ ε}, where ε is a given value. Note that
(4.6.12)
The statistical problem is to test the hypothesis H_0: f* ∈ A(ε), which can be rewritten in the forms H_0: F(f*) ≥ 1 − ε and H_0: f* ≥ f_{1−ε}, where f_{1−ε} is the (1 − ε)-quantile of the c.d.f. F(t) defined by (4.6.1).
Under the assumption that F is a random c.d.f., the decision rule (4.6.11) can be used to test the hypothesis H_0 (it is natural to choose e.g. c_0 = c_1 = 1). In order to apply the results of Section 4.6.1, we suppose that F ∈ T and consider the problem of choosing the parameter λ and the function γ(t) that determine the gamma process ξ(t) = −log(1 − F(t)). Using (4.6.7) we obtain for each t ∈ R, δ ∈ (0,1)

where

Γ(u, v) = ∫_0^u t^{v−1} e^{−t} dt / Γ(v)

is the normalized incomplete gamma function.
Before drawing statistical inference we require the following prior information: for some pair α, δ ∈ (0,1), let a value f_{δ,α} be given such that

(4.6.14)

It may happen that (4.6.15) is unsolvable. In this case either (4.6.14) or β should be modified. The properties of the incomplete gamma function imply that the equation (4.6.15) has a solution if 1 − β(f_{δ,α}) ≥ δ(1 − α).
As for β(t), it is natural to set it in a parametric form, with subsequent estimation of the parameters on the basis of the empirical c.d.f.
For instance, if the parameters of β are the mean μ and the variance σ², i.e.

(4.6.16)

then μ may be estimated by

μ̂ = N^{−1} Σ_{i=1}^{N} y_i,    (4.6.17)
Algorithm 4.6.1.
1. Assume that the prior c.d.f. β(t) for F(t) has the form (4.6.16).
2. For some α, δ ∈ (0,1), take f_{δ,α} satisfying (4.6.14).
3. Obtain a sample Ξ = {x_1, ..., x_N} by sampling a distribution P.
4. Evaluate y_i = f(x_i) for i = 1, ..., N.
5. Estimate μ and σ by (4.6.17).
6. Find λ by solving (4.6.15). If the equation (4.6.15) has no solution, then increase σ until it becomes solvable.
7. Determine γ(t) from (4.6.7).
8. Obtain f* (for instance, set f* = η_N = max{y_1, ..., y_N}).
9. Estimate the value Pr{F(f*) ≥ 1 − ε | Y} by numerical evaluation of a Fourier integral from the characteristic function determined by (4.6.9) and (4.6.10).
10. Test the hypothesis H_0: f* ∈ A(ε) by (4.6.11).
11. If H_0 is accepted, then terminate the algorithm; otherwise sample the distribution P a further N_0 times and return to Step 4, substituting N + N_0 for N (N_0 is some fixed natural number).
The complexity of Algorithm 4.6.1 and its modifications is rather high, since one needs to solve the complicated equation (4.6.15) several times and to compute values of the c.d.f. via the characteristic function. Hence, if f is easy to evaluate, it is unprofitable to use the presented approach.
A natural way of applying the above described procedures for testing statistical hypotheses concerning quantiles of c.d.f.'s neutral to the right consists of using them in branch and probability bound methods for evaluating a prospectiveness criterion. To construct the prospectiveness criterion, one can apply Steps 1-10 of Algorithm 4.6.1, substituting Z for X and the record value of f in Z for f*. Acceptance of the hypothesis H_0: f* ≥ f_{1−ε} for ε ≈ 0 corresponds to the decision that Z does not contain a global maximizer of f, i.e. Z is unpromising for further search.
CHAPTER 5. METHODS OF GENERATIONS
The methods studied in this chapter consist of sequential multiple sampling of probability distributions asymptotically concentrated in a vicinity of the global optimizer. These methods form the essence of numerous heuristic global random search algorithms and can be regarded as a generalization of simulated annealing type methods, in the sense that groups of points are transformed into groups, rather than points into points.
The methods of generations are rather simple to implement, but are not very efficient for solving easy global optimization problems. Nevertheless, numerical results demonstrate that they can be applied even to very complicated problems (the author used them for solving some location problems in which the number of variables exceeded 100).
Many methods of generations are suitable for the case when the objective function evaluations are subject to random noise: this is the case generally considered in this chapter. For convenience, we shall suppose here (as in the preceding chapter) that the maximization problem for f is considered. Besides, the condition X ⊂ R^n is relaxed in this chapter, and the feasible region X is supposed to be a compact metric space of arbitrary kind.
The theoretical study of the methods of generations is the main aim of the present
chapter which is divided into four sections. Section 5.1 describes some approaches to
algorithm construction and formulates the basic model. Section 5.2 investigates the
convergence and the rate of convergence of the sequences of probability measures
generated by the basic model. Section 5.3 studies homogeneous variants of the methods:
its results are closely connected with the theory of Monte Carlo methods. Finally, Section
5.4 modifies the main methods of generations in a sequential fashion and investigates the
convergence of these modifications.
5.1.1 Algorithms
Beginning in the late 1960's, many authors suggested heuristic global random search algorithms based on the following three ideas: (i) new points at which to evaluate f are determined mostly not far from some of the best previous points; (ii) the number of new points in the vicinity of a previously obtained point should depend on the function value at this point; (iii) it is natural to decrease the search span while approaching a global optimizer. Such algorithms are described e.g. in McMurtry and Fu (1966), Rastrigin (1968), Bekey and Ung (1974), Ermakov and Mitioglova (1977), Ermakov and Zhigljavsky (1983), Zhigljavsky (1981, 1987), Price (1983, 1987), Masri et al. (1980), Pronzato et al. (1984) and in many other works.
A general algorithm relying upon the above ideas is as follows.
Algorithm 5.1.1.
1. Sample a distribution P_1 some number N_1 of times, obtaining the points x_1^(1), ..., x_{N_1}^(1); set k = 1.
Methods of Generations 187
2. From the points x_j^(i) (j = 1, ..., N_i; i = 1, ..., k) choose ℓ points x_1*(k), ..., x_ℓ*(k) having the greatest values of f.
3. Determine the natural numbers n_{kj} (j = 1, ..., ℓ) applying some rule.
4. Sample n_{kj} times the distributions Q_k(x_j*(k), dx) for all j = 1, 2, ..., ℓ, thus obtaining the points

x_1^(k+1), ..., x_{N_{k+1}}^(k+1).    (5.1.1)
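A minimal sketch of this scheme on X = [0, 1] (the Gaussian transition kernel Q_k, the equal split of the n_kj in Step 3 and the 1/k span schedule are illustrative choices, not prescribed by the algorithm):

```python
import random

def generations_max(f, n_points=30, n_best=5, n_iter=40, span0=0.5, seed=0):
    rng = random.Random(seed)
    archive = [rng.random() for _ in range(n_points)]         # Step 1: sample P1
    for k in range(1, n_iter + 1):
        best = sorted(archive, key=f, reverse=True)[:n_best]  # Step 2: l best so far
        n_kj = n_points // n_best                             # Step 3: a simple rule
        span = span0 / k                                      # decreasing search span
        archive += [min(1.0, max(0.0, x + rng.gauss(0.0, span)))   # Step 4: Q_k
                    for x in best for _ in range(n_kj)]
    return max(archive, key=f)

f = lambda x: -(x - 0.3) ** 2            # unique global maximizer at x* = 0.3
x_best = generations_max(f)
assert abs(x_best - 0.3) < 0.05
```

Because Step 2 selects from all points evaluated so far, the record can never be lost; the shrinking span then concentrates new generations around the maximizer.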
Algorithm 5.1.1 becomes a special case of Algorithm 3.1.5 (i.e. of the general scheme of global random search algorithms) if n_{kj} of Algorithm 5.1.1 is used in the latter as N_k.
The law of search span decrease is equivalent to defining the transition probabilities Q_k(z,·) so as to meet Q_{k+1}(z, B(z,ε)) ≥ Q_k(z, B(z,ε)) for all k ≥ 1, z ∈ X, ε > 0. The particular choice of the rate of this decrease depends on the prior information about f, on the magnitudes of N_k, and on the accuracy requirements of extremum determination. As was established in Section 3.2, if one does not want to miss the global optimizer, then the span must decrease slowly. For the sake of simplicity, the sampling algorithm for Q_k(z,·) is often defined as follows: a uniform distribution on X is sampled with a small probability p_k ≥ 0, and a uniform distribution on a ball or a cube (their volume depending on k) with centre at z is sampled with probability 1 − p_k. In this case the condition
P_{k+1}(dx) = Σ_{j=1}^{N_k} p_j^(k) Q_k(x_j^(k), dx),    (5.1.2)

where

p_j^(k) = f_k(x_j^(k)) / Σ_{i=1}^{N_k} f_k(x_i^(k)).
The distribution (5.1.2) is sampled by the superposition method: first the discrete distribution

(5.1.3)

is sampled, and then the distribution Q_k(x_j^(k),·), if x_j^(k) is the realization obtained.
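The superposition step can be sketched as follows (the Gaussian kernel standing in for Q_k is an illustrative choice; the weights follow (5.1.2) with f_k = f, assumed positive):

```python
import random

def sample_next_generation(points, f, span, n_new, rng):
    # superposition sampling of (5.1.2): draw an index from the discrete
    # distribution p_j = f(x_j) / sum_i f(x_i)  (5.1.3), then sample the
    # kernel Q_k(x_j, dx), here a Gaussian with standard deviation `span`
    weights = [f(x) for x in points]     # requires f > 0 on the sample
    chosen = rng.choices(points, weights=weights, k=n_new)
    return [x + rng.gauss(0.0, span) for x in chosen]

rng = random.Random(7)
new = sample_next_generation([0.1, 0.5, 0.9], lambda x: 1.0 + x, 0.05, 200, rng)
assert len(new) == 200
# points with larger f-values spawn more offspring on average
assert sum(1 for x in new if x > 0.7) > sum(1 for x in new if x < 0.3)
```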
Since the functions f_k are arbitrary, they may be chosen in such a way that the mean values E n_{kj} of the numbers n_{kj} of Algorithm 5.1.2 correspond to the numbers n_{kj} of Algorithm 5.1.1. Allowing for this fact and for the procedure of determining quasi-random points with distribution (5.1.3), one can see that Algorithm 5.1.1 is a special case of Algorithm 5.1.2.
In theoretical studies of Algorithm 5.1.2 (more precisely, of its generalization Algorithm 5.1.4) it will be assumed that the discrete distribution (5.1.3) is sampled in the standard way, i.e. independent realizations of a random variable with this distribution are generated. In practical calculations it is more advantageous to generate quasi-random points from this distribution by means of the following procedure, well known in regression design theory (see Ermakov and Zhigljavsky (1987)) as the procedure for constructing exact designs from approximate ones. (Note that this is also a simple variant of the main part extraction method used for reducing the variance of Monte Carlo estimates in the calculation of integrals, Ermakov (1975).) Set r_j^(k) equal to the greatest integer in N p_j^(k),

N^(k) = Σ_{j=1}^{N_k} r_j^(k).
Then

where

Instead of N_k-fold sampling of (5.1.3), one can sample E_k^(2) a total of N_k − N^(k) times and choose r_j^(k) times the points x_j^(k) for j = 1, ..., N_k. The above procedure reduces the indeterminacy in the selection of the points x_j^(k) in whose vicinity the next iteration points are chosen according to Q_k(x_j^(k),·). When this procedure is used, these points include some of the best points of the preceding iteration with probability one. Besides, the procedure is independent of the method of determining p_j^(k) (and is, therefore, usable in the case when evaluations of f are subject to random noise).
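The construction can be sketched as follows: each point x_j is taken deterministically r_j = ⌊N p_j⌋ times, and the remaining N − Σ r_j points are drawn from the residual weights N p_j − r_j (a sketch; the residual-weight form of E_k^(2) is our reading of the procedure):

```python
import math
import random

def exact_design_sample(points, probs, N, rng):
    # deterministic part: x_j repeated r_j = floor(N * p_j) times;
    # random part: N - sum(r_j) draws from the residual weights N*p_j - r_j
    r = [math.floor(N * p) for p in probs]
    sample = [x for x, rj in zip(points, r) for _ in range(rj)]
    residual = [N * p - rj for p, rj in zip(probs, r)]
    if len(sample) < N:
        sample += rng.choices(points, weights=residual, k=N - len(sample))
    return sample

rng = random.Random(3)
s = exact_design_sample(['a', 'b', 'c'], [0.5, 0.25, 0.25], 8, rng)
assert s == ['a'] * 4 + ['b'] * 2 + ['c'] * 2   # N*p_j integral: fully deterministic
s2 = exact_design_sample(['a', 'b', 'c'], [0.5, 0.25, 0.25], 10, rng)
assert len(s2) == 10 and s2.count('a') == 5     # one residual point goes to b or c
```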
The quality of the variants of Algorithm 5.1.2 greatly depends on the choice of f_k, which should reflect the properties of f (e.g. be on the average greater where f is great and smaller where f is small). Their construction should be based on some technique of objective function estimation or on a technique of extracting and using information about the objective function during the search. Various estimates f̂_k of f can be used as f_k (in this case
Algorithm 5.1.3.
1. Choose a distribution P_1, set k = 1.
2. Sample N_k times the distribution P_k, obtaining the points x_1^(k), ..., x_{N_k}^(k). Evaluate f at these points. If
190 Chapter 5
(5.1.4)

f^k(x) μ(dx) / ∫ f^k(z) μ(dz)

and, therefore, weakly converge for k → ∞ to the distribution concentrated at the global maximizer of f.
Stopping rules for these or similar algorithms may be constructed following the recommendations of Chapter 4 (termination takes place when the desired accuracy is reached); the simplest rule (a prescribed number of iterations is executed) can be chosen as well.
Algorithms 5.1.1 - 5.1.3 (as well as Algorithm 5.1.4 presented below) will be called methods of generations. This name originates from the fact that these algorithms are analogous to, or direct generalizations of, Algorithms 5.3.3 and 5.3.4, for which methods of generations is the standard terminology in the theory of Monte Carlo methods.
The algorithm presented below is a generalization of Algorithm 5.1.3 to the case when the f_k are evaluated with random noise: it is a mathematical model of Algorithms 5.1.1 through 5.1.3 and their modifications.
Algorithm 5.1.4.
Measures P_{k+1}(dx), k = 1, 2, ..., defined through (5.1.2), are the distributions of the random points x_j^(k+1) conditioned on the results of the preceding evaluations of f. We shall study in this chapter their unconditional (average) distributions, which will be denoted by P(k+1, N_k; dx). Obviously, the unconditional distribution of x_j^(1) is P_1(dx) = P(1, N_0; dx).
5.2.1 Assumptions
For the sake of convenience, the assumptions used in this chapter and their explanation are
formulated separately. Assume that
(a) ξ_k(x) for any x ∈ X and k = 1, 2, ... are random variables with a zero-expectation distribution F_k(x, dξ) concentrated on a finite interval [−d, d]; the random variables ξ_{k_1}(x_1), ξ_{k_2}(x_2), ... are mutually independent for any k_1, k_2, ... and x_1, x_2, ... from X;
(b) y_k(x) = f_k(x) + ξ_k(x) ≥ c_1 > 0 with probability one for all x ∈ X, k = 1, 2, ...;
(c) 0 < c_1 ≤ f_k(x) ≤ M_k = sup f_k(x) ≤ c < ∞ for all x ∈ X, k = 1, 2, ...;
(d) the sequence of functions f_k(x) converges to f(x) for k → ∞ uniformly in x;
(e) Q_k(z, dx) = q_k(z, x) μ(dx), z, x ∈ X;
P_M(dx_1, ..., dx_M) = ∫_{Z^N} Π(dΘ_N) a(Θ_N) Π_{j=1}^{M} A(z_j, ξ_j, dx_j)    (5.2.1)

where

Z = X × [−d, d],
(h) the global maximizer x* of f is unique, and there exists ε > 0 such that f is continuous on the set B(x*, ε) = B(ε);
(i) μ is a probability measure on (X, B) such that μ(B(ε)) > 0 for any ε > 0;
(j) there exists ε_0 > 0 such that the sets A(ε) = {x ∈ X: f(x*) − f(x) ≤ ε} are connected for any ε, 0 < ε ≤ ε_0;
(k) for any x ∈ X and k → ∞ the sequence of probability measures Q_k(x, dz) weakly converges to E_x(dz), the probability measure concentrated at the point x;
(l) for any x ∈ X and k → ∞ the sequence of probability measures R(k, N_k, x; dz) weakly converges to E_x(dz);
(m) for any ε > 0 there are δ > 0 and a natural number k_0 such that P_k(B(ε)) ≥ δ for all k ≥ k_0;
(n) for any ε > 0 there are δ > 0 and a natural number k_0 such that P(k, N_{k−1}; B(ε)) ≥ δ for all k ≥ k_0;
(o) the functions f_k (k = 1, 2, ...) are evaluated without random noise;
(p) the transition probabilities Q_k(x,·) are defined by

Q_k(x, A) = ∫ 1[z ∈ A, f_k(x) ≤ f_k(z)] T_k(x, dz) + 1_A(x) ∫ 1[f_k(z) < f_k(x)] T_k(x, dz)    (5.2.2)

where T_k(x, dz) are transition probabilities weakly converging to E_x(dz) for k → ∞ and for all x ∈ X;
(q) P_1(B(x, ε)) > 0 for all ε > 0, x ∈ X;
(r) the transition probabilities Q_k(x, dz) are defined by

(5.2.3)
in a form where (b) is fulfilled. Indeed, if (b) is not met for a regression function h, then one can determine a_k ≥ 0 in such a way that the probability of the event {sup |ξ_k(x)| ≤ a_k} is equal or almost equal to 1 and, instead of y_k(x), compute

where

ξ̃_k(x) = ξ_k(x) if y_k(x) − y_k(x_0) + 2a_k ≥ c_1, and ξ̃_k(x) = 0 otherwise,

and x_0 is an arbitrary point from X. In this case the regression function is not h(x) but a function that can be made arbitrarily close to max{c_1, h(x) + const} by an appropriate choice of a_k.
The assumption (c), whose major part is a corollary of (b), will be used for convenience in some formulations.
The assumptions (h), (i) and (j) are natural and non-restrictive. The uniqueness requirement concerning x* is imposed in order to simplify the formulations. This requirement may be relaxed: considering the results presented below, one can see that, in fact, one deals with convergence to some distribution concentrated on the set of global maximizers rather than with convergence to E_{x*}(dx); therefore, if the unique maximizer requirement is dropped, then convergence can be understood in this sense.
Conditions (e), (k) and (l) formulate necessary requirements on the parameters of Algorithm 5.1.4 that may always be satisfied.
Assumptions (f), (g) and (s) are not requirements but only auxiliary tools for formulating Lemma 5.2.1. For Theorem 5.2.1 a similar role is played by (m) and (n), which can also be regarded as conditions imposed on the parameters of Algorithm 5.1.4. They are not constructive, however, and therefore easily verifiable conditions sufficient for the validity of (m) or (n) are of interest. The conditions (p), (q), (r) serve these aims for two widely used forms of transition probabilities. Obtaining a realization y_k of the random element with the distribution Q_k(x, dy_k) defined through (5.2.2) implies that first one has to determine a realization ζ_k of the random element with distribution T_k(x, dζ_k) and then set

y_k = ζ_k if f_k(ζ_k) ≥ f_k(x), and y_k = x otherwise.
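Sampling from Q_k as defined by (5.2.2) is thus a single proposal and acceptance step, along which f_k never decreases. A minimal sketch (a Gaussian T_k on the real line is an illustrative choice):

```python
import random

def sample_Qk(x, f_k, span, rng):
    # one draw from Q_k(x, .) in (5.2.2): propose z from T_k(x, dz),
    # accept it only if f_k does not decrease, otherwise stay at x
    z = x + rng.gauss(0.0, span)         # proposal from T_k
    return z if f_k(z) >= f_k(x) else x

rng = random.Random(11)
f = lambda x: -abs(x - 2.0)              # unique maximizer at x* = 2
x = 0.0
for _ in range(500):
    x = sample_Qk(x, f, 0.2, rng)        # f(x) is nondecreasing along the chain
assert abs(x - 2.0) < 0.3
```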
The two lemmas below are more than mere stages in the proof of Theorem 5.2.1: Lemma 5.2.1 will be used in Sections 5.3 and 5.4, and the statement of Lemma 5.2.2 contains very weak conditions sufficient for weak convergence of the distribution sequence (5.2.9) to E*(·) = E_{x*}(·), which are of independent interest.
Lemma 5.2.1. Let the assumptions (a), (b), (c), (e), (f), (g) and (s) be fulfilled. Then
1. the random variables with the distribution P_M(dx_1, ..., dx_M) are symmetrically dependent;
2. the marginal distribution P_M(dx) = P_M(dx, X, ..., X) is representable as

(5.2.4)

where R_N(dz) = R_N(dz, X, ..., X), and the signed measures Δ_N converge to zero in variation for N → ∞ with the rate N^{−1/2}, i.e. var(Δ_N) = O(N^{−1/2}), N → ∞.
Proof. The first statement follows from (f) and (g) and from the definition of symmetric dependence. Let us represent the marginal distribution P_M(dx) as follows:

P_M(dx) = ∫_{Z^N} Π(dΘ_N) a(Θ_N) Σ_{i=1}^{N} A(z_i, ξ_i, dx) = Σ_{i=1}^{N} ∫_{Z^N} Π(dΘ_N) a(Θ_N) A(z_i, ξ_i, dx) =
where
In order to prove this, we show that for any δ > 0 and x ∈ X there exists N* = N*(δ, x) such that for N ≥ N* there holds

(5.2.6)

The symmetric dependence of the random variables y(x_i) = f(x_i) + ξ(x_i) (i = 1, ..., N) follows from the symmetric dependence of the random elements x_1, ..., x_N and condition (a). In virtue of the above and Loève (1963), Sec. 29.4, the random variables

converge in mean for N → ∞ to some random variable β independent of all β_i, y(x_i), i = 1, 2, ..., and
This can be formulated as follows: for any δ_1 > 0 there exists N* ≥ 1 such that E|β_N − β| < δ_1 for N ≥ N*. Denote ψ = (f(x_1) + ξ(x_1)) q(x_1, x). Exploiting the independence of β from β_N and ψ, the conditions (a), (b), (c) and (e) for the case (s), and also the fact that

Thus, if one takes δ = δ_1 c_1^2/c, then (5.2.6) will be met for N ≥ N*. Moreover, it follows from the last chain of inequalities that var(Δ_N) ≤ c_1^{−2} E|β − β_N|. From the central limit theorem for symmetrically dependent random variables, see Blum et al. (1958), and the inequality

which is a special case of the inequality given in Loève (1963), Sec. 9.3, it follows that E|β − β_N| = O(N^{−1/2}), N → ∞. Consequently, var(Δ_N) = O(N^{−1/2}), N → ∞. The lemma is proved.
By substituting f_k, N_k, N_{k+1}, P(k, N_{k−1};·), P(k+1, N_k;·), P(k+1, N_k; dx) = P(k+1, N_k; dx, X, ..., X) and Δ(k, N_k;·), respectively, for f, N, M, R_N(·), P_M(·), P_M(dx) and Δ_N(·), and applying Lemma 5.2.1, we obtain the following assertion.
Corollary 5.2.1. Let (a), (b), (c) and (e) be met. Then for any k = 1, 2, ... and N_k = 1, 2, ... the following equality holds for the unconditional distribution of the random elements x_j^(k):

P(k+1, N_k; dx) = [∫ P(k, N_{k−1}; dz) f_k(z)]^{−1} ∫ P(k, N_{k−1}; dz) f_k(z) R(k, N_k, z; dx)    (5.2.7)
where
and for any k = 1, 2, ... the signed measures Δ(k, N_k;·) converge in variation to zero for N_k → ∞ with a rate of order N_k^{−1/2}.
The next corollary follows from the above.
Corollary 5.2.2. Let (a), (b), (c) and (e) be met. Then for any k = 1, 2, ... the sequence of distributions P(k+1, N_k;·) converges in variation for N_k → ∞ to the limit distribution P_k(·) and

(5.2.8)
Lemma 5.2.2. Let the conditions (c), (d), (h), (i) and (j) be met. Then the sequence of distributions

Q_m(dx) = f^m(x) μ(dx) / ∫ f^m(z) μ(dz)    (5.2.9)

weakly converges to E*(dx) for m → ∞.
Proof. Set

B_i = B(x*, ε_i),  D_i = X \ B_i = {x ∈ X: ||x − x*|| ≥ ε_i},  i = 1, ..., 4,  K_1 = sup_{x ∈ D_1} f(x).
Choose an arbitrary value ε_1 > 0. It follows from (h) that for any ε_1 > 0 there exists ε_2, 0 < ε_2 < ε_1, such that

∫_{B_1} (f(x)/K_1)^m μ(dx) > ∫_{B_2} (f(x)/K_1)^m μ(dx) ≥ (K_2/K_1)^m μ(B_2);

we obtain

whence

(5.2.10)
Choose now an arbitrary function ψ(x) that is continuous on X. By the definition of weak convergence, it suffices to demonstrate that

(5.2.11)

For any δ > 0 there exists ε_3 > 0 such that |ψ(x) − ψ(x*)| < δ for all x ∈ B_3. Setting ε_1 = min{ε_1, ε_3} we have

whence the validity of (5.2.11) follows in virtue of (5.2.10). The lemma is proved.
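Lemma 5.2.2 can be illustrated on a finite X, where (5.2.9) is simply a normalized vector of weights f(x)^m μ(x): as m grows, the mass concentrates at the global maximizer. A small numeric sketch (the values of f and μ are illustrative):

```python
def q_m(f_vals, mu, m):
    # discrete form of (5.2.9): Q_m(x) proportional to f(x)^m * mu(x)
    w = [fv**m * mv for fv, mv in zip(f_vals, mu)]
    s = sum(w)
    return [v / s for v in w]

f_vals = [1.0, 1.5, 2.0, 1.9]            # unique maximum 2.0 at index 2
mu = [0.25, 0.25, 0.25, 0.25]            # uniform base measure
mass = [q_m(f_vals, mu, m)[2] for m in (1, 10, 100)]
assert mass[0] < mass[1] < mass[2] > 0.99   # mass concentrates at the maximizer
```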
Below, sufficient conditions are determined for weak convergence of the distribution sequences (5.2.7) and (5.2.8) to E*(dx) for k → ∞.
Theorem 5.2.1. Let the conditions (c) through (e) and (h) through (j) be satisfied, as well as either (k) and (m), or (l) and (n). Then the distribution sequence determined through (5.2.8) (or, respectively, through (5.2.7)) weakly converges to E*(dx) for k → ∞.
Proof. We consider only (5.2.8), because for (5.2.7) the proof is essentially the same, only the formulas are more tedious. Choose from (5.2.8) a weakly convergent subsequence P_{k_i}(dx) (this is possible in virtue of Prokhorov's theorem, see Billingsley (1968)) and denote the limit by ν(dx), where ν stands for a probability measure on (X, B). It follows from (5.2.8) that the subsequence P_{k_i+1}(dx) weakly converges to the distribution Q_1(dx) = c_1 f(x) ν(dx), where c_1 is the normalization constant, and, similarly, P_{k_i+m}(dx) converges to Q_m(dx) of the form (5.2.9). By means of diagonalization one can show that there exists a subsequence P_{k_j}(dx) that converges weakly to E*(dx).
In virtue of Theorem 2.2 of Billingsley (1968), the set of all finite intersections of open balls with centres in a countable, everywhere dense subset of X and with rational radii is a countable convergence-defining class. Extract from this class a subset S consisting of sets of Q_1-continuity, and enumerate its elements: S = {A_j}, j = 1, 2, ....
Fix a monotone sequence of numbers
{Em}"", Em> 0, Em --70 as m --7 00•
m=l
Since P_{k_i+m}(A) → Q_m(A) for any A ∈ S as i → ∞, there exists a subsequence

    R_{1,m}(dx) = P_{k_{i_m}+m}(dx)

for which the inequality |R_{1,m}(A_1) − Q_m(A_1)| < ε_m is valid for all m = 1, 2, ...
Extract in a similar manner from the sequence {R_{1,m}(dx)} a subsequence {R_{2,m}(dx)}
satisfying the analogous inequality for A_2, and so on.
For any A_i ∈ S, i ≤ m, the diagonal subsequence {R_{m,m}(dx)} has the property
|R_{m,m}(A_i) − Q_m(A_i)| < ε_m. This subsequence weakly converges to ε*(dx): indeed, for all
A_i ∈ S,

    |R_{m,m}(A_i) − ε*(A_i)| ≤ |R_{m,m}(A_i) − Q_m(A_i)| + |Q_m(A_i) − ε*(A_i)|.

Here the first term for m ≥ i does not exceed ε_m and therefore approaches zero for m → ∞;
the second term approaches zero by virtue of Lemma 5.2.2, where (m) plays the role of (i).
Thus there is a subsequence P_{k_j}(dx) converging to ε*(dx). It follows from (5.2.8) that
P_{k_j+1}(dx) converges to the same limit, and thus any subsequence of P_k(dx) converges to
this limit. The same holds for the sequence itself. The theorem is proved.
Let us note that all the previously used conditions (with the exception of (m) and (n))
are evident and natural. It is desirable to derive conditions that imply (m) and (n). The
corollaries of Theorem 5.2.1 presented below give sufficient conditions for the
convergence of the distributions to ε*(dx) for the two theoretically most important ways of
choosing the transition probabilities Q_k(z,dx).
Corollary 5.2.3. Let the conditions (c), (d), (e), (h), (i), (j), (o), (p), (q), (t) and also
(k) for the transition probabilities T_k(x,dz) of (5.2.2) be satisfied. Then the sequence of
distributions determined by (5.2.8) weakly converges to ε*(dx) for k → ∞.

Proof. It follows from (q) and (h) that P_1(A(δ)) > 0 for any δ > 0, and from (5.2.2) that
P_k(A(δ)) ≥ ... ≥ P_1(A(δ)) for any δ > 0, k = 2, 3, ..., and therefore (m) is met. All conditions of
Theorem 5.2.1 concerning the sequence (5.2.8) are satisfied: the corollary is proved.
Corollary 5.2.4. Let the conditions (t), (e), (h), (i), (j), (q) and (r) be satisfied. Then
the sequence of distributions determined through (5.2.8) weakly converges to ε*(dx) for
k → ∞.

Proof. Under our assumptions the distributions (5.2.8) have continuous densities with
respect to the Lebesgue measure. Denote them by p_k(x), k = 1, 2, ... It follows from
(5.2.3) that p_k(x) > 0 for any k ≥ 1 and those x ∈ X for which f(x) ≠ 0. Let us show that (m)
is satisfied. Fix δ > 0. It follows from (5.2.8) and the finiteness of the support of φ that for
any k and ε > 0 the following inequality holds

(5.2.12)

where ε_k → 0 are defined in terms of β_k and the size of the support of the density φ,

(5.2.13)
Proof. Repeat the proof of Corollary 5.2.4, changing only (5.2.12) and (5.2.13). Let us
require that N_k be so large that for any k the inequality

    P(k+1, N_k; A(ε+ε_k)) ≥ (1−δ_k) P(k, N_{k−1}; A(ε))

is satisfied instead of (5.2.12), where the numbers δ_k ∈ (0,1) satisfy (5.2.14). To
complete the proof, it remains to exploit the fact that if (5.2.14) is satisfied, then

    ∏_{k=1}^∞ (1−δ_k) > 0.
Let us introduce some notation that will be used throughout this section.

Let X be a compact metric space; B be the σ-algebra of Borel subsets of X; M be the
space of finite signed measures, i.e. regular countably additive functions on B of
bounded variation; M₊ be the set of finite measures on B (M₊ is a cone in the space M);
the probability measures on B form a subset of M₊; C₊(X) be the set of continuous
non-negative functions on X (C₊(X) is a cone in C(X), the space of continuous
functions on X); and C⁺(X) be the set of continuous positive functions on X (C⁺(X) is the
interior of the cone C₊(X)). Let a function K: X × B → R be such that K(·,A) ∈ C₊(X) for
each A ∈ B and K(x,·) ∈ M₊ for each x ∈ X. The analytical form of K may be unknown,
but it is required that for any x ∈ X a method be known for evaluating realizations of a
non-negative random variable y(x) such that

    E y(x) = f(x) = K(x, X),

and for sampling the probability measure Q(x,dz) = K(x,dz)/f(x) for all x ∈ {x ∈ X: f(x) ≠ 0}.
Denote by K the linear integral operator from M to M defined by

    KP(dx) = ∫ P(dz) K(z,dx),        (5.3.1)

and by L the operator on C(X) with

    Lh(x) = ∫ K(x,dz) h(z).        (5.3.2)

As is known from the general theory of linear operators (see Dunford and Schwartz
(1958)), any bounded linear operator mapping a Banach space into C(X) is representable
in the form (5.3.2), and ||L|| = ||K|| = sup f(x). Moreover, the operators K and L are
completely continuous by virtue of the compactness of X and the continuity of K(·,A) for all
A ∈ B.
As is known from the theory of linear operators in a space with a cone, a completely
continuous and strictly positive operator L has an eigenvalue λ that is maximal in modulus,
positive and simple, and at least one eigen-element belonging to the cone corresponds to it;
the conjugate operator L* has the same properties.

In the present case, the operator L is determined by (5.3.2). It is strictly positive
provided that for any non-zero function h ∈ C₊(X) there exists m = m(h) such that
L^m h(·) ∈ C⁺(X), where L^m is the operator with the iterated kernel

    K_m(x,dz) = ∫ ... ∫ K(x,dz_1) K(z_1,dz_2) ... K(z_{m−1},dz).
Thus, if the operator L = K* is strictly positive (which is assumed to be the case), the
maximal in modulus eigenvalue λ of K is simple and positive, and there is a unique eigen-measure P
among the probability measures, defined by

    KP(dx) = λ P(dx),        (5.3.3)

for which

    λ = ∫ f(x) P(dx).        (5.3.4)

It is evident from (5.3.3) and (5.3.4) that if λ ≠ 0, then the necessary and sufficient
condition for P to be the unique (in the set of probability measures) eigen-measure of K is
the following: P is the unique probability-measure solution of the integral equation

    P(dx) = [∫ f(z) P(dz)]⁻¹ ∫ P(dz) K(z,dx).        (5.3.5)
Assume that for any x_1, x_2, ... from X an algorithm is known for evaluating the
random variables ξ(x_1), ξ(x_2), ... that are mutually independent and for any x ∈ X satisfy
E ξ(x) = h(x), var ξ(x) ≤ σ_1² < ∞, where h is some function from C(X).
In the following, algorithms will be constructed and studied for estimating the
functional

    J = ∫ h(x) P(dx)        (5.3.6)

of the probabilistic eigen-measure of the operator K. By virtue of (5.3.4), this problem
includes the estimation of the maximal eigenvalue of the integral operator (5.3.1), known
as the estimation of the critical parameter of a branching process, or the problem of critical
system calculation (see Mikhailov (1966), Khairullin (1980), Kashtanov (1987)); the so-called
method of generations with a constant number of particles was developed for solving this
problem. It finds wide application in practical calculations, and a special technique has
been developed for its study. This method is studied below by the apparatus of
Section 5.2.
Together with its modifications, it will go under the name generation method.

The connection between the problem under consideration and that of searching for the
global extremum of f is two-fold: in addition to the interrelation between the methodology
and technique of algorithm investigation mentioned above, it turns out that the extremal
problems arise from the problems of estimating functionals of eigen-measures as limit
problems. This is discussed in the next section.
Methods of Generations 205
In this section we demonstrate that, in a number of fairly general situations, the problem
of determining the global maximizer of f can be regarded as the limit case of determining
the eigen-measures P of integral operators (5.3.1) with kernels K_β(x,dz) = f(x)Q_β(x,dz),
where the Markovian transition probabilities Q_β(x,dz) weakly converge to ε_x(dz) for
β → 0.

In order to relieve the presentation of unnecessary details, assume that X = R^n, μ = μ_n
and that Q_β(x,dz) are chosen by (5.2.3) with β_k = β, i.e.

    Q_β(x,dz) = φ_β(z − x) μ_n(dz).        (5.3.7)
Lemma 5.3.1. Let the transition probability Q = Q_β have the form (5.3.7), where φ is a
continuously differentiable distribution density on R^n,

    ∫ ||x|| φ(x) μ_n(dx) < ∞,

and let f be positive, satisfy the Lipschitz condition with a constant L, attain the global
maximum at the unique point x*, and f(x) → 0 for ||x|| → ∞. Then for any ε > 0 and δ > 0
there exists β > 0 such that P(B(δ)) ≥ 1 − ε, where P is the probabilistic solution of (5.3.5).
Proof. Multiply (5.3.5) by f, integrate it with respect to X and let β approach zero.
Exploiting the Lipschitz property of f and the inequality from Kantorovich and Akilov
(1977), p. 318, we obtain that the variance of the random variable f(ξ_β) (where ξ_β is a
random vector with distribution P = P_β) tends to zero for β → 0 and, therefore, f(ξ_β)
converges in probability to some constant M. To complete the proof, we shall show that
M = f(x*). Assume the contrary; then there exist c, q > 0 such that P(D_β) > 0 and μ_n(D_β) ≥ c
for all β > 0, where D_β is the corresponding exceptional set,

with ε_β → 0 for β → 0, which follows from (5.3.7): but this contradicts the assumption.
The lemma is proved.
Heuristically, the statement of Lemma 5.3.1 can be illustrated by the following
reasoning. In the case studied, P(dx) has a density p(x) that may be obtained as the limit
(for k → ∞) of the recurrent approximations

    p_{k+1}(x) = s_{k+1} ∫ φ_β(x − z) p_k(z) f(z) μ_n(dz),        (5.3.8)

where s_{k+1} is a normalization constant and

    φ_β(x) = β^{−n} φ(x/β).

Thus (5.3.8) implies that p_{k+1} is a kernel estimator of the density s_{k+1} p_k f, where the
parameter β is called the window width. One can anticipate that for a small β the asymptotic
behaviour of the densities (5.3.8) should not differ very much from that of the distribution
densities (5.2.9), which converge to ε*(dx) by virtue of Lemma 5.2.2.
Numerical calculations have revealed that in problems resembling realistic ones
(for not too badly behaved functions f) the eigen-measures P = P_β indeed tend, for small β,
to concentrate mostly within a small vicinity of the global maximizer x* (or the maximizers).
Moreover, this tendency manifests itself already for not very small β (say, of
the order of 0.2 to 0.3, under the unit covariance matrix of the distribution with the
density φ).
The following example illustrates to some extent the issue of closeness of P(dx) and
ε*(dx).

Example. Let X = R, f = N(a,σ²), i.e. f is the density of the normal distribution with mean
a and variance σ², and let Q(x,dz) be chosen via (5.3.7), where φ = N(0,β²). Now, one can
readily see from (5.3.5) that the normal distribution density with mean a and variance

    β(β + (β² + 4σ²)^{1/2})/2

is the density of P(dx); this variance tends to zero together with β. A similar result holds
for the multidimensional case, since the coordinates may be considered independently,
following an orthogonal transformation.
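The fixed-point structure behind this example can be checked numerically. The sketch below (the values of a, σ, β and the grid are illustrative choices, not taken from the book) runs the recurrence (5.3.8) on a grid, repeatedly smoothing p·f with the kernel φ_β and renormalizing, and compares the resulting variance with the closed-form value β(β + (β² + 4σ²)^{1/2})/2:

```python
import numpy as np

# Power iteration for the eigen-density on a grid: p <- normalize(phi_beta * (p f)),
# with f the N(a, sigma^2) density and phi = N(0, beta^2), as in the example.
a, sigma, beta = 0.0, 1.0, 0.5
x = np.linspace(-6.0, 6.0, 1201)
dx = x[1] - x[0]

f = np.exp(-(x - a) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
phi = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * beta ** 2)) \
      / (beta * np.sqrt(2 * np.pi))            # convolution kernel phi_beta
p = np.full_like(x, 1.0 / (x[-1] - x[0]))      # start from the uniform density

for _ in range(200):
    p = phi @ (p * f * dx)                     # smooth p(z) f(z) dz with phi_beta
    p /= p.sum() * dx                          # renormalize to a probability density

mean = np.sum(x * p) * dx
var = np.sum((x - mean) ** 2 * p) * dx
var_theory = beta * (beta + np.sqrt(beta ** 2 + 4 * sigma ** 2)) / 2
print(mean, var, var_theory)
```

For β = 0.5 and σ = 1 both variances come out close to 0.64, and decreasing β drives the limiting density toward a point mass at a, in line with Lemma 5.3.1.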
Let us now establish the possibility of using the generation method of the next
subsection for searching for the global maximum of f: the essence is that all search points
in these algorithms asymptotically have the distribution P(dx), which can be brought near to
a distribution concentrated on the set X* = {arg max f} of global maximizers by an
appropriate choice of the transition probability Q(x,dz). If this is the case, then the
majority of points determined by the generation methods are in the limit sufficiently close
to X*, and this is highly desirable for (random) optimization algorithms. (After carrying
out a sufficient number of evaluations of f in the vicinity of a point x* ∈ X*, its position
can be determined more exactly by constructing a regression model, say, a second-order
polynomial one.) This property is of special importance when f is evaluated with
random noise. Another positive aspect of the generation methods described below as
global optimization algorithms is their easy comparability in terms of the closeness of the
distributions P(dx) and ε*(dx). It is noteworthy that the algorithms of independent random
sampling of points in X (Algorithms 3.1.1 and 3.1.2) can also be classified as belonging
to the generation methods described below, if one assumes that Q(x,dz) = P_1(dz). For these
algorithms P(dx) = P_1(dx), and the points generated by them therefore do not tend to
concentrate in the vicinity of X*; from the viewpoint of asymptotic behaviour, they
are inferior to those generation methods whose distribution P(dx) is concentrated near
X*. (In this way, the situation here is quite similar to the situation concerning the simulated
annealing method, see Section 3.3.2.)
Let P_1(dx) be some probability distribution on (X, B), which usually is taken to be uniform.
It is assumed in the description of Algorithms 5.3.1 through 5.3.3 that X ⊂ R^n, P_1(dx) is
the uniform distribution on X, and the random variable ξ(x) takes values in the set
{0, 1, ...}.

The most straightforward algorithm, used for a long time for estimating λ, is based on
the N-fold sampling of the general branching process (see Harris (1963)) defined by
K(z,dx) and consists in the following.
Algorithm 5.3.1.

1. Sample N_1 = N times P_1(dx), obtain x_1^(1), ..., x_{N_1}^(1) and set k = 1.
2. Set i = 1, N_{k+1} = 0.
3. Sample the random variable ξ(x_i^(k)), obtain a realization r_i^(k).
4. Sample r_i^(k) times Q(x_i^(k),dx), obtain

    x_{N_{k+1}+1}^(k+1), ..., x_{N_{k+1}+r_i^(k)}^(k+1).

In this method and the subsequent algorithms the number of iterations is defined by a
number I.
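The steps above can be sketched as follows, under hypothetical choices not prescribed by the book: X = [0,1], P_1 uniform, ξ(x) Poisson-distributed with mean f(x) (so that E ξ(x) = f(x)), Q(x,dz) a Gaussian step reflected into [0,1], and a toy objective f:

```python
import math
import random

# Sketch of Algorithm 5.3.1: N-fold simulation of a branching process.
# f, the Poisson xi and the reflected Gaussian Q are illustrative choices.

def f(x):
    return 1.0 + 0.9 * math.exp(-50.0 * (x - 0.7) ** 2)

def poisson(rng, lam):
    # inverse-transform Poisson sampler (adequate for small lam)
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def sample_Q(x, rng, step=0.1):
    z = abs(x + rng.gauss(0.0, step))   # reflect at 0
    if z > 1.0:
        z = 2.0 - z                     # reflect at 1
    return min(max(z, 0.0), 1.0)

def branching_process(N1, I, seed=0):
    rng = random.Random(seed)
    gen = [rng.random() for _ in range(N1)]     # step 1: sample P1 (uniform)
    sizes = [len(gen)]                          # generation sizes N_1, N_2, ...
    for _ in range(I):
        children = []
        for xi in gen:
            r = poisson(rng, f(xi))             # step 3: realization of xi(x)
            children += [sample_Q(xi, rng) for _ in range(r)]   # step 4
        gen = children
        sizes.append(len(gen))
        if not gen:                             # the process has died out
            break
    return gen, sizes

gen, sizes = branching_process(N1=50, I=10)
if sizes[-2] > 0:
    print("estimate of lambda:", sizes[-1] / sizes[-2])
```

The ratio of successive generation sizes is exactly the estimator N_{I+1}/N_I of λ mentioned below.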
Since the random vectors x_i^(k) asymptotically (for N_1 → ∞, k → ∞) follow the
distribution P(dx) (see Harris (1963)), Algorithm 5.3.1 may be applied to the estimation
of the functional (5.3.6). The estimator is constructed in a standard fashion (used in
Monte Carlo methods):

    Ĵ = (I − I_0 + 1)⁻¹ Σ_{k=I_0}^{I} N_k⁻¹ Σ_{i=1}^{N_k} ξ(x_i^(k)),        (5.3.9)

where 1 ≤ I_0 ≤ I. In particular, for I_0 = I, one obtains the well-known estimator for λ:

    λ̂ = N_{I+1}/N_I.
From the computational point of view, Algorithm 5.3.1 is inconvenient in the sense
that for λ < 1 the process rapidly degenerates (all the particles, i.e. points x_i^(k), die or leave
X), while for λ > 1 the number of particles grows with k in geometric progression, so that
their storage soon becomes impossible. Modifications of the algorithm have been made
with the purpose of overcoming the latter inconvenience: they are called generation
methods with a constant number of particles and are described below.
Algorithm 5.3.2.

If the number of descendants (i.e. points x_i^(k+1)) at the k-th step of Algorithm 5.3.1 is
N_{k+1} > N_1, then N = N_1 particles (points) of the next generation are randomly chosen
from them. If N_{k+1} < N, particles of the preceding generation are added in the same
manner until their number in the new generation becomes N.

Algorithm 5.3.3.

The new generation is formed in Algorithm 5.3.1 by N-fold random choice with
replacement from the N_{k+1} descendants of the particles of the previous generation. If
N_{k+1} = 0, the sampling is repeated until N_{k+1} > 0.
With the use of the above algorithms, the distributions of the random vectors x_i^(k)
tend to P(dx) (for k → ∞, N → ∞); therefore, with N_k = N, the estimator (5.3.9) of (5.3.6) is
asymptotically accurate.

Obviously, the efficiency of Algorithm 5.3.2 still depends on λ; moreover, this algorithm
is not Markovian and is thus difficult to study. Algorithm 5.3.3, whose rate of convergence
can be investigated by a special technique (see Khairullin (1980)), is more attractive. Let us
write Algorithm 5.3.3 in a slightly more general and convenient form. To this end, note
that at the k-th iteration of Algorithm 5.3.3, the random choice with replacement is made
from the set

    {x_1^(k+1), ..., x_{N_{k+1}}^(k+1)}        (5.3.10)

with probabilities

    p_i^(k) = r_i^(k) / Σ_{j=1}^{N} r_j^(k).        (5.3.11)
Algorithm 5.3.4.

If

    Σ_{i=1}^{N} r_i^(k) = 0,

the sampling is repeated; otherwise the points of the next generation are obtained by N-fold
sampling of the distribution

    P_{k+1}(dx) = Σ_{i=1}^{N} p_i^(k) Q(x_i^(k), dx).
Although Algorithms 5.3.3 and 5.3.4 coincide in the probabilistic sense, their
interpretations in terms of particles may differ; see Ermakov and Zhigljavsky (1985) (this
work also describes some other approaches to the estimation problem of (5.3.6)).
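A compact sketch of the generation step of Algorithm 5.3.4 follows, under the same illustrative assumptions as before (X = [0,1], ξ(x) Poisson with mean f(x), a small reflected Gaussian step for Q; the objective f and all parameter values are hypothetical):

```python
import math
import random

# Generation method with a constant number N of particles (Algorithm 5.3.4):
# every point of the next generation is drawn from the mixture
# sum_i p_i Q(x_i, dx) with p_i = r_i / sum_j r_j.

def f(x):
    return 1.0 + 0.9 * math.exp(-50.0 * (x - 0.7) ** 2)

def poisson(rng, lam):
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def sample_Q(x, rng, step=0.05):
    z = abs(x + rng.gauss(0.0, step))
    if z > 1.0:
        z = 2.0 - z
    return min(max(z, 0.0), 1.0)

def generation_method(N, I, seed=0):
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(N)]           # initial generation ~ P1
    for _ in range(I):
        r = [poisson(rng, f(x)) for x in xs]        # offspring counts r_i
        while sum(r) == 0:                          # repeat the sampling if all r_i = 0
            r = [poisson(rng, f(x)) for x in xs]
        parents = rng.choices(xs, weights=r, k=N)   # choice with probabilities p_i
        xs = [sample_Q(x, rng) for x in parents]    # move each chosen parent by Q
    return xs

xs = generation_method(N=200, I=60)
print("mean of final generation:", sum(xs) / len(xs))
```

With a small step the final generation clusters around the maximizer x* ≈ 0.7 of this toy f, illustrating the concentration of the eigen-measure P_β for small β discussed in Section 5.3.2.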
This section deals with the generation method as formulated in the form of Algorithm
5.3.4. The analysis, like that of Section 5.2, relies upon Lemma 5.2.1, because Algorithm
5.3.4 is a special case of Algorithm 5.1.4 (in which N_k = N, f_k = f, β_k = β,
Q_k(z,dx) = Q(z,dx)), and Lemma 5.2.1 defines some fundamental properties of the latter
algorithm.
First let us prove an auxiliary assertion.
Lemma 5.3.2. Let the operator L = K*, defined by (5.3.2), be strictly positive, λ be the
maximal eigenvalue of K*, and P(dx) be the probabilistic eigen-measure corresponding
to this eigenvalue. Then the operator U, acting from M to M according to (5.3.12), is
invertible; equivalently, the equation

(5.3.13)

has no non-trivial solutions belonging to C(X). In order to show this, multiply (5.3.13)
by P(dz) and then integrate with respect to X. If u satisfies (5.3.13), then it satisfies
K*u = λu and ∫ u(z)P(dz) = 0: these relations together can be satisfied only by a function
that is identically equal to zero, since, by virtue of the property mentioned in Section 5.3.1,
the non-zero eigenfunction of K* corresponding to the eigenvalue λ is either strictly
positive or strictly negative. That proves the lemma.
Theorem 5.3.1. Let the conditions (a), (b), (c), (e) and (s) of Section 5.2 be satisfied;
assume that Q(z,dx) ≥ c_2 μ(dx) for μ-almost all z ∈ X, where c_2 > 0 and the probability
measure μ is the same as in the condition (e) of Section 5.2. Then

1) for any N = 1, 2, ... the random elements a_k = (x_1^(k), ..., x_N^(k)), k = 1, 2, ..., (as
defined in Algorithm 5.3.4) constitute a homogeneous Markov chain with stationary
distribution R_N(dx_1, ..., dx_N), the random elements with this distribution being
symmetrically dependent;

2) for any ε > 0 there exists N* ≥ 1 such that for N ≥ N* the marginal distribution
R^(N)(dx) of R_N differs from P(dx) by at most ε in variation.
Proof. Consider Algorithm 5.3.4 as an algorithm for sampling a homogeneous Markov
chain in D = X^N. Denote the elements of D by a = (x_1, ..., x_N); thus

    a_k = (x_1^(k), ..., x_N^(k)),

and the transition probability of the chain has the form

    Q̄(a,db) = ∫ ... ∫ F(x_1,dξ_1) ... F(x_N,dξ_N) × ...        (5.3.14)

Note that this transition probability is Markovian because of the Markovian assumption
concerning the transition probability Q(z,dx).
Let us prove that the recurrently defined distributions

    Q̄_{k+1}(db) = ∫_D Q̄_k(da) Q̄(a,db)

converge to a limit in variation for k → ∞. Indeed, this follows from (5.3.14) and the
conditions of the theorem, since

    Q̄(a,db) ≥ ∏_{j=1}^{N} Σ_{i=1}^{N} [c_1/(c_1 + (N−1)(max f + d))] Q(x_i,dz_j) ≥

    ≥ ∏_{j=1}^{N} N c_1 c_2 [c_1 + (N−1)(max f + d)]⁻¹ μ(dz_j) = c_3 μ^N(db),        (5.3.15)

where

    c_3 = [N c_1 c_2 /(c_1 + (N−1)(max f + d))]^N

(obviously, 0 < c_3 < 1). Now, it follows from the above and Neveu (1964),
Supplement to Section V.3, that

    var(Q̄_k − R_N) ≤ 2(1 − c_3)^{k−1}.        (5.3.16)
Using Lemma 5.2.1, we obtain that the random elements a_k with distribution Q̄_k(da) are
symmetrically dependent for all k = 1, 2, .... Let us show that the random elements with
distribution R_N(da) are symmetrically dependent as well. For any
B = (B_1 × ... × B_N) ∈ B^N, denote by F(B) the collection of sets B_{i_1} × ... × B_{i_N},
where (i_1, i_2, ..., i_N) is an arbitrary permutation of (1, 2, ..., N). Choose any two sets
B ∈ B^N, A ∈ F(B). By virtue of the fact that Q̄_k(B) = Q̄_k(A), and that (5.3.16) is
satisfied for all k = 1, 2, ..., we obtain that

    |R_N(B) − R_N(A)| ≤ |R_N(B) − Q̄_k(B)| + |Q̄_k(A) − R_N(A)|.

The left-hand side of the inequality is not influenced by k, and the right-hand side tends to
zero for k → ∞. Therefore, R_N(B) = R_N(A) for any B ∈ B^N, A ∈ F(B), which is equivalent
to the symmetrical dependence of random elements with probability distribution R_N(da).
Now, let us again make use of Lemma 5.2.1 with M = N, P_N = R_N: it follows that
R^(N)(dx) is representable as

(5.3.17)

where Δ_N → 0 in variation for N → ∞ with a rate of the order N^{−1/2}.

Finally, let us consider the operator T mapping M × M into M by

    T(Δ,R)(dx) = R(dx) − [∫ f(z)R(dz)]⁻¹ ∫ R(dz) f(z) Q(z,dx) − Δ(dx).

T is Fréchet-differentiable with respect to the second argument at the point (0,P), the
derivative being T_R'(0,P) = U, where the operator U is defined by (5.3.12). By virtue of
Lemma 5.3.2, the inverse operator U⁻¹ exists and is continuous and, therefore, one can
apply the implicit function theorem to (5.3.17). This completes the proof.
If the conditions formulated in Theorem 5.3.1 are met, then one can estimate the rate of
convergence of P(k,N,dx) to P(dx). Indeed, using (5.3.16) we obtain for all k = 1, 2, ...
that the distributions Q̄_k(da) converge with the rate of a geometric progression for k → ∞.
On the other hand, it follows from Lemma 5.2.1 and the implicit function theorem (see e.g.
Kantorovich and Akilov (1977), §4 of Ch. 17) that var(R^(N) − P) ≤ cN^{−1/2}, where c > 0
is a constant.

Thus, if the conditions of Theorem 5.3.1 are satisfied, then the distributions
P(k,N,dx) of the random elements x_j^(k) (j = 1, ..., N) are close (in variation) to P(dx) for
sufficiently large N and k and, therefore, the estimator (5.3.9) is applicable to (5.3.6). In
the case I_0 = I the estimate of the mean square error is readily derived.

The first term on the right-hand side of the corresponding inequality (the systematic
component) can be estimated by means of the above estimates; the order of the second term
(the random component) is N⁻¹, N → ∞.
Two facts should be mentioned that follow from the above results and may prove
useful, together with the discussion of Section 5.3.2 concerning the generation method as
a global optimization algorithm.

First, if f_0 is close (in the norm of the space C(X)) to f, then the solutions of (5.3.5)
corresponding to these functions will be close. This follows from the implicit function
theorem used in the proof of Theorem 5.3.1, Lemma 5.3.2 and the fact that the Fréchet
derivative with respect to the second argument of the operator V, acting from C(X) × M
into M, is V_R'(f,P) = U at the point (f,P), where U is defined by (5.3.12). This fact
justifies the use of Algorithm 5.3.4 for optimization of an estimate f_0 of the function f, if
the evaluations of f itself are too expensive.
Assume now that the optimization problem is stated in terms of estimating the point x*
on the basis of a fixed (but sufficiently large) number N_0 of evaluations of f (possibly,
with a random noise). If one applies Algorithm 5.3.4 and chooses Q (under the
assumption X ⊂ R^n) according to (5.3.7), then the following may be recommended for
choosing the algorithm parameters β, N, and I: first, β is chosen so that

    δ²(β) = var(P − ε*)

is small; then, using the convergence rate with respect to N and prior information about f
(approximate number of local extrema, Lipschitz constant, etc.), N is chosen; finally, I is
taken to be the integer part of N_0/N. The closeness of the distribution of the random
vectors x_j^(I) obtained at the last step of Algorithm 5.3.4 to the distribution ε*(dx) can be
estimated using the estimates of the convergence rate and δ²(β), whose value can be
estimated by means of the results of Section 5.3.2. Indeed, one has

(5.3.18)

On the basis of this inequality, one can formulate the problem of the optimal choice of β,
N and I as the minimization of the right-hand side of (5.3.18) under the constraint
NI ≤ N_0. The numerical solution of this problem encounters significant computational
difficulties, due to the lack of, or incomplete, knowledge of the constants involved in the
estimate.
The investigation of the generation method, as described in this section, is based on
the apparatus for analysing global search algorithms. That is why the results obtained are
of a fairly general character (in the sense of the techniques used), but need somewhat
specific assumptions that are natural in constructing global random search algorithms. Let
us remark that Mikhailov (1966), Khairullin (1980) and Kashtanov (1987) studied the
convergence rate of the generation method as presented in the form of Algorithms 5.3.3
and 5.3.4. The convergence rate estimate with respect to N was proved to have the form
O(N⁻¹), N → ∞, under assumptions that differ slightly from the above ones and are,
generally speaking, more natural for this problem. The approach described is,
nevertheless, still sensible because (i) it enables one to detect a number of qualitative
features of eigen-measure behaviour and (ii) it may be used, by virtue of its generality, for
the investigation of algorithms differing from the generation method (e.g. for the
sequential algorithms described in the following section).
The algorithms of this section are modifications of those described in Sections 5.2 and 5.3;
the basic difference is the possibility of using the points obtained at earlier iterations,
rather than only those obtained at the preceding one, for the determination of the
subsequent points.
Algorithm 5.4.1.

1. Sample N_1 times a probability distribution P_1(dx), obtain x_1, ..., x_{N_1}. Evaluate
y(x_i) at these points, where y(x) = f(x) + ξ(x) ≥ 0. If

    Σ_{i=1}^{N_1} y(x_i) = 0,

the sampling is repeated.

4. Obtain a point x_{k+1} by sampling the distribution P_{k+1}(dx), evaluate y(x_{k+1}).
5. If k ≥ I, then the algorithm is terminated. Otherwise return to Step 3, substituting
k+1 for k.
Similarly to the study of Algorithm 5.3.4, let us consider the asymptotic behaviour of the
unconditional distributions P(k,dx) corresponding to P_k(dx). Note that P(k,dx) = P_1(dx)
for k ≤ N_1.

Using the symbols of the assumption (g) of Section 5.2, the distributions P(k+1,dx)
for k ≥ N_1 can be represented as

    P(k+1,dx) = ∫_Z Σ_{i=1}^{k} Π(dθ) [k a(θ)]⁻¹ A(z_i, θ, dx) R_k(dz_1, ..., dz_k),        (5.4.1)

where R_k(dx_1, ..., dx_k) is the joint probability distribution of the random elements
x_1, ..., x_k and
    R_k(X, ..., X, dx_i, X, ..., X) = P(i,dx)  for i = 1, ..., k.
Theorem 5.4.1. Let the conditions (a), (b), (c), (e) and (s) of Section 5.2 be satisfied
and the operator K* be strictly positive. Then the distributions P(k,dx) defined by (5.4.1)
weakly converge, for k → ∞ and N_1 → ∞, to the eigen-measure P(dx) of K corresponding
to the maximal eigenvalue λ.

Proof. The distributions P(k,dx) converge in variation for k → ∞ to some probability
distribution S(dx): indeed, for any m ≥ 1, k ≥ N_1 + m one has

    P(k+m,dx) = ∏_{i=k}^{k+m−1} (1−p_i) P_k(dx) + ∏_{i=k+1}^{k+m−1} (1−p_i) p_k Q(x_k,dx) + ... + p_{k+m−1} Q(x_{k+m−1},dx)

for k → ∞, i.e. the sequence of distributions {P(k,·)} is fundamental in variation. Let us
show that the limit S(dx) of the sequence coincides with P(dx). For any A ∈ B we obtain

    S(A) = lim_{k→∞} P(k+1,A) = lim_{k→∞} (1/k) Σ_{i=1}^{k} ∫_Z Π(dθ) [a(θ)]⁻¹ A(z_i, θ, A).

The assertion will be proved if one can show that for any i = 1, 2, ..., k, δ_{i,k} → 0 (in
variation, for k → ∞), where δ_{i,k} ∈ M is defined by the corresponding term of the above
sum. But this fact is proved by almost literally repeating the second part of the proof of
Lemma 5.2.1, considering the fact that, for uniformly bounded sequences of random
variables, convergence in probability is equivalent to convergence in mean. The theorem is
proved.
Algorithm 5.4.2.

    p_i = y(x_i) / Σ_{j=k−N}^{k} y(x_j);
Algorithm 5.4.3.

Perform the same operations as in Algorithm 5.4.2, but for k > N discard the point at
which f has the least value among all the points included in the set of N points, rather
than the point x_i, i = k − N.
The question of convergence in this case, for any set of parameters N_1, N, is solved in
a simple manner: if f is continuous in the vicinity of at least one of its global maximizers
and if Q(x,B(z,ε)) ≥ δ(ε) > 0 for all x, z ∈ X and ε > 0, then the sequence {f(x_k)} converges
in probability to f(x*) for k → ∞.

As opposed to the parameters N_1 and N of Algorithm 5.4.2, their counterparts in
Algorithm 5.4.3 may be chosen in an arbitrary manner. If it is a priori improbable that f
reaches a local maximum, far from the global one, having a value near max f, then even
N_1 = 1 and N = 1 become acceptable. The resulting algorithm becomes Markovian (see
Section 3.3), converges under the above conditions, and the limiting distribution of the
points x_k is concentrated on the set of global maximizers.
Algorithm 5.4.3 is rather similar to the well-known controlled random search
procedure of Price, whose essence is as follows (for more details, see Price (1983, 1987)).
At the k-th iteration (k ≥ N > n), n+1 distinct points z_1, ..., z_{n+1} are chosen from the
collection Z_k consisting of the N points in store; these points define a simplex in R^n. Here
z_1 is the point with the greatest objective function value evaluated so far, and the other n
points are randomly drawn from the remaining N−1 points of Z_k. A point x_{k+1} is
determined as the reflection of the simplex's pole z_{n+1} with respect to the centroid z̄ of
the other n points, i.e. x_{k+1} = 2z̄ − z_{n+1}. If x_{k+1} ∉ X, then the operation is repeated. If

    f(x_{k+1}) < min_{z ∈ Z_k} f(z),

then Z_{k+1} = Z_k; otherwise x_{k+1} is included into Z_{k+1} in place of the point from Z_k
with the least objective function value. From the above it follows that Algorithm 5.4.3 and
Price's algorithm differ only in the way of choosing the next trial points.
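The step just described can be sketched as follows; the box X = [0,1]², the store size, the iteration count and the concave test objective are illustrative choices (maximization, matching the convention of this chapter):

```python
import random

# One iteration of Price's controlled random search: reflect the pole
# z_{n+1} of a random simplex in the centroid of the remaining n points
# and let the trial point replace the worst stored point if it is better.

def price_step(f, Z, rng, lo=0.0, hi=1.0):
    n = len(Z[0])
    trial = None
    for _ in range(100):                         # retry while the trial leaves X
        best = max(Z, key=f)                     # z_1: best point found so far
        others = rng.sample([z for z in Z if z is not best], n)
        pole = others[-1]                        # z_{n+1}: the reflected pole
        centroid = [(best[i] + sum(z[i] for z in others[:-1])) / n
                    for i in range(n)]
        cand = tuple(2.0 * centroid[i] - pole[i] for i in range(n))
        if all(lo <= c <= hi for c in cand):
            trial = cand
            break
    if trial is None:
        return Z
    worst = min(Z, key=f)
    if f(trial) > f(worst):                      # otherwise Z stays unchanged
        Z[Z.index(worst)] = trial
    return Z

rng = random.Random(0)
f = lambda z: -(z[0] - 0.3) ** 2 - (z[1] - 0.6) ** 2   # maximizer (0.3, 0.6)
Z = [(rng.random(), rng.random()) for _ in range(25)]
start_best = max(map(f, Z))
for _ in range(400):
    price_step(f, Z, rng)
print(max(map(f, Z)))
```

The best stored value never deteriorates, since only the worst point of the store is ever replaced.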
One may encounter various difficulties while applying the above developed technique to
specific problem classes. In this chapter we shall discuss ways of overcoming some of the
difficulties arising in constrained, infinite-dimensional, discrete and multicriteria
optimization problems.
Let X be a Borel subset of R^n, n ≥ 1 (X is the parameter space), B be the σ-algebra of
Borel subsets of X, μ_n(X) > 0, Φ be a continuously differentiable mapping of X into R^k
(k ≥ n), B_Y be the σ-algebra of Borel subsets of the set Y = Φ(X), x = (x_1, ..., x_n),
y = (y_1, ..., y_k), Φ = (φ_1, ..., φ_k). With this notation, y = Φ(x) means that

    y_1 = φ_1(x_1, ..., x_n),
    ...
    y_k = φ_k(x_1, ..., x_n).

For any x ∈ X, set

    d_ij(x) = Σ_{m=1}^{k} (∂φ_m(x)/∂x_i)(∂φ_m(x)/∂x_j),   D(x) = (det ||d_ij(x)||_{i,j=1}^{n})^{1/2},

where the determinant under the root sign is always non-negative, by virtue of the
non-negative definiteness of the matrix ||d_ij(x)||.
220 Chapter 6
is known to be valid (see Schwartz (1967), §10 of Ch. 4) for any B_Y-measurable function
f defined on Y and any set B from the collection {B: B = Φ(A), A ∈ B}, where ds is the
surface measure on Y = Φ(X). Hence, for any measurable non-negative function f defined
on Y and satisfying the condition

    ∫ f(Φ(x)) D(x) μ_n(dx) = 1,

the distribution f(Φ(x))D(x)μ_n(dx) on X induces the distribution f(s)ds on the manifold
Y = Φ(X). In the important particular case of

    c = ∫ D(x) μ_n(dx) < ∞,

the distribution c⁻¹D(x)μ_n(dx) induces the uniform distribution on Y.

Consider now two sets X_1, X_2 and two mappings

    Φ_i = (φ_1^(i), ..., φ_k^(i))  (i = 1, 2).
Lemma 6.1.1.

1) Let Y_1 = Y_2 = Y, H be a C¹-diffeomorphism of X_1 onto X_2 such that Φ_1 = Φ_2 ∘ H,
and f be a B_Y-measurable function, f ≥ 0, with

    ∫_Y f(s)ds = 1.

Then the distributions f(Φ_i(x))D_i(x)μ_n(dx) on X_i (i = 1, 2) induce the same distribution
f(s)ds on Y.

2) Let X_1 = X_2 = X, Φ_1 = cΦ_2 + b, where c is a constant and b is a constant vector.
Then D_1(x) = |c|^n D_2(x) for all x ∈ X.

3) If Φ is linear with respect to each coordinate, then D(x) = const.

4) Let X_1 = X_2 = X, Φ_1(x) = BΦ_2(x), where B is an orthogonal (k×k)-matrix (i.e.
BB' = I_k). Then D_1(x) = D_2(x) for all x ∈ X.

5) If n = k, then D(x) = |∂Φ/∂x| is the absolute value of the Jacobian of the
transformation Φ.

6) If k = n+1 and φ_j(x) = x_j (j = 1, 2, ..., n), then

    D(x) = (1 + Σ_{i=1}^{n} (∂φ_{n+1}(x)/∂x_i)²)^{1/2}.
Proof. The first assertion readily follows from Theorem 106 of Schwartz (1967). The
second, third and sixth statements are verified by direct calculation of the determinant.
The fifth follows from the fact that in the case n = k the matrix ||d_ij(x)|| is the product of
the Jacobian matrix of Φ and its transpose.

To prove the fourth assertion, write B = ||b_ij|| (i, j = 1, ..., k); the orthogonality of B
means that

    Σ_{m=1}^{k} b_mj b_mt = δ_jt = { 1 if j = t,  0 if j ≠ t }.

Let us show that d_ij^(1)(x) = d_ij^(2)(x):

    d_ij^(1)(x) = Σ_{m=1}^{k} [Σ_{l=1}^{k} b_ml ∂φ_l^(2)(x)/∂x_i] [Σ_{t=1}^{k} b_mt ∂φ_t^(2)(x)/∂x_j] =

    = Σ_{l,t=1}^{k} δ_lt (∂φ_l^(2)(x)/∂x_i)(∂φ_t^(2)(x)/∂x_j) = d_ij^(2)(x).
The above results enable us in some cases to simplify distribution sampling on
manifolds. For instance, it follows from the fourth assertion that a uniform distribution on
an n-surface may be defined to within an orthogonal transformation. Let us remark,
finally, that the validity of statements 2 through 6 of Lemma 6.1.1 also follows from the
diffeomorphism theorem, see Schwartz (1967).
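In computations, D(x) is simply sqrt(det(JᵀJ)) for the Jacobian J of Φ, and assertions 4 and 6 of the lemma are easy to verify numerically. A small sketch (the hemisphere parameterization and the rotation below are illustrative choices):

```python
import numpy as np

# D(x) = (det ||d_ij(x)||)^{1/2} with d_ij = sum_m (dphi_m/dx_i)(dphi_m/dx_j),
# i.e. D = sqrt(det(J^T J)) for the (k x n) Jacobian J of Phi.
def D(jacobian, x):
    J = jacobian(x)
    return np.sqrt(np.linalg.det(J.T @ J))

# Phi(x1, x2) = (x1, x2, sqrt(1 - x1^2 - x2^2)): upper unit hemisphere (k = n+1)
def jac_hemisphere(x):
    x1, x2 = x
    g = np.sqrt(1.0 - x1 ** 2 - x2 ** 2)
    return np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [-x1 / g, -x2 / g]])

x = np.array([0.3, 0.4])
d1 = D(jac_hemisphere, x)

# assertion 6: for a graph, D = (1 + |grad phi_{n+1}|^2)^{1/2}
grad = np.array([-0.3, -0.4]) / np.sqrt(1.0 - 0.3 ** 2 - 0.4 ** 2)
d_graph = np.sqrt(1.0 + grad @ grad)

# assertion 4: composing Phi with an orthogonal matrix B leaves D unchanged
t = 0.7
B = np.array([[np.cos(t), -np.sin(t), 0.0],
              [np.sin(t),  np.cos(t), 0.0],
              [0.0,        0.0,       1.0]])
d2 = D(lambda y: B @ jac_hemisphere(y), x)

print(d1, d_graph, d2)
```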
The relation between a distribution on the set X, having a non-zero Lebesgue measure,
and the corresponding distribution on the manifold Y = Φ(X) enables the reduction of
sampling on Y to sampling on X, which is usually much simpler and can be carried out by
standard methods (such as the inversion formula, the acceptance-rejection method or other
procedures described in Ermakov (1975) or in Devroye (1986)). Some methods of
sampling complicated distributions can also be used directly on the manifold Y without
any changes (which, of course, corresponds to applying the method on Y itself). For
convenience of reference, let us briefly describe the acceptance-rejection method on Y.
Let two distributions

    P_i(ds) = φ_i(s)ds  (i = 1, 2)

be defined on Y. The acceptance-rejection method consists in the sequential sampling of
pairs of independent realizations {ξ_j, α_j} of a random vector with distribution P_1(ds) and
a random variable with the uniform distribution on [0,1], until the inequality g(ξ_j) ≤ α_j is
observed. The latter ξ_j is then taken as a realization of a random vector with
distribution P_2(ds).
Random Search Algorithms 223
    Y = {y ∈ R^k: Σ_{i=1}^{k} a_i² y_i² = 1},  where a_i > 0  (i = 1, ..., k).

By virtue of the second and fourth assertions of Lemma 6.1.1, the uniform distribution on
Y is uniform on the original ellipsoid. Therefore, it suffices to sample only the first of
these distributions.
corresponds to the uniform distribution on Y+, where 2/q is the volume of ellipsoid Y
and
The representation Φ_1 = Φ_4 ∘ Φ_3 ∘ Φ_2 is valid, where Φ_2 is the mapping of X into B(n),

Φ_4: S(k)+ → Y+,

F(ds) = c_3 ( Σ_{i=1}^{k} a_i^2 s_i^2 )^{1/2} ds.
Sampling the above four distributions (on Y+, S(k)+, X and B(n)) is equivalent. Most naturally, F(ds) on S(k)+ is to be sampled by means of the acceptance-rejection method, where P_2(ds) = F(ds), P_1(ds) = q ds, and q = 2π^{-k/2}Γ(k/2+1), with 2/q standing for the volume of the unit sphere S(k). Sampling P_1(ds) is well known, see e.g. Zhigljavsky (1985), Devroye (1986).
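Sampling P_1(ds), the uniform distribution on the unit sphere, is indeed standard; a minimal sketch (not the book's own code) uses the rotation invariance of the Gaussian law:

```python
import math
import random

def uniform_on_sphere(k, rng=random):
    """Uniform point on the unit sphere in R^k: normalize a standard
    Gaussian vector, whose law is invariant under rotations."""
    g = [rng.gauss(0.0, 1.0) for _ in range(k)]
    norm = math.sqrt(sum(c * c for c in g))
    return [c / norm for c in g]

random.seed(1)
pts = [uniform_on_sphere(3) for _ in range(5000)]
# every point has unit norm; by symmetry each coordinate averages to ~0
print(sum(p[0] for p in pts) / len(pts))
```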
Algorithms for sampling non-uniform distributions on an ellipsoid may be constructed
in a similar manner.
Y = { y ∈ R^k :  Σ_{i=1}^{k} b_i y_i^2 = 1,  b_i < 0 (i = 1,...,m),  b_j > 0 (j = m+1,...,k) }.
Set n = k−1,

X = { x ∈ R^n :  Σ_{i=1}^{n} b_i x_i^2 ≤ 1 },

Φ: X → Y+,
D_2(z) = ( Π_{i=1}^{n} |b_i| )^{1/2} ( 1 + Σ_{j=1}^{m} z_j^2 )^{(n−m)/2}

on X_1 corresponds to P_1(dx) on X.
Thus, in order to determine a realization ξ of a random vector with distribution F(ds) on Y+, one has to obtain a realization ζ of a random vector with distribution P_2(dz) on X_1 and to take ξ = Φ(Φ_2(ζ)). Sampling distributions on the cylinder X_1 does not encounter any serious difficulty.
In this case

D(x) = ( 1 + 4 Σ_{i=1}^{n} b_i^2 x_i^2 )^{1/2},

where b_i > 0 (for i = 1,...,m), b_j < 0 (for j = m+1,...,k−1), b_k = −1, k/2 ≤ m < k.
Set n = k−1,

Φ: X → Y+.

In this case

D(x) = [ 1 + ( Σ_{i=1}^{n} b_i^2 x_i^2 ) / ( Σ_{i=1}^{m} b_i x_i^2 ) ]^{1/2}
on X corresponds to F(ds) on Y+. If m = k−1, then X = R^n and the sampling of P(dx) presents no basic difficulties. Now let m < k−1. Assume that X_1 is an unbounded cylinder, Φ_1: X → X_1, z = Φ_1(x):

z_i = b_i^{1/2} x_i   for i = 1,...,m,

z_j = |b_j|^{1/2} x_j ( Σ_{i=1}^{m} b_i x_i^2 )^{−1/2}   for j = m+1,...,n;

x = Φ_2(z):  x_i = b_i^{−1/2} z_i   for i = 1,...,m,

x_j = |b_j|^{−1/2} z_j ( Σ_{i=1}^{m} z_i^2 )^{1/2}   for j = m+1,...,n.
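As reconstructed above, Φ_1 and Φ_2 should be mutually inverse; a quick numerical check of this (the concrete values of b, m and x below are illustrative assumptions) is:

```python
import math

def phi1(x, b, m):
    """z = Phi_1(x): z_i = b_i^{1/2} x_i for i < m (0-based),
    z_j = |b_j|^{1/2} x_j (sum_{i<m} b_i x_i^2)^{-1/2} for j >= m."""
    s = math.sqrt(sum(b[i] * x[i] ** 2 for i in range(m)))
    return [math.sqrt(b[i]) * x[i] if i < m
            else math.sqrt(abs(b[i])) * x[i] / s
            for i in range(len(x))]

def phi2(z, b, m):
    """x = Phi_2(z), the inverse map: note sum_{i<m} z_i^2 = sum_{i<m} b_i x_i^2."""
    s = math.sqrt(sum(z[i] ** 2 for i in range(m)))
    return [z[i] / math.sqrt(b[i]) if i < m
            else z[i] * s / math.sqrt(abs(b[i]))
            for i in range(len(z))]

b = [2.0, 3.0, -1.5, -0.5]   # b_i > 0 for the first m coordinates, b_j < 0 after
m = 2
x = [0.4, -0.7, 1.2, 0.3]
z = phi1(x, b, m)
x_back = phi2(z, b, m)
print(max(abs(u - v) for u, v in zip(x, x_back)))  # ~ 0 up to roundoff
```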
Optimization in functional spaces means that the set X belongs to a functional space, and F is a set of functionals f: X → R. For example, numerous problems of mechanics and control are reducible to such optimization. As usual, taking problem-specific features into consideration enables one to develop specific and fairly efficient solution methods: e.g. the carefully elaborated calculus of variations is usually employed for the optimization of integral functionals depending on an unknown function and its derivatives.

Attempts to apply general numerical methods to optimization in functional spaces usually do not meet with basic difficulties, cf. e.g. Vasil'ev (1981). Formally, many of the random (including global) search methods can also be used for functional optimization, although this gives rise to some specific problems related to distribution sampling in functional spaces.
Two major problems - the uniform random choice of functions from X and the choice of a function close to a given one - occur in distribution sampling, which corresponds to the sampling of stochastic processes or fields. The organization of the uniform choice should depend entirely on X: if X is a subset of a space of the C[a,b] type, the Wiener measure may be chosen as the uniform measure on X; if there is some prior information about the smoothness of the functions in X, one of the Gaussian measures whose trajectories have the desired smoothness should be chosen instead of the Wiener measure. Gaussian measures are preferable because they lend themselves easily to theoretical study and there exist quite a few algorithms for their sampling.
Sampling a random function close to a given one is equivalent to sampling a random function close to zero: for solving this problem, one can also employ the methods of sampling Gaussian measures in a special manner, but the following two methods provide a more convenient way to parametrize the problem, under the assumption that X is a subset of one-dimensional functions. The first method is based on the fact that any Gaussian process is representable as

z(t) = Σ_i √λ_i ξ_i φ_i(t),   (6.2.1)

where λ_i and φ_i(t) (i = 1,2,...) are the eigenvalues and corresponding orthonormalized eigenfunctions of the correlation operator of the process, and ξ_1, ξ_2,... are independent normally distributed random variables with zero mean and unit variance. Sampling is defined by a finite number giving the number of terms in the decomposition (6.2.1) or a distribution on the set of these numbers, by fixing a basic set {φ_i(t)} or several sets among which one is randomly chosen each time, and by fixing small values of λ_1, λ_2,... or defining a corresponding distribution on the set of these numbers. Now the desired realizations of a close-to-zero Gaussian process are obtained by sampling the random variables ξ_1, ξ_2,... and all the specified distributions and substituting them into (6.2.1).
The second method of sampling a close-to-zero random parametrically defined function consists in the preliminary reduction of a given class of random functions z(t) to a class of parametrically defined functions z(t,θ), θ ∈ Θ, with the subsequent sampling of parameters θ ∈ Θ. As for the parametrization, it is natural to take it non-linear, since this provides a great variety of forms and profiles of the curves z(t,θ) with a small number of unknown parameters.

Concerning the definition of a distribution to be sampled on the parameter set, the only point to mention is that the quasi-uniform distributions (see Kolesnikova et al. (1987)) are defined by the condition of equal probability that the cross-section z(t_0,θ) of z(t,θ) passes through any point of a given interval [z_1,z_2]:

X ∩ {x(t) = x(t,θ)},   (6.2.2)

where θ_i (i = 1,2,3,4) are unknown parameters. Functions of the form (6.2.2) are well known to approximate with high accuracy any function with the above properties. Other methods of parametrization exist as well.
In complicated practical problems, one often comes across a situation where the points of X (called admissible solutions) are to be compared by multiple criteria rather than by a single one, i.e. the performance of the decision variants is evaluated by vector functions F = (f_1,...,f_m)', f_i: X → R (i = 1,...,m). A vector y from Y = F(X) ⊂ R^m will be called an estimate, and an estimate y* from Y such that there is no y ∈ Y for which y ≠ y* and y ≤ y* (i.e. the inequality ≤ holds for each component) will be referred to as an admissible estimate. The set of admissible estimates is called the Pareto set, and the corresponding set of feasible solutions is called the effective (Pareto-optimal) solution set.
Although a large portion of the literature dealing with multicriteria optimization is devoted to the analysis of the Pareto set P and similar subsets of Y, for practical purposes the description of the effective solution set E ⊂ X for given criteria F is of primary importance. In practice, one can obtain this description only by forming a fairly representative sample of E and then approximating it (e.g. by piecewise linear or piecewise quadratic approximation): the considerations below concern the generation of such a sample.
Some points of the set E may be obtained by solving the minimization problems for scalar trade-off criteria f_λ(x) (consideration may also be given to other sets of trade-off criteria); note that the above points constitute the entire set E if all the scalar criteria f_i (i = 1,...,m) are convex. Although the individual minimizers of f_λ(x) are not sufficient for describing the whole set E, they are usually sufficient for obtaining a representative sample from E.
Theoretically, the minimization of f_λ(x) under various parameters λ is the simplest way of determining points from E. This way, however, may be inefficient because of the difficulties related to the solution of the single-criterion problems: indeed, for non-convex (but, possibly, uniextremal) criteria f_i (i = 1,...,m), the trade-off criteria f_λ(x) are (already) multiextremal. Moreover, small variations of λ can result in abrupt changes of the location of the global minimizer. We indicate a number of approaches that are based on the ideas of random search, lend themselves readily to algorithmization and programming, and might prove useful in solving the problem of describing E.
First, it is natural to take values λ that are independent realizations of a random vector uniformly distributed on S_m.
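Assuming S_m denotes the simplex of non-negative weights summing to one, such uniformly distributed λ can be sampled by normalizing independent exponential variables:

```python
import math
import random

def uniform_on_simplex(m, rng=random):
    """Uniform lambda on {lambda_i >= 0, sum lambda_i = 1}: normalize m
    independent standard exponential variables (a Dirichlet(1,...,1) draw)."""
    e = [-math.log(rng.random()) for _ in range(m)]
    s = sum(e)
    return [v / s for v in e]

random.seed(3)
lam = uniform_on_simplex(4)
print(lam, sum(lam))  # non-negative components summing to 1
```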
Second, the search for the minima of f_λ(x) should be carried out for all the functions simultaneously, all the points obtained being tested for admissibility and rejected if found inadmissible.
Third, local random search is possible in various versions of the algorithm below if E is a priori known to be connected. At the first iteration, one or more points of E are determined as the minimizers of criteria f_λ(x); having several points of E, construct at the k-th iteration A_k - the linear hull of these points - then determine in a random fashion several new points in X near and far from A_k, compute F at these points, test the points' membership in E, and pass to the next iteration.
Fourth, the following natural approach (see Sobol and Statnikov (1981)) may be used for the numerical construction of E: choose in X a grid E_N with good uniformity characteristics (e.g. a Π_τ-grid), compute F at the grid points, then construct (in a finite number of comparisons) the set of effective points on E_N, which is an approximation of E for large N. The substitution of a finite number of specially selected points for the set X is the essence of this approach.
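The screening step of this fourth approach - keeping the non-dominated points of a finite point set - can be sketched as follows; the random sample and the two convex criteria below are illustrative assumptions:

```python
import random

def dominates(u, v):
    """u dominates v if u <= v componentwise and u != v (minimization)."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))

def effective_points(xs, F):
    """Keep the non-dominated (effective) points of a finite grid or sample."""
    vals = [F(x) for x in xs]
    return [x for x, y in zip(xs, vals)
            if not any(dominates(vals[j], y) for j in range(len(xs)) if vals[j] != y)]

# Two convex criteria on R with minimizers 0 and 1: the effective set is [0,1],
# so screening a sample from [-1,2] should keep (approximately) only points of [0,1].
random.seed(4)
xs = [3.0 * random.random() - 1.0 for _ in range(500)]
F = lambda x: (x ** 2, (x - 1.0) ** 2)
eff = effective_points(xs, F)
print(len(eff), min(eff), max(eff))
```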
Fifth, using the results of Chapters 4 and 7, one can formulate random search algorithms where the prospectiveness of uniformly chosen points and corresponding subsets of X is determined in a probabilistic, rather than deterministic, manner - after the determination of confidence intervals for the minima of f_λ(x) under randomly chosen λ ∈ S_m.

The following algorithm is simple, but it reflects the principal features discussed. Begin with determining N independent realizations x_1,...,x_N of a random vector uniformly distributed in X; next compute the vector function F at these points and take ℓ independent values λ(1),...,λ(ℓ) of a random vector with uniform distribution on S_m. Perform for all i = 1,...,ℓ the following procedure: choose that point x_i* of the obtained ones where f_{λ(i)}(x) is minimal; construct a ball B(x_i*,ε) with centre x_i* and radius ε selected so that only a small portion of the points x_1,...,x_N belong to the ball; using (4.2.25), construct a confidence interval of a fixed level 1−γ for

min_{x ∈ X\B(x_i*,ε)} f_{λ(i)}(x);

if f_{λ(i)}(x_i*) falls into the confidence interval, then construct in a similar manner a ball of the same radius in the set X\B(x_i*,ε), continuing until the corresponding value of f_{λ(i)} falls into the last confidence interval. The union of all the balls constructed is regarded as an approximation to the set of effective points. This union can be considered as the truncated search domain on which the same operation can be performed, but with a smaller ε. For N, ℓ → ∞ and natural regularity (say, convexity) conditions on f_i (i = 1,...,m), one can directly prove that the probability of missing the global minimizer of a randomly chosen f_λ(x) tends to a value not exceeding γ.
The metric may be chosen in the course of the search, guided by the following heuristic considerations. For the optimization of a function f, the ball of radius ε in the pseudo-metric ρ_f(x,z) = |f(x) − f(z)| contains the points z of X where f(z) differs from f(x) by at most ε. The objective of minimizing the Lipschitz constant leads to the following way of choosing a metric, best fitted to a given function f, from a given set Π = {ρ_1,...,ρ_l} of metrics or pseudo-metrics: fix a number k_0; normalize the metrics ρ_i so that the number of points in all the balls B(x,ε,ρ_i) of a fixed radius ε (say, ε = 1) approximately equals k_0; estimate the Lipschitz constant of f by (2.2.33) for each ρ_i ∈ Π, and use for the optimization that metric for which the Lipschitz constant estimate is minimal.

The set Π might contain metrics that are standard for the sets under consideration and, if possible, pseudo-metrics ρ_g for functions g that are close in some sense to f (e.g. estimates of f).
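A sketch of this metric-selection heuristic, with a crude pairwise difference-quotient estimate standing in for (2.2.33) (whose exact form is not reproduced here); the candidate metrics and the normalization rule below are illustrative assumptions:

```python
import random

def choose_metric(points, fvals, metrics, k0):
    """Pick from `metrics` the (pseudo-)metric giving the smallest crude
    Lipschitz-constant estimate max |f(x)-f(z)| / rho(x,z), after rescaling
    each metric so a unit ball holds about k0 points on average."""
    best, best_L = None, float("inf")
    n = len(points)
    for rho in metrics:
        d = sorted(rho(points[i], points[j])
                   for i in range(n) for j in range(i + 1, n))
        # rescale so that roughly k0 neighbours per point lie within distance 1
        scale = d[min(len(d) - 1, k0 * n // 2)] or 1.0
        L = max(abs(fvals[i] - fvals[j]) * scale / rho(points[i], points[j])
                for i in range(n) for j in range(i + 1, n)
                if rho(points[i], points[j]) > 0)
        if L < best_L:
            best, best_L = rho, L
    return best, best_L

random.seed(5)
pts = [random.random() for _ in range(40)]
f = [x * x for x in pts]
rho1 = lambda a, b: abs(a - b)          # a standard metric
rho2 = lambda a, b: abs(a - b) ** 0.5   # a candidate pseudo-metric
best, L = choose_metric(pts, f, [rho1, rho2], 5)
print(L)
```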
Let us demonstrate how the methods of branch and probability bounds may be used for optimizing discrete functions. The only basic difficulty lies in the fact that the apparatus of order statistics cannot be formally applied. Indeed, since there is a positive probability of hitting the bound (i.e. a global optimizer of f), there is no sense in considering conditions like (a) of Section 4.2, as the corresponding limit simply does not exist. Moreover, the order statistics of discrete distributions do not form a Markov chain, see Nagaraja (1982). But for a very great number m of points of X (only this case is of practical interest), one may assume that the probability of hitting the bound exactly is negligible and that the discrete c.d.f. F(t) = Pr{x: f(x) < t} is approximated, to high accuracy, by a continuous c.d.f., to which one can apply the theory of extreme order statistics and, therefore, the apparatus of Sections 4.2 - 4.4 and of Chapter 7. This way, for very large m the accuracy of statistical inference made under the continuity assumption diminishes only insignificantly. Such an approach was applied in Dannenbring (1977), Golden and Alt (1979), Zanakis and Evans (1981) (but, of course, these works do not use the more up-to-date statistical apparatus described in Chapters 4 and 7).
Now we shall turn to the study of the relative efficiency of discrete random search, following Ustyuzhaninov (1983) in the formulation of the problem. Below a problem type is considered for which an exact result on the relative efficiency of random search can be obtained.

Let X and Y be finite sets, X consisting of m elements, and let f: X → Y be an algorithmically given function. A non-empty point set M(f) is associated with f; e.g. M(f) is the set of its global minimizers. It is required to find a point x ∈ M(f) through sequential evaluations of f. The function f is known to belong to a finite set F consisting of ℓ functions. Thus, a table with ℓ rows and m columns is given in such a way that a function f ∈ F corresponds to a row and a point x ∈ X to a column: the value f(x) stands at the intersection of row f and column x.
Consider now the scheme of random search algorithms solving the given problem. An algorithm involves two stages. The first stage contains not more than s iterations. At each k-th iteration, either the transition to the second stage is decided (perhaps, at random) or a point for evaluating f is chosen according to a (conditional) probability distribution. At the second stage a point x is chosen that is thought to belong to M(f): here a mistake is possible.

An algorithm is called deterministic if all the indicated probability distributions are degenerate (i.e. all decisions are deterministic); otherwise it is called probabilistic.
Define the problem π as a pair π = (F, M) consisting of a class of functions F and a family of sets M = {M(f), f ∈ F}. A deterministic algorithm d is called applicable to a problem π if for any function f ∈ F it gives an element from M(f) without mistake. Denote the class of all such algorithms by D(π). The maximal number N(d) of evaluations of f needed when using an algorithm d is called the problem hardness with respect to d. The problem complexity is defined as

N_D = min_{d ∈ D(π)} N(d).

Let p(M(f)|f,r) be the probability that the application of an algorithm r to a function f yields a correct solution. We shall say that an algorithm r p-solves a problem π if p(M(f)|f,r) ≥ p for any f ∈ F. Denote by R_p(π) the class of p-solving algorithms for a problem π and set
for the index γ of a full problem π. The inequality (6.2.3), however, is wrong. In order to show this, it suffices to let p tend to zero on the right-hand side of (6.2.3):

lim_{p→0} ( log(pm) / (−log(1−p)) + 1 ) = −∞,
thus the minimum of φ is attained at the point v_0. In the inequality (6.2.4) set v = [γ + 1/log q], express this quantity as v = γ + 1/log q − σ (where 0 ≤ σ < 1), and substitute it into (6.2.4), obtaining N_D ≤ m q^γ γ/χ(σ), where

The function χ decreases on the interval [0,1), since χ'(0) = 0 and χ'(σ) < 0 for σ > 0. Therefore χ(σ) > χ(1) for 0 ≤ σ < 1, whence N_D ≤ m q^γ γ/χ(1). Using this inequality and the relation γ ≤ N_D, one obtains γ ≤ K(q,m), where

If v_0 ≤ 0, i.e. γ ≤ −1/log q, then the inequality (6.2.4) is equivalent to N_D ≤ m: this way N_D ≤ m−1 for m ≤ q^(1−1/log q). If q ≤ 1/e, then the restrictions on m are absent. Summing up the above, the indices γ of full problems π_m satisfy the following conditions:

a) γ ≤ m−1 for q > 1/e, m ≤ q^(1−1/log q),

b) −1/log q ≤ γ ≤ K(q,m) for q > 1/e, m > q^(1−1/log q),

c) 0 ≤ γ ≤ K(q,m) for q ≤ 1/e,

where K(q,m) is determined by (6.2.5).
As follows from condition b), the maximal acceleration due to using random search is (s+1)/(s−1), and it is attained for q = exp(s/(s−1)), where s is the solution of the equation ms = exp(−s).
PART 3. AUXILIARY RESULTS
This chapter describes and studies statistical procedures that have a significant place in the theory of global random search (these procedures are included in some of the methods of Chapter 4). Most attention is paid to linear statistical procedures that are simple to implement.

Section 7.1 states the problems and considers the case when the value of the tail index of the extreme value distribution is known. This is a typical situation in global random search theory, as follows from the results of Section 4.2.6. Section 7.2 deals with the case when the value of the tail index is unknown. Finally, Section 7.3 investigates the asymptotic normality and optimality of the best linear estimates.
7.1 Statistical inference when the tail index of the extreme value
distribution is known
Let the upper bound M = vrai sup y of a random variable y be a.s. finite; statistical inference for M is considered throughout the chapter. Statistical inference for the lower bound L = vrai inf y under the supposition L > −∞ is constructed similarly, or can be obtained in an elementary way from the results related to M, and hence will not be considered here.
Various approaches can be used for constructing statistical inference. In particular, the parametric approach is based on the supposition that an analytical form of the c.d.f. F(t) is accepted (identified), but some of its parameters are unknown and are estimated from the sample. This approach is of moderate interest in the context of global random search theory and is not considered below.
The yearly maximum approach, thoroughly described by Gumbel (1958) and studied in a number of works, the most valuable of which is Cohen (1986), involves the partition of the sample Y of size N = rh into r equal subsamples and the estimation of the extreme value distribution parameters as if the maximal elements of the subsamples had this distribution; it is generally inefficient as well. After this inefficiency was realized, many works were devoted to the problem: among them the work of Robson and Whitlock (1964) was the first, and most of them became known in the 1980's. These works (including those of the present author) construct and study statistical inference about M based on using some k+1 elements of the maximal order statistics
240 Chapter 7
η_{N−k} ≤ η_{N−k+1} ≤ ... ≤ η_N   (7.1.2)

from the set H = {η_1,...,η_N} of the order statistics derived from the sample Y, rather than using the whole sample Y. The following arguments may be put in favour of this approach: (i) according to a heuristic reasoning, the order statistics not belonging to (7.1.2) are far enough from M and so do not carry much valuable information concerning M; (ii) the theoretical considerations presented below, as well as the corresponding numerical results, show that if k is sufficiently large, then a further increase of k (under N → ∞) may lead either to an insignificant improvement or even to a deterioration of the precision of the statistical procedure (due to the inaccuracy of computations); (iii) using (7.1.2), the asymptotic theory of extreme order statistics can be applied to construct and investigate the decision procedures. We shall confine ourselves to this approach, it being a semiparametric one (at present we do not see any satisfactory alternative).
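The use of the top k+1 order statistics (7.1.2) can be illustrated by a minimal sketch; the underlying c.d.f. F(t) = t² on [0,1] (so that M = vrai sup y = 1) is an illustrative assumption:

```python
import random

def top_order_statistics(sample, k):
    """Return eta_{N-k} <= ... <= eta_N, the k+1 largest order statistics."""
    s = sorted(sample)
    return s[-(k + 1):]

# y = sqrt(U) has c.d.f. F(t) = t^2 on [0,1], hence M = 1; the record eta_N
# always underestimates M, which motivates inference based on the top k+1
# order statistics rather than on the record alone.
random.seed(6)
y = [random.random() ** 0.5 for _ in range(10000)]
top = top_order_statistics(y, 4)
print(top[-1] < 1.0, len(top))
```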
The following classical result from the theory of extreme order statistics (see, for instance, Galambos (1978)) is essential for the theory presented later.

lim_{v→∞} V(tv)/V(v) = t^{−α}   (7.1.3)

Ψ_α(z) = exp{−(−z)^α} for z < 0,   Ψ_α(z) = 1 for z ≥ 0.   (7.1.5)
The asymptotic relation (7.1.3) implies that (under N → ∞) the sequence of random variables (η_N − M)/(M − θ_N) converges in distribution to the random variable with the c.d.f. (7.1.5), which is called the extreme value c.d.f. and, together with Ψ_∞(z) = exp{−exp(−z)}, is the only nondegenerate limit of the c.d.f. sequences for (η_N + a_N)/b_N (where {a_N} and {b_N} are arbitrary numerical sequences).
Bounds of Random Variables 241
The parameter α of the c.d.f. (7.1.5) is called the tail index (or the shape parameter) of the extreme value distribution. This section deals with the case where condition a) of Theorem 7.1.1 holds and the value of the parameter α is known; Section 7.2 will treat the case of unknown α.
(7.1.6)

where ξ_0, ξ_1,...,ξ_i are mutually independent exponential random variables with the density e^{−x}, x ≥ 0. The second is the asymptotic relation

M − F^{−1}(x) ~ (M − θ_N)(−N log x)^{1/α},   N → ∞,   (7.1.7)

due to Cook (1979), which is a simple consequence of (7.1.3). Here the notation a_N ~ b_N (N → ∞) means that the limit values lim_{N→∞} a_N and lim_{N→∞} b_N exist and are equal. (Here convergence in distribution is considered if {a_N} and {b_N} are sequences of random variables.)
The next statement immediately follows from (7.1.6) and (7.1.7).
Lemma 7.1.1. If a) holds, then for N → ∞, i/N → 0 the asymptotic equality

(7.1.8)

x > 0.   (7.1.9)

Lemma 7.1.2. Let assumption a) hold, α > 1, N → ∞, i²/N → 0 (for instance, i may be constant). Then
Proof. It is well known (cf. e.g. Galambos (1978)) that the density of the order statistic η_{N−i} is

where φ(x) is the density of the c.d.f. F(x). It then follows that

where

I_1 = N C_{N−1}^i ∫_0^1 x^{N−i−1}(1 − x)^i dx = 1,

I_2 = N C_{N−1}^i ∫_0^1 (−N log x)^{1/α} x^{N−i−1}(1 − x)^i dx =

= (N!/(i!(N−i−1)!)) N^{1/α} ∫_0^∞ y^{1/α} e^{−Ny} (e^y − 1)^i dy.

Substituting x = Ny, we obtain

I_2 = ((N−1)!/(i!(N−i−1)!)) ∫_0^∞ x^{1/α} e^{−x} (e^{x/N} − 1)^i dx ~

~ ((N−1)!/(i!(N−i−1)! N^i)) ∫_0^∞ x^{i+1/α} e^{−x} dx = h_{i,N} b_i,

with b_i = Γ(i + 1 + 1/α)/Γ(i + 1), where

h_{i,N} = ((N−1)/N)((N−2)/N) ... ((N−i)/N) ~ 1   (7.1.12)

for N → ∞, i²/N → 0. Substituting the derived expressions for I_1 and I_2 we obtain (7.1.10): the lemma is proved.
Note that results on the speed of convergence to the extreme value distribution are contained in §2.10 of Galambos (1978), Smith (1982), and Falk (1983): the results mentioned show that in the case α ≤ 1 one has

instead of (7.1.10). This case is of minor interest and will not be considered.

Lemma 7.1.3. Let assumption a) hold. Then for α > 1, N → ∞, i²/N → 0, we have

(7.1.13)
p_{N−i,N−j}(x, y) = A F^{N−i−1}(x) φ(x) (F(y) − F(x))^{i−j−1} φ(y) (1 − F(y))^j.

Changing the variables similarly to the proof of Lemma 7.1.2, using the asymptotic expressions (7.1.7), (7.1.11) and (7.1.12), introducing the notation

B = (M − θ_N)² / (j! (i − j − 1)!),

and integrating by parts at the end of the proof, we obtain the chain of relations

= B h_{i,N} ∫_0^∞ dv ∫_v^∞ v^{j+1/α} u^{1/α} e^{−u} (u − v)^{i−j−1} du ~

~ B ∫_0^∞ ∫_0^∞ (1 + y)^{1/α} y^{i−j−1} v^{2/α+i} exp{−v(y + 1)} dv dy = ...
The asymptotic equalities (7.1.10) and (7.1.13) were formulated by Cook (1979) without proof and under the inexact assumption i/N → 0 instead of i²/N → 0 for N → ∞. The inadequacy of the condition i/N → 0 (N → ∞) follows from the fact that in this case (7.1.12) does not hold. Indeed, if N → ∞, i/N → 0, but i²/N → 0 does not necessarily hold, we have

h_{i,N} = Π_{j=1}^{i} (1 − j/N) = exp{ Σ_{j=1}^{i} log(1 − j/N) } ~ exp{−i²/(2N)}

instead of (7.1.12). So if i²/N does not approach zero while N → ∞, then

lim_{N→∞} h_{i,N} ≠ 1

and the asymptotic equalities (7.1.10) and (7.1.13) are not valid. Instead of them, the asymptotic representations of the following corollary hold.
Corollary 7.1.1. Let assumption a) hold, α > 1, N → ∞, i ≥ j, i/N → 0. Then

We shall further omit the multiplier exp{−i²/(2N)} (or exp{−l/2}), supposing i²/N → 0 for N → ∞. Modifications of the statements given below for the more general case, when i²/N → l < ∞ while N → ∞, are obvious.
7.1.3 Estimation of M

We suppose again that condition a) holds and the value of the tail index α > 1 is known. Under these suppositions we consider below various estimates of the maximum (essential supremum) M = vrai sup y of a random variable y. The estimates use the k+1 upper order statistics (7.1.2) corresponding to an independent sample of y.

The most well-known estimates of M are linear, having the form

M_{N,k} = Σ_{i=0}^{k} a_i η_{N−i}.   (7.1.15)

Lemma 7.1.2 states that if a) holds, α > 1, N → ∞ and k²/N → 0, then

E M_{N,k} = Σ_{i=0}^{k} a_i E η_{N−i} = M Σ_{i=0}^{k} a_i − (M − θ_N) a'b + o(M − θ_N),   (7.1.16)

where b = (b_0,...,b_k)' with

b_i = Γ(i + 1 + 1/α)/Γ(i + 1).

Since the c.d.f. F(t) is continuous, we have θ_N ≠ M and M − θ_N → 0 for N → ∞. Using now (7.1.16), the finiteness of the variances of η_{N−i} for i = 1,...,k and the Chebyshev inequality, we obtain the following statement.
Proposition 7.1.1. Let assumption a) hold, N → ∞, k²/N → 0. Then the estimate (7.1.15) is consistent if and only if the equality

Σ_{i=0}^{k} a_i = 1   (7.1.17)

holds.

Proposition 7.1.2. Let assumption a) hold, α > 1, N → ∞, k²/N → 0. Then for consistent linear estimates M_{N,k} of the form (7.1.15), the asymptotic expressions

are valid, where Λ is the symmetric matrix of order (k+1)×(k+1) with elements Λ_ij defined for i ≤ j by (7.1.14).
We refer to (7.1.17) as the consistency condition and to

a'b = 0   (7.1.20)

as the unbiasedness requirement. Certainly, if (7.1.20) holds, then the estimate (7.1.15) still remains biased, but for α > 1 its bias has the order O(N^{−1}), as N → ∞.

Choose the right-hand side of (7.1.19) as the optimality criterion for consistent linear estimates (7.1.15) in the case α > 1. The optimal consistent estimate M_{N,k}* and the optimal consistent unbiased estimate M_{N,k}+ (the word consistent will be dropped) are determined by the vectors

(7.1.21)
(7.1.22)

(7.1.23)

These expressions are easily derived by introducing Lagrange multipliers. (7.1.21) and (7.1.22) are due to Cook (1980) and Hall (1982), respectively.

If k is not small, then the vectors (7.1.21) and (7.1.22) are hard to calculate, since the determinant of the matrix Λ = Λ_k is almost zero. Namely, the following statement holds.
The proposition above will be proved in Section 7.3.2. Fortunately, simple expressions for the components of the vectors (7.1.21) and (7.1.22) can be derived (see Section 7.3.1). Using these, the components of a* = (a_0*,...,a_k*)' and a+ = (a_0+,...,a_k+)' can be easily calculated for any α and k = 1,2,.... The following tables present them for α = 2, 5, 10 and k = 2, 4, 6.
It is also proved in Section 7.3 that the optimal linear estimates are asymptotically Gaussian and efficient. In particular, the following result holds (being included in Theorem 7.3.1).
Let the (somewhat stricter than a)) condition

F(t) = 1 − c_0 (M − t)^α + o((M − t)^α),   t → M,   (7.1.25)

hold, where 2 ≤ α < ∞ and c_0 is a positive number, and let N → ∞, k → ∞, k²/N → 0. Then the asymptotic normality relation

is valid, where

(for α > 2 and for α = 2, respectively) is the asymptotic mean square error of M_{N,k}* (i.e. E(M_{N,k}* − M)² ~ σ_{N,k}² for N → ∞, k → ∞, k²/N → 0).
An analogous result is valid for the estimates M_{N,k}+ and for other related ones. Thus, Csörgő and Mason (1989) show it for the linear estimates determined by the vectors a with components
a_i = v_i   for α > 2, i = 0,...,k−1,
a_k = v_k + 2 − α   for α > 2, i = k,
a_0 = 2/log(k + 1)   for α = 2, i = 0,   (7.1.27)
a_i = (log(1 + 1/i))/log(k + 1)   for α = 2, 1 ≤ i ≤ k−1,
a_k = (log(1 + 1/k) − 2)/log(k + 1)   for α = 2, i = k,

where
Hall (1982) does the same for the maximum likelihood estimates that are determined by
(4.2.22) and (4.2.23).
For practical use, the very simple estimate

and so

where ψ(·) = Γ'(·)/Γ(·) is the psi-function and −ψ(1) ≈ 0.5772 is the Euler constant.
Proposition 7.1.4. Let (7.1.25) hold, N → ∞, k → ∞, k²/N → 0. Then the estimate

(7.1.28)

is consistent.

Proof. We have

(7.1.29)

for N → ∞ and each i ≤ k, where ξ_0, ξ_1,... are independent and exponentially distributed with the density e^{−x}, x > 0. Therefore the mean of the estimate tends to c_0 for k → ∞; an analogous computation for the second moment shows that its variance vanishes for k → ∞. The consistency of (7.1.28) now immediately follows from (7.1.29) and the Chebyshev inequality. The proposition is proved.
An alternative way of estimating c_0 is due to Hall (1982) and consists in setting

(7.1.30)

where M̂ is an estimate for M. If M̂ is the maximum likelihood estimate (MLE) for M, then (7.1.30) is the MLE for c_0.

The above approach to constructing confidence intervals for M can be used only if N is so large that k can also be chosen large enough. For moderate values of N, this approach is not suitable and another one, due to Cook (1979), Watt (1980) and Weissman (1981), can be recommended. It is based on the following statement, which we prove since the above references do not contain the proof.
Lemma 7.1.4. Let assumption a) hold, N → ∞, k be fixed. Then the sequence of random variables

converges in distribution to a random variable with the c.d.f.

F_k(u) = 1 − (1 − (u/(1 + u))^α)^k,   u ≥ 0.   (7.1.32)

= k ∫_w^∞ y^{k−1} (y + 1)^{−k−1} dy = 1 − (w/(w + 1))^k.

(7.1.33)
for M asymptotically equals δ_2 − δ_1, where 0 ≤ δ_1 < δ_2 ≤ 1 (for N → ∞, k/N → 0). In many applications (including global random search theory), the one-sided confidence intervals for M (which can be obtained from (7.1.33) by setting δ_1 = 0, δ_2 = 1 − γ) are most naturally used. Let us investigate their average length in order to draw conclusions on the necessary number of the order statistics η_{N−i}.

Proposition 7.1.5. Let condition a) hold, N → ∞, k²/N → 0, and let γ ∈ (0,1) be a fixed number. Then the average length of the confidence interval (7.1.34) for M asymptotically equals (M − θ_N)φ(k,γ), where
φ(k, γ) = r_{k,1−γ} [ Γ(k + 1 + 1/α)/Γ(k + 1) − Γ(1 + 1/α) ] ~ (−log γ)^{1/α}   (7.1.35)

for k → ∞. Indeed,

Γ(k + 1 + 1/α)/Γ(k + 1) ~ k^{1/α},   k → ∞,

φ(k, γ) ~ k^{1/α} ( (1 − γ^{1/k})^{−1/α} − 1 )^{−1} ~ (k(1 − γ^{1/k}))^{1/α} =

= ( (1 − γ^{1/k})/(1/k) )^{1/α} ~ (−γ^{1/k} log γ)^{1/α} ~ (−log γ)^{1/α}.
match with the asymptotic requirements k → ∞, k/N → 0 (N → ∞). The numerical results also demonstrate that the convergence rate in (7.1.35) increases if γ decreases. Note that analogous conclusions about the selection of k, via a numerical analysis of the two-sided confidence intervals (7.1.33), were drawn by Weissman (1981), who did not deduce asymptotic expressions similar to (7.1.35).
(7.1.36)

β_N(M,γ) ~ 1 − (1 − γ)T(k,λ) + λ^{1/α} δ Γ(k + 1 − 1/α, λ)/Γ(k).

Using (7.1.8) one obtains for N → ∞ the chain of (approximate) relations

= (1/Γ(k)) ∫∫_{D_2} (y_1 − y_0)^{k−1} exp{−y_1} dy_0 dy_1 = α_N(M;γ).
δ = α (z/(1 + z))^α (1 − (z/(1 + z))^α)^{k−1},

= 1 − (1/k!) Σ_{j=1}^{k} C_k^j (−1)^{j−1} (z/(1 + z))^{αj} ∫ y^k e^{−y} (1 − (λ/y)^{1/α})^{αj} dy ~

~ 1 − (1/k!) Σ_{j=1}^{k} C_k^j (−1)^{j−1} (z/(1 + z))^{αj} ∫ y^k e^{−y} (1 − αj λ^{1/α} y^{−1/α}) dy =

= 1 − (1 − γ)T(k,λ) + λ^{1/α} δ Γ(k + 1 + 1/α, λ)/Γ(k).
= (1/k!) [ ∫_κ^∞ y^k e^{−y} dy − Σ_{j=1}^{k} (−1)^{j−1} C_k^j (z/(1 + z))^{αj} × ... ] =

= 1 − T(k,κ) − (1/k!) Σ_{j=1}^{k} C_k^j (−1)^{j−1} (z/(1 + z))^{αj} × ...   (7.1.37)

lim_{k→∞} T(k, ck) = 0.
Indeed,

T(k, ck) = (1/k!) ∫_{ck}^∞ y^k e^{−y} dy = ((ck)^{k+1}/k!) I,

where

I = ∫_1^∞ exp{k(log t − ct)} dt.
We shall apply the saddle-point approximation to the integral I. The function log t − ct attains its maximal value −c on the interval [1,∞) at t = 1. This way,
The fourth statement of Theorem 7.1.2 follows from the asymptotic inequality for the probability of the second type error, for which (7.1.37) gives an approximation: the error probability for testing the hypothesis H_0: M < K versus H_1: M ≥ K is representable as 1 − β_N(M,1−γ), and so can be approximated using the above formulas.
An ordinary way of drawing statistical inference concerning M when the tail index α is unknown consists in substituting some estimator α̂ of α into the formulas determining the statistical inference for the case of known α. Obviously, the accuracy of such statistical inference is the main problem arising here. The most advanced results in this field were obtained for the case when α and M are estimated by the maximum likelihood technique; below we shall state some of them.

First, let us follow Hall (1982) to construct the maximum likelihood estimators for M and formulate their properties.
Suppose that, instead of the asymptotic equality (7.1.25), the relation

F(t) = 1 − c_0 (M − t)^α

takes place for each t in some interval [M−δ, M], where c_0 > 0, α ≥ 2 and M are unknown parameters, and the order statistics η_N,...,η_{N−k} fall into the interval [M−δ, M]. Under these suppositions the likelihood function is
Maximizing this function with respect to M, c₀ and α, one obtains the maximum likelihood estimators M̂, ĉ₀ and α̂, expressed as follows: M̂ is the minimal solution of the equation

(k+1) ( 1 / Σ_{j=0}^{k−1} log(1 + β_j(M)) − 1 / Σ_{j=0}^{k−1} β_j(M) ) = 1,

provided M̂ ≥ η_N (if the solution does not exist, then η_N is taken as M̂),

α̂ = (k+1) / Σ_{j=0}^{k−1} log(1 + β_j(M̂)),    (7.2.1)

and the corresponding explicit expression for ĉ₀ is used.
Hall (1982) proved the asymptotic normality of the obtained estimator M̂, with mean M and variance (α−1)² σ²_{N,k}, where σ²_{N,k} is given by (7.2.2) for α > 2 and by (7.2.3) for α = 2, with

β = γ/(γ + 1/2),   γ = min{1, 1/α}.
Smith (1987) used a different approach to construct the maximum likelihood estimators. To describe it, let us first introduce the so-called generalized Pareto distribution by

G(t; v, σ) = 1 − (1 − vt/σ)^{1/v},    (7.2.6)

where σ > 0, 0 < t < ∞ for v ≤ 0 and 0 < t < σ/v for v > 0. Now let y be a random variable with c.d.f. F(t) and upper bound M ≤ ∞, and let h < M. Then

F_h(t) = ( F(h+t) − F(h) ) / ( 1 − F(h) )    (7.2.7)

is the conditional c.d.f. of y − h given y > h. Pickands (1975) showed that (7.2.6) is a good approximation of (7.2.7), in the sense of the relation

lim_{h→M} sup_{0<t<M−h} | F_h(t) − G(t; v, σ(h)) | = 0    (7.2.8)

for some fixed v and function σ(h), if and only if F is in the domain of attraction of one of the three limit probability laws (namely, Φ_α, Ψ_α and Λ(z) = exp(−e^{−z})). In the case of the c.d.f. Ψ_α, the constant v in (7.2.6) equals 1/α.
Now the approach of Smith (1987) is as follows. Let N be sufficiently large, y₁,...,y_N be independent realizations of the random variable y with c.d.f. F(t), h = h(N) be a high threshold value, k be the number of exceedances of h, and x₁,...,x_k denote the corresponding excesses; that is, x_i = y_j − h, where j = j(i) is the index of the i-th exceedance.
Under fixed N, the excesses x₁,...,x_k are independent and have the c.d.f. (7.2.7). Relying upon (7.2.8), the generalized Pareto c.d.f. G(t; v, σ) is substituted for (7.2.7) in the construction of the likelihood function. Its maximization yields the maximum likelihood estimators σ̂_N and v̂_N for σ and v, respectively. The corresponding estimator for M is

h + σ̂_N / v̂_N.

Smith (1987) extensively studied the asymptotic properties (including the asymptotic normality and efficiency) of these estimators for M and v (= 1/α) under fairly general
conditions on F. Smith's results cover the case 0 < α < 2, together with the case α ≥ 2 and the other two limit laws for the extremes.
To construct confidence intervals for M and to test statistical hypotheses about M in the case of unknown α, one can use the above mentioned results of Hall and Smith concerning the asymptotic normality of the maximum likelihood estimates of M. Recall again that, generally, this approach is applicable when N is very large. The alternative techniques of de Haan (1981) and Weissman (1982) seem more suitable if k cannot be chosen very large (this holds e.g. for moderate values of N, say N ≈ 100–200). De Haan proved that for N→∞, k→∞, k/N→0, the test statistics (4.2.34) converge in distribution to the standard exponential random variable with density e^{−t}, t > 0. (Similar test statistics were considered by Weissman.)
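Smith's procedure estimates (v, σ) of the excess law by maximum likelihood. As a rough numerical illustration of the threshold-excess idea only — using simple moment estimators as a stand-in for the MLE, with the parametrization (7.2.6) and all concrete numbers chosen for the example — one can recover the endpoint M = 1 of a simulated sample:

```python
import random
import statistics

random.seed(7)
M, alpha = 1.0, 2.0            # true upper bound and tail index (assumed)
N = 100_000
# F(t) = 1 - (M - t)^alpha on [M-1, M]; inverse-c.d.f. sampling:
ys = [M - random.random() ** (1 / alpha) for _ in range(N)]

h = sorted(ys)[int(0.99 * N)]  # high threshold (99% quantile)
excesses = [y - h for y in ys if y > h]

# Moment estimators for the generalized Pareto excess law:
# mean = sigma/(1+v), variance = sigma^2/((1+v)^2 (1+2v)) for v > -1/2.
m1 = statistics.fmean(excesses)
s2 = statistics.variance(excesses)
v_hat = (m1 * m1 / s2 - 1) / 2
sigma_hat = m1 * (1 + v_hat)

M_hat = h + sigma_hat / v_hat  # endpoint estimate h + sigma/v
print(v_hat, M_hat)
```

Here the excess distribution over h is exactly generalized Pareto with v = 1/α = 0.5, so both v_hat and M_hat come out close to their true values.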
7.2.2 Estimation of α
The estimation of the tail index α is a major task in drawing statistical inference about M. It is also important in some other tail estimation problems and is often stated not only in connection with the extreme value distribution Ψ_α but for all three extreme value distributions (for references, see Smith (1987)).
A number of estimators of α are known, cf. Csorgo et al. (1985), Smith (1987), the above mentioned maximum likelihood estimators, as well as the formulas (7.2.9) and (7.2.10), where N→∞, m→∞, k→∞, m < k, k/N→0. The estimator (7.2.9) was proposed by Pickands (1975) and thoroughly investigated e.g. by Dekkers and de Haan (1987). (7.2.10) was proposed by Weiss (1971), who also formulated some of its asymptotic properties. Below we derive some more general results concerning (7.2.10) and modify it to reduce its bias.
Theorem 7.2.1. Let condition a) of Section 7.1 hold, α > 1, N→∞, k→∞, k/N→0, m/k→τ, where 0 < τ < 1. Then the estimator (7.2.10) for α is consistent, asymptotically unbiased, and the relation (7.2.11) holds for k→∞.
Proof. According to (7.1.8), M − η_{N−i} ~ (M − θ_N) μ_i^{1/α} for i ≤ k, N→∞, where the random variable μ_i has the density x^i e^{−x}/Γ(i+1). Using this approximation, we have for N→∞, k→∞, k/N→0, m/k→τ

E α̂^{−1} ≈ (−log τ)^{−1} E log( ((μ_k/k)^{1/α} − (μ₀/k)^{1/α}) / ((μ_m/k)^{1/α} − (μ₀/k)^{1/α}) ),

and the expansion of the right-hand side leads to terms of the form α^{−2}(k−m) C_m S²/k², where

S² = ∫_0^∞ h(t) exp{k s(t)} dt,   h(t) = (1 + C₁ log t)².

The saddle-point approximation gives an asymptotic expression for S², in which

a₁ = √(2π) (1/τ − 1)^{3/2} / log² τ.

These expressions lead to (7.2.11), which in turn implies the consistency of (7.2.10). The theorem is proved.
Note that the function v_τ = (1−τ)/(τ log² τ) attains its minimal value (≈ 1.544) at τ₀ ≈ 0.2032; therefore, to approach the minimal asymptotic variance of the estimator (7.2.10), one has to choose m ~ k/5 for k→∞.
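The stated minimizer of v_τ = (1−τ)/(τ log²τ) is easy to confirm numerically (an illustrative check; the bracketing interval is an assumption):

```python
import math

def v(tau):
    return (1 - tau) / (tau * math.log(tau) ** 2)

# golden-section search for the minimum of v on (0, 1)
lo, hi = 0.05, 0.60
g = (math.sqrt(5) - 1) / 2
a, b = hi - g * (hi - lo), lo + g * (hi - lo)
for _ in range(200):
    if v(a) < v(b):
        hi, b = b, a
        a = hi - g * (hi - lo)
    else:
        lo, a = a, b
        b = lo + g * (hi - lo)
tau0 = (lo + hi) / 2
print(tau0, v(tau0))   # about 0.2032 and 1.544
```

Setting the derivative to zero gives the equation log τ + 2(1−τ) = 0, whose root is the τ₀ ≈ 0.2032 quoted in the text.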
Comparing (7.2.11) for τ = 0.2 with (7.2.3), one can deduce that for α ≥ 2.25 the estimator (7.2.10) of the tail index α has smaller asymptotic variance than the maximum likelihood estimator (7.2.1).
Naturally, the estimator (7.2.10) would be better if the exact optimum value M were substituted for η_N. Since M is unknown, η_N replaces it in (7.2.10); this introduces a bias into the estimate for any fixed k and m. To reduce the bias, let us use the estimate (7.2.13), where α̂ is the Weiss estimator (7.2.10). Numerical investigations indicate that for moderate values of k the estimator (7.2.13) is more accurate than (7.2.10), and the accuracy difference increases as α increases.
For suitably large N and k, in constructing confidence intervals and testing statistical hypotheses for α one can apply the results of Section 7.2.1 on the asymptotic normality of maximum likelihood estimators. We shall consider another approach, based on the asymptotic properties of the estimator (7.2.10), which seems to be applicable also for moderately large values of k.
Proposition 7.2.1. Let the conditions of Theorem 7.2.1 be fulfilled. Then the corresponding sequence of random variables, where α̂ is defined by (7.2.10), converges in distribution to the random variable with the c.d.f.

F_k(t) = 0   for t ≤ 0,
F_k(t) = t^{m+1} Σ_{i=0}^{k−m−1} C^i_{m+i} (1−t)^i   for 0 < t < 1,    (7.2.15)
F_k(t) = 1   for t ≥ 1.

Proof. For 0 < t < 1 the limiting distribution function equals

( k! / (m! (k−m−1)!) ) ∫_{1/t}^∞ x^{−k−1} (x−1)^{k−m−1} dx,

where, as earlier, μ_i = ξ₀ + ... + ξ_i. Multiple integration by parts gives (7.2.15); the proposition is proved.
The statement implies that the asymptotic level of the confidence interval (7.2.16) for α equals δ − γ (0 ≤ γ ≤ δ ≤ 1), where t_{k,δ} denotes the δ-quantile of the c.d.f. (7.2.15); for illustration, some quantile values are given in Table 5.
[Table 5: quantiles t_{k,δ} of the c.d.f. (7.2.15) for k = 5, 10, 20, 50.]
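Assuming the reconstruction of (7.2.15) with the sum starting at i = 0 (so that F_k is the Beta(m+1, k−m) c.d.f. of the limiting ratio μ_m/μ_k), quantiles of the kind tabulated in Table 5 can be recomputed; the values below are illustrative, not a reproduction of the table:

```python
from math import comb

def F(t, k, m):
    """Reconstructed c.d.f. (7.2.15): t^(m+1) * sum_{i=0}^{k-m-1} C(m+i, i)(1-t)^i."""
    if t <= 0:
        return 0.0
    if t >= 1:
        return 1.0
    return t ** (m + 1) * sum(comb(m + i, i) * (1 - t) ** i
                              for i in range(k - m))

def quantile(delta, k, m, tol=1e-12):
    # bisection: F is continuous and increasing on [0, 1]
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if F(mid, k, m) < delta:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# e.g. k = 10, m = 2 (roughly m ~ k/5): lower and upper 5% quantiles
print(quantile(0.05, 10, 2), quantile(0.95, 10, 2))
```

Sanity checks: for k = 1, m = 0 the formula reduces to F(t) = t (the uniform law), and for k = 3, m = 1 it gives the Beta(2,2) c.d.f. 3t² − 2t³.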
In a standard manner, the confidence interval (7.2.16) may serve as the basis for constructing statistical hypothesis tests concerning α. The rejection region of the test for the hypothesis H₀: α ≥ α₀ against the alternative H₁: α < α₀, and, analogously, the rejection region of the test for H₀: α = α₀ versus H₁: α = α₁ > α₀, are constructed from the quantiles t_{k,δ} for k→∞.
Theorem 7.3.1. Let condition a) of Section 7.1 hold, N→∞, k→∞, k²/N→0, α > 1, and let α be known. Then for the optimal linear estimates M_{N,k}, determined by the vectors (7.1.21) and (7.1.22), the asymptotic equality (7.3.1) holds, where

σ²_{N,k} = (M − θ_N)² (1 − α/2)/Γ(1+2/α)   for 1 < α < 2,
σ²_{N,k} = (M − θ_N)² (1 − 2/α) k^{−(1−2/α)}   for α > 2,    (7.3.2)
σ²_{N,k} = (M − θ_N)² / log k   for α = 2,

and (7.3.3) holds, i.e. the sequences (M − M_{N,k})/σ_{N,k} are asymptotically Gaussian with zero mean and unit variance.
The theorem will be proved in Section 7.3.3. (The proof was done in collaboration with M. V. Kondratovich.)
Theorem 7.3.2. The components a_i* and a_i⁺ of the vectors (7.1.21) and (7.1.22) are representable as

a_i* = u_i/A   for i = 0, 1, ..., k,
a_i⁺ = u_i/(A − B)   for i = 0, 1, ..., k−1,
a_k⁺ = (u_k − B)/(A − B),

where α > 0,

u₀ = (α+1)/Γ(1+2/α),
u_i = (α−1) Γ(i+1)/Γ(i+1+2/α)   for i = 1, ..., k−1,
u_k = −(αk+1) Γ(k+1)/Γ(k+1+2/α),

A = Σ_{i=0}^{k} Γ(i+1)/Γ(i+1+2/α)   ( = Σ_{i=0}^{k} 1/(i+1) for α = 2 ),

and B = Γ(k+1+1/α)/Γ(k+1+2/α).
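With u_i, A and B reconstructed as above (the exact forms of A and B are partly conjectural here), one useful consistency check is that the weights a_i* and a_i⁺ sum to one, as the coefficients of a linear estimator of M must. A numerical sketch:

```python
from math import gamma

def coefficients(alpha, k):
    """Reconstructed coefficients of the optimal linear estimators (Theorem 7.3.2)."""
    u = [0.0] * (k + 1)
    u[0] = (alpha + 1) / gamma(1 + 2 / alpha)
    for i in range(1, k):
        u[i] = (alpha - 1) * gamma(i + 1) / gamma(i + 1 + 2 / alpha)
    u[k] = -(alpha * k + 1) * gamma(k + 1) / gamma(k + 1 + 2 / alpha)
    A = sum(gamma(i + 1) / gamma(i + 1 + 2 / alpha) for i in range(k + 1))
    B = gamma(k + 1 + 1 / alpha) / gamma(k + 1 + 2 / alpha)
    a_star = [ui / A for ui in u]
    a_plus = [ui / (A - B) for ui in u[:-1]] + [(u[k] - B) / (A - B)]
    return a_star, a_plus

a_star, a_plus = coefficients(3.0, 5)
print(sum(a_star), sum(a_plus))   # both close to 1
```

The identity Σ u_i = A behind this check can be verified by telescoping the recursion (αi+2)c_i = αi c_{i−1} for c_i = Γ(i+1)/Γ(i+1+2/α).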
7.3.2 Auxiliary statements and proofs of Theorem 7.3.2 and Proposition 7.1.3
In this section all matrices have order (k+1)×(k+1), all vectors belong to R^{k+1}, λ = (1,...,1)', |·| denotes the determinant, and the abbreviation

r_{i,j} = Γ(i+1+j/α)    (7.3.4)

is used.
Lemma 7.3.1. Let z, d₀, ..., d_k be vectors in R^{k+1} and let D = ||d₀, ..., d_k|| be a nondegenerate matrix. Then the relation (7.3.5) holds, where D_z = ||d₀+z, ..., d_k+z||.

Proof. Let D_i denote the matrix obtained from D by substituting z for its i-th column. By Cramer's rule, the solution x = D⁻¹z of the system Dx = z has the components x_i = |D_i|/|D|, so that

λ'D⁻¹z = λ'x = Σ_{i=0}^{k} |D_i|/|D|.

On the other hand, expanding the determinant |D_z| by multilinearity (every term containing z in two or more columns vanishes),

|D_z| = |d₀+z, d₁+z, ..., d_k+z| = |D| + Σ_{i=0}^{k} |D_i|.

Hence

λ'D⁻¹z = Σ_{i=0}^{k} |D_i|/|D| = |D_z|/|D| − 1.    (7.3.6)

Now transform the determinant |D_z| in another way, subtracting from each column the previous one, beginning with the last column. This, together with (7.3.6), gives the desired relation: the lemma is proved.
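The determinant identity λ'D⁻¹z = |D_z|/|D| − 1 used in this proof is exact and can be confirmed numerically on a random small matrix (a check of the reconstruction, using naive Laplace expansion and Cramer's rule):

```python
import random

def det(M):
    # Laplace expansion along the first row (fine for small matrices)
    n = len(M)
    if n == 1:
        return M[0][0]
    return sum((-1) ** j * M[0][j] * det([row[:j] + row[j + 1:] for row in M[1:]])
               for j in range(n))

def solve(D, z):
    # Cramer's rule: x_i = |D_i|/|D|, the i-th column of D replaced by z
    d = det(D)
    n = len(D)
    return [det([[z[r] if c == i else D[r][c] for c in range(n)]
                 for r in range(n)]) / d for i in range(n)]

random.seed(1)
n = 4
D = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
z = [random.uniform(-1, 1) for _ in range(n)]

lhs = sum(solve(D, z))                                  # lambda' D^{-1} z
Dz = [[D[r][c] + z[r] for c in range(n)] for r in range(n)]  # z added to each column
rhs = det(Dz) / det(D) - 1
print(lhs, rhs)
```

Both quantities agree up to floating-point error.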
Lemma 7.3.2. Let the vectors x = (x₀,...,x_k)' and y = (y₀,...,y_k)' consist of positive numbers and let the elements d_ij of the symmetric matrix D_k be determined by x and y through (7.1.14). Then

|D_k| = μ_k |D_{k−1}|,

where μ_k = x_k(x_{k−1}y_k − x_k y_{k−1})/x_{k−1}.

Proof. Multiplying the last but one row by x_k/x_{k−1} and subtracting it from the last one, we obtain a determinant whose last row is (0, ..., 0, μ_k); expansion along this row gives |D_k| = μ_k |D_{k−1}|.
Lemma 7.3.3. Let z = (z₀,...,z_k)' ∈ R^{k+1} and let d₀,...,d_k be the columns of the matrix D_k defined in Lemma 7.3.2. Then the relation (7.3.7) holds, where

μ_i = x_i (x_{i−1}y_i − x_i y_{i−1})/x_{i−1},
φ_i = z_i − z_{i−1} x_i/x_{i−1},
ν_i = (x_{i−1}y_{i−2} − x_{i−2}y_{i−1})(x_i − x_{i−1})/(x_{i−1} − x_{i−2}),
δ = x₀(x₁y₀ − x₀y₀) / ( x₁(x₀y₁ − x₁y₀) ).

Proof. Multiplying the last but one row by x_k/x_{k−1} and subtracting it from the last one, we obtain a determinant Δ_k'. Multiplying the last but one column of Δ_k' by (x_k − x_{k−1})/(x_{k−1} − x_{k−2}) and subtracting it from the last one, and continuing this process, we arrive at (7.3.7).
Lemmas 7.3.1 – 7.3.3 will be used for investigating the case when k ≥ 2, α > 0, and the vectors x and y consist of the numbers

x_i = r_{i,2}/r_{i,1},   y_i = r_{i,1}/r_{i,0},   i = 0, 1, ..., k,

where the symbol r_{i,m} is determined by (7.3.4). In this case (7.1.14) for i ≥ j defines the elements of the matrix D_k, which will be denoted by A or A_k. (The matrix A is the same as in Proposition 7.1.2.)
Lemma 7.3.4. The relation (7.3.9) holds for the matrix A.

Proof. According to Lemma 7.3.1, we have λ'A⁻¹λ = Δ_k/|A|, where the first column of the determinant Δ_k is z = λ = (1,...,1)'. By Lemma 7.3.2, we know that |A_k| = μ_k |A_{k−1}|. Taking into account (7.3.8) and that z = λ, we simplify the expressions for μ_i, φ_i, ν_i (i = 0,...,k) and δ of Lemma 7.3.3. This way,

(−1)^j φ_j ∏_{i=2}^{j} (ν_i/μ_i) = r_{j,0} r_{1,2} / ((α+1) r_{j,2}),

and

λ'A⁻¹λ = Σ_{i=0}^{k} r_{i,0}/r_{i,2}.

Now (7.3.9) can be inductively deduced from this relation: the lemma is proved.
Lemma 7.3.5. For b = (b₀,...,b_k)', where b_i = r_{i,1}/r_{i,0}, we have

λ'A⁻¹b = r_{k,1}/r_{k,2}.    (7.3.10)

Proof. By Lemma 7.3.1, λ'A⁻¹b = Δ_k/|A|, where the first column of Δ_k equals z = b. The expressions for μ_i, ν_i, δ are as in the proof of Lemma 7.3.4, and the computation gives

λ'A⁻¹b = r_{0,1}/r_{0,2} − α⁻¹ Σ_{i=1}^{k} Γ(i+1/α)/r_{i,2},

which telescopes to (7.3.10).

Lemma 7.3.6. For the same vector b,

b'A⁻¹b = r²_{k,1} / (r_{k,0} r_{k,2}).    (7.3.11)

Proof. Let us represent the vector b as b = Bλ, where B is the diagonal matrix with the diagonal elements b₀,...,b_k. We have

b'A⁻¹b = λ'BA⁻¹Bλ = λ'(B⁻¹AB⁻¹)⁻¹λ.

The matrix D_k = B⁻¹AB⁻¹ is symmetric and its elements equal d_ij = x_i' y_j' for i ≥ j, where

x_i' = x_i/b_i = r_{i,2} r_{i,0}/r²_{i,1},   y_j' = y_j/b_j = 1.
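The telescoping identity in the proof of Lemma 7.3.5 is a pure gamma-function identity and is easy to confirm numerically (assuming the reconstruction of the sum):

```python
from math import gamma

def r(i, j, alpha):
    return gamma(i + 1 + j / alpha)   # abbreviation (7.3.4)

alpha, k = 2.5, 12

# r_{0,1}/r_{0,2} - (1/alpha) * sum_{i=1}^{k} Gamma(i + 1/alpha)/r_{i,2}
lhs = r(0, 1, alpha) / r(0, 2, alpha) - (1 / alpha) * sum(
    gamma(i + 1 / alpha) / r(i, 2, alpha) for i in range(1, k + 1))
rhs = r(k, 1, alpha) / r(k, 2, alpha)   # the closed form (7.3.10)
print(lhs, rhs)
```

Each partial sum differs from the next closed form by exactly one term, which is the induction step behind (7.3.10).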
Proof of Proposition 7.1.3. Lemma 7.3.2 gives |A_k| = μ_k |A_{k−1}|, where μ_k admits an explicit asymptotic expression for k→∞.
We have a_i* = u_i/A for i = 0,1,...,k, where A = λ'A⁻¹λ is calculated by (7.3.9). Represent u_i as

u_i = λ'A⁻¹e_i,

where all components of the i-th coordinate vector e_i are zero except the i-th, which equals 1. Applying Lemmas 7.3.1 – 7.3.3 with D_k = A and z = e_i, one obtains the expressions for a_i*.
Turn now to the vector a⁺ and represent a_i⁺ in the form (7.3.12). Since the last column of A is proportional to b, all the components β_i equal zero except β_k:

β_i = 0   for i = 0, ..., k−1.    (7.3.13)

The expressions for a_i⁺ follow from the expressions derived for a_i* and from (7.3.9) – (7.3.12). The theorem is proved.
We first verify that

E(M*_{N,k} − M)² ~ σ²_{N,k}   for N→∞, k→∞, k²/N→0,    (7.3.14)

where σ²_{N,k} is as in (7.3.2); that is, (7.3.1) and (7.3.2) hold for the estimate M*_{N,k} determined by the vector (7.1.21).
Turn now to the estimate M⁺_{N,k} determined by (7.1.22). By (7.1.22), a⁺ = (A⁻¹λ − β)/q_k, where

q_k = λ'A⁻¹λ − (b'A⁻¹λ)² / (b'A⁻¹b).

Using (7.3.9), (7.3.10), (7.3.11) and the Stirling approximation for k→∞, we obtain

λ'A⁻¹b = r_{k,1}/r_{k,2} ~ k^{−1/α},
b'A⁻¹b = r²_{k,1}/(r_{k,0} r_{k,2}) ~ 1.

With the help of (7.3.13), we get a* ~ a⁺ for k→∞; this yields (7.3.1) and (7.3.2) for the estimate M⁺_{N,k}.
Let us now prove the asymptotic normality of the optimal linear estimates. Set

D_{N,k} = Σ_{i=0}^{k} a_i μ_i^{1/α},

where either a_i = a_i* or a_i = a_i⁺ are the coefficients of the vectors (7.1.21) and (7.1.22), respectively; then, by (7.1.8), M − M_{N,k} ~ (M − θ_N) D_{N,k}. Here ξ₀, ξ₁, ... are independent and have the density e^{−x}, x > 0, and μ_i = ξ₀ + ... + ξ_i.
Set

v_i = ξ_i − 1,   u_i = Σ_{j=0}^{i} v_j   for i = 0, 1, ...,

so that μ_i = (i+1) + u_i. Then u_i/(i+1) → 0 a.s. for i→∞, and consequently

(1 + u_i/(i+1))^{1/α} = 1 + u_i/(α(i+1)) + O((u_i/(i+1))²)   for i→∞ a.s.

This, together with a_i → 0 for any fixed i and k→∞, yields

D_{N,k} ~ Σ_{i=0}^{k} a_i (i+1)^{1/α} + α⁻¹ Σ_{i=0}^{k} a_i u_i (i+1)^{−1+1/α}.
By the Stirling approximation, r_{i,1}/r_{i,0} ~ (i+1)^{1/α} for i→∞. Therefore, using once more the relation a_i → 0 for k→∞, we obtain

Σ_{i=0}^{k} a_i (i+1)^{1/α} = Σ_{i=0}^{k} a_i r_{i,1}/r_{i,0} + Σ_{i=0}^{k} a_i ( (i+1)^{1/α} − r_{i,1}/r_{i,0} ),

where the second sum tends to zero, and

α⁻¹ Σ_{i=0}^{k} a_i u_i (i+1)^{−1+1/α} = Σ_{i=0}^{k} v_i s_i,

where

s_i = α⁻¹ Σ_{j=i}^{k} a_j (j+1)^{−1+1/α}.
Using the expression for a_i = a_i* obtained in Theorem 7.3.2 (for a_i = a_i⁺ the expressions are asymptotically equivalent), we can derive the asymptotic forms of s_i (i = 0,1,...,k) for k→∞; in particular, for α = 2

s₀ ~ 1/log k,   s_i ~ (i^{−1/2} − k^{−1/2})/log k   for i = 1, 2, ...   (k→∞).
We have

Σ_{i=0}^{k} E|v_i s_i|⁴ = Σ_{i=0}^{k} s_i⁴ E v_i⁴ = 9 Σ_{i=0}^{k} s_i⁴.

Setting B_k² = Σ_{i=0}^{k} s_i², we obtain

B_k² ~ (1 − 2/α) k^{−1+2/α}   for α > 2,
B_k² ~ 1/log k   for α = 2,

while, in the corresponding cases,

Σ_{i=0}^{k} s_i⁴ ~ (6 + 4/Γ(1+2/3)) k^{−4/3},
Σ_{i=0}^{k} s_i⁴ ~ (3/2) k^{−2} log k,
Σ_{i=0}^{k} s_i⁴ ~ 8/log³ k.

In each case Σ_{i=0}^{k} s_i⁴ / B_k⁴ → 0 for k→∞, so the Lyapunov condition for the sums Σ v_i s_i is satisfied.
Let us remark that there is another method of proving the asymptotic normality (7.3.3), i.e. the second part of Theorem 7.3.1. This method consists in referring to the asymptotic normality result of Csorgo and Mason (1989) and noting that the optimal linear estimators M*_{N,k} and M⁺_{N,k} asymptotically (as k→∞) coincide with the Csorgo–Mason estimators, which are determined by the coefficient vectors a with components (7.1.27).
Indeed, the asymptotic coincidence of the estimators M*_{N,k} and M⁺_{N,k} was established in the proof of Theorem 7.3.1. Further, applying (7.1.27) and Theorem 7.3.2, which contain explicit expressions for the components a_i and a_i* of the Csorgo–Mason and optimal linear estimators, respectively, we obtain for k→∞, α > 2:

a₀* ~ (α+1)(α−2) k^{2/α−1} / (α Γ(1+2/α)),
a_k* → 2 − α,
a_i* − a_i ~ (α−1)(1 − 2/α) i^{−2/α} k^{2/α−1}   for i < k, i→∞,

where ψ is the psi-function and δ is defined by δ log(1+1/δ) = 1. Analogously, for α = 2, k→∞ we have

a₀* ~ 3/log k,
a_k − a_k* ~ −2/log k.
Bearing in mind the asymptotic expressions presented above, let us introduce a new, very simple, linear estimator, determining its coefficients by the formulas: for 1 < α < 2

a_i = (α−1)(1 − α/2) i^{−2/α} Γ(1+2/α),   0 < i < k;

for α > 2 the coefficients a_i (0 < i < k) are defined analogously, with

a_k = 2 − α.

For α = 2 the expressions for a_i* are already so simple that further simplification is hardly possible; thus, we set a_i = a_i* for α = 2, 0 ≤ i ≤ k.
The discussion concerning the Csorgo–Mason and optimal linear estimators leads to the conjecture that the linear estimator defined above has the asymptotic properties (7.3.1) and (7.3.3), i.e. it is also asymptotically normal and efficient.
CHAPTER 8. SEVERAL PROBLEMS CONNECTED WITH GLOBAL RANDOM SEARCH

As was pointed out in Sections 2.3 and 4.1, the theory of global random search is connected with many important branches of mathematical statistics and computational mathematics. Two of them were already studied in Section 5.3 and Chapter 7; three others will be treated in this chapter (note that their connection with global random search was highlighted in Section 4.1).
x_{k+1} = x_k − γ_k s_k,    (8.1.1)

where k = 1, 2, ... is the iteration number, x₁ is a given initial point, x₁, x₂, ... is the sequence of points in X ⊆ Rⁿ generated to approach a local minimizer of f, γ₁, γ₂, ... are nonnegative numbers called the step lengths, and s₁, s₂, ... are random vectors in Rⁿ called the search directions. To construct each subsequent point in (8.1.1), one uses the results of evaluations of the regression function f (i.e. the random values y(x) = f(x) + ξ(x), where Eξ(x) = 0) at the preceding points of the sequence, and possibly also at some auxiliary points. As usual, we shall suppose that different evaluations of f produce independent random values.
The majority of works devoted to the local optimization of a regression function studies the asymptotic characteristics of algorithms of the type (8.1.1). The extremal experiment design deals with the one-step characteristics of the algorithms, rather than with their asymptotic properties.
The extremal experiment algorithms (the simplex method of Nelder and Mead (1965) and steepest descent are probably the most popular of them) have been developed and applied for the optimization of real objects (even in the absence of computers) and thus have a number of properties that distinguish them from the stochastic approximation type adaptive algorithms. The peculiarities of most extremal experiment algorithms are due to the inclusion of the following elements in each of their steps: (i) statistical inference (usually linear regression analysis for constructing and investigating a local first or second degree polynomial model of f); (ii) specific experimental design for selecting auxiliary points at which to evaluate f (the design criteria are chosen among the following: symmetry, orthogonality, saturation, rotatability, simplicity of construction, optimality in some appropriate sense, etc.); (iii) selection of the search direction in accordance with the regression models constructed (the least squares estimate of the gradient of f at x_k is customarily used as the search direction); (iv) selection of the step length via evaluating f at several auxiliary points along the chosen direction.
As for the step length selection rules, they are thoroughly discussed in many works; see for instance Ermakov et al. (1983). The procedure of Myers and Khuri (1979) and its modification studied in Zhigljavsky (1985) seem to be the most promising and recommendable. Below we shall deal only with the search direction construction problem.
for j = 1,...,N, where z_{1j},...,z_{nj} are the coordinates of the point z_j. (Note that interesting results concerning the problem of choosing x_{k+1} by applying a second order polynomial model of f are derived by Mandal (1981, 1989).) Second, the k-th step design, i.e. the point selection {z₁,...,z_N}, is symmetrical, that is, the equality

Σ_{j=1}^{N} z_{ij} = 0    (8.1.3)

holds.
of unknown parameters. If N > n and rank Z = n, then θ can be estimated by the least squares estimator

θ̂ = (ZZ')⁻¹ZY.    (8.1.4)

Here Z = ||z_{ij}|| (i = 1,...,n; j = 1,...,N) is the design matrix and Y = (y₁,...,y_N)' is the vector of evaluation results, in accordance with (8.1.2) consisting of the elements

y_j = θ₀ + Σ_{i=1}^{n} θ_i z_{ij} + ξ_j,   j = 1, ..., N.    (8.1.5)
We shall maximize (8.1.6) with respect to s in the class of linear statistics of the form s = AY, where A belongs to the set A of matrices of order n×N and Y consists of the evaluation results (8.1.5).
Set

t(A) = t(A,θ) = θ's(A),   η(A) = η(A,θ) = (t(A) − E t(A)) (var t(A))^{−1/2}.

Then

Pr{t(A) > 0} = Pr{ η(A) > −(var t(A))^{−1/2} E t(A) }.    (8.1.9)

Under fixed A ∈ A and θ ∈ Rⁿ, the random variable η(A,θ) is a linear combination of the random variables ξ₁,...,ξ_N, having zero mean and unit variance. If the ξ_j are Gaussian, then η(A,θ) follows a Gaussian distribution as well, and the probability (8.1.9) is completely determined by the magnitude

K(A) = K(A,θ) = (var t(A))^{−1/2} E t(A).    (8.1.10)
1/2
max K(A) = a-1(S'ZZ'S) (8.1.12)
AEA.
Proof. We have

E t(A) = θ'AZ'θ,   var t(A) = σ² θ'AA'θ,

and set

τ(A) = σ² K²(A) = (θ'AZ'θ)² / (θ'AA'θ).

For matrices A, E ∈ A, using θ'EA'θ = θ'AE'θ, let us compute the derivative of τ at the point (i.e. matrix) A in the direction E:

∂τ(A)/∂E = lim_{a→0} [τ(A + aE) − τ(A)]/a.

Let us now characterize each matrix A for which this derivative equals zero for all E ∈ A. The equality ∂τ(A)/∂E = 0 for all E ∈ A is equivalent to the statement that either

θ'AZ'θ = 0    (8.1.13)

or (8.1.14) takes place. If (8.1.13) holds, then τ(A) = 0 and thus τ is minimized; therefore we shall assume that (8.1.13) does not hold, i.e. θ'AZ'θ ≠ 0. The equality (8.1.14) is equivalent to A'θ = cZ'θ, where

c = θ'AA'θ / (θ'AZ'θ)

is a positive number. Substituting cZ'θ for A'θ in the expression for τ(A), one obtains that if (8.1.14) is valid for some A ∈ A and θ ≠ 0, then τ(A) = θ'ZZ'θ, the maximal value of τ under a fixed θ. This value is obviously attained for A = Z. The theorem is proved.
The theorem implies that the optimal search direction, in the above sense, is given by

s* = ZY.    (8.1.15)

The advantages of s* compared to s = θ̂, i.e. the least squares estimate (8.1.4) of θ, are two-fold: it is much simpler to compute and can be used also for N ≤ n; moreover, the search direction (8.1.15) is only slightly sensitive to violations of the linear regression model (8.1.5). In fact, (8.1.15) is nothing but a cubature formula for estimating the vector

∫_U z f(z) dz,

which converges to a = ∇f(x_k) when the size of U asymptotically decreases.
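As an illustration of the statistic s* = ZY on a symmetric design (the quadratic objective, the design points and the noise level below are all hypothetical choices for the example), one can check that s* points along the gradient of f at x_k:

```python
import random

random.seed(1)
n, delta, noise = 3, 0.1, 0.01
x_opt = [0.3, -0.7, 1.2]
x_k = [1.3, -2.7, 1.7]                          # current point

def f(x):                                        # hypothetical objective
    return sum((xi - oi) ** 2 for xi, oi in zip(x, x_opt))

grad = [2 * (xi - oi) for xi, oi in zip(x_k, x_opt)]   # true gradient at x_k

# symmetric design: +-delta along each coordinate, so sum_j z_j = 0, cf. (8.1.3)
Z = []
for i in range(n):
    for sign in (+1, -1):
        z = [0.0] * n
        z[i] = sign * delta
        Z.append(z)

Y = [f([a + b for a, b in zip(x_k, z)]) + random.gauss(0, noise) for z in Z]

# s* = ZY, i.e. sum_j y_j z_j, cf. (8.1.15)
s = [sum(z[i] * y for z, y in zip(Z, Y)) for i in range(n)]
cos = (sum(a * b for a, b in zip(s, grad))
       / (sum(a * a for a in s) ** 0.5 * sum(b * b for b in grad) ** 0.5))
print(cos)
```

For this design Σ_j z_j z_j' = 2δ²I, so s* ≈ 2δ²∇f(x_k) up to noise, and the direction cosine is close to one.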
Theorem 8.1.2 yields that if the search direction s = s_k is chosen at each step of the algorithm (8.1.1) by (8.1.15), then the experimental design (i.e. the selection rule for the points z₁,...,z_N in U) is to be selected so that the values θ'ZZ'θ are as large as possible, for various θ. Let us formulate the design problem in its standard form for regression design theory (see e.g. Ermakov and Zhigljavsky (1987)).
Let ε be an arbitrary approximate design, i.e. a probability measure on U, and set

M(ε) = ∫_U zz' ε(dz).    (8.1.16)

For the empirical design ε_N corresponding to evaluations of f at the points z₁,...,z_N, the matrix (8.1.16) equals

M(ε_N) = N⁻¹ ZZ' = N⁻¹ Σ_{j=1}^{N} z_j z_j'.
Considering the design problem in the set of all approximate designs, we have the class of criteria Φ_θ(ε) = θ'M(ε)θ depending on the unknown parameter θ. Since the true value of θ is unknown, the true optimality criterion Φ_θ is unknown too. In a typical way, let us define the Bayesian criterion (8.1.17) by averaging Φ_θ over a prior measure ν(dθ) on a set Ω, and the worst-case criterion (8.1.18) by taking the minimum of Φ_θ over θ ∈ Ω. If prior information about θ is essentially not available, then it is natural to take the unit sphere S as Ω and the uniform probability measure on Ω as ν(dθ). Now, if Ω = S, then by virtue of the extremal properties of matrix eigenvalues, (8.1.18) is nothing but the minimal eigenvalue of M(ε); this way, Φ_M is the well-known E-optimality criterion of regression design theory. Thus, the optimal design problem for the criterion (8.1.18) is reduced to a problem of classical regression design theory.
Let us turn now to the Bayesian criterion (8.1.17). Denoting

L = ∫_Ω θθ' ν(dθ),

one can easily derive (see Zhigljavsky (1985)) from the equivalence theorems of regression design theory that the set of optimal designs with respect to the Bayesian criterion (8.1.17) consists of the designs concentrated on the set

{ arg max_{z∈U} z'Lz }.

Thus, the indicated design problem either is very easy or can be reduced to a standard problem.
has order m×m and in most cases is proportional to the covariance matrix of the estimator; g(x) = (g₁(x),...,g_m(x))' is a vector of linearly independent piecewise continuous functions from L₂(X,ν); P is the set of densities p such that ||D(p)|| < ∞; A is a matrix in the set N of positive semi-definite matrices of order m×m such that D(p) − A ∈ N for all p ∈ P (in some cases A is the zero matrix); Φ: N → R is a continuous convex functional.
The most well-known problem leading to (8.2.1) is the Monte Carlo simultaneous estimation of several integrals; it is formulated as follows. Let the m integrals

τ_k = ∫_X f_k(x) ν(dx),   k = 1, ..., m,    (8.2.3)

be estimated. The ordinary Monte Carlo estimates of the integrals (8.2.3) are constructed in the following way. First, a probability distribution P(dx) = p(x)ν(dx) on the measurable space (X,B) is chosen; here the density p = dP/dν is positive modulo ν on the set
X* = {x ∈ X: f₁²(x) + ... + f_m²(x) > 0}.

Then N independent elements x₁,...,x_N are generated from the distribution P. Finally, the integrals (8.2.3) are estimated by the formulas

τ̂_k = N⁻¹ Σ_{j=1}^{N} f_k(x_j)/p(x_j),   k = 1, ..., m.    (8.2.4)

Set τ = (τ₁,...,τ_m)', τ̂ = (τ̂₁,...,τ̂_m)'.

Proposition 8.2.1. If the density p = dP/dν is positive modulo ν on the set X*, then the estimators (8.2.4) are unbiased (i.e. E τ̂ = τ).

Proof. The unbiasedness of the estimators (8.2.4) is an evident result, well known in Monte Carlo theory. For the variances and covariances we have the relations

cov(τ̂_i, τ̂_j) = N⁻¹ ( ∫_{X*} f_i(x) f_j(x)/p(x) ν(dx) − τ_i τ_j ).
It follows from the proposition that the matrix (8.2.2) with A = ττ' represents the quality of the Monte Carlo estimators (8.2.4) as a function of the density p. The problem of optimal density selection was investigated earlier, in the above framework, for two optimality criteria depending only on the diagonal elements of the matrix D(p). Evans (1963) solved the problem for the criterion

Φ(B) = Σ_{i=1}^{m} a_i b_ii,    (8.2.5)

where the b_ii are the diagonal elements of a matrix B ∈ N and a₁,...,a_m are fixed nonnegative numbers; this problem is rather simple. Mikhailov (1984, 1987) solved the extremal problem (8.2.1) for the MV-optimality criterion

Φ(B) = max_{1≤i≤m} b_ii.    (8.2.6)
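The estimators (8.2.4) are the familiar importance-sampling estimators. A small self-contained sketch (the two integrands, the tilted density and the sample sizes are hypothetical choices for illustration) estimates two integrals simultaneously under two different densities p:

```python
import random

random.seed(0)

# Two integrals over [0,1] with nu = Lebesgue measure:
# tau_1 = int x dx = 1/2,  tau_2 = int x^2 dx = 1/3.
fs = [lambda x: x, lambda x: x * x]

def mc_estimates(p_density, p_sampler, N=200_000):
    """Estimators (8.2.4): tau_k ~ N^{-1} sum_j f_k(x_j)/p(x_j), x_j ~ p."""
    sums = [0.0] * len(fs)
    for _ in range(N):
        x = p_sampler()
        w = 1.0 / p_density(x)
        for i, fk in enumerate(fs):
            sums[i] += fk(x) * w
    return [s / N for s in sums]

# uniform density p(x) = 1:
est_uniform = mc_estimates(lambda x: 1.0, random.random)
# tilted density p(x) = 2x (sampled as the max of two uniforms):
est_tilted = mc_estimates(lambda x: 2 * x,
                          lambda: max(random.random(), random.random()))
print(est_uniform, est_tilted)
```

Both densities give unbiased estimates; they differ only in the covariance matrix D(p)/N, which is exactly what the optimality criteria (8.2.5), (8.2.6) compare. (For p(x) = 2x the first estimator even has zero variance, since f₁/p is constant.)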
Another problem of this kind arises when the evaluations of f are subject to random noise ξ(x) with

Eξ(x) = 0,   E ξ²(x) = σ²(x),

and the coefficients

τ_k = ∫_X f(x) f_k(x) ν(dx),   k = 1, ..., m,

of the function f with respect to a set of functions {f₁,...,f_m} are to be estimated. Suppose now that the points x₁,...,x_N are randomly and independently chosen, having the same distribution P(dx) = p(x)ν(dx), where the density p = dP/dν is positive on the set X*. If the natural Monte Carlo estimators (8.2.7) are used
for the τ_k, then one can prove (analogously to Proposition 8.2.1) that these estimators are unbiased and, further, that their covariance matrix is equal to

cov τ̂ = N⁻¹ D(p),

where D(p) is determined by (8.2.2), A = ττ' and

g_k(x) = (f²(x) + σ²(x))^{1/2} f_k(x),   k = 1, ..., m.
Therefore the optimal density choice problem is a particular case of the problem (8.2.1).
Let r(x) = f(x) − (τ₁ f₁(x) + ... + τ_m f_m(x)). If the least squares estimates are used instead of (8.2.7), then their covariance matrix is represented as follows:

cov τ̂ = N⁻¹ D(p) + o(N⁻¹),

where A = 0 and

g_k(x) = (r²(x) + σ²(x))^{1/2} f_k(x),   k = 1, ..., m

(see later Theorem 8.3.4). This way, we have arrived at the problem (8.2.1) again.
The density p is interpreted as the experimental design; the problem of its optimization is similar to the approximate optimal design problem of classical regression design theory. The main difference between these problems lies in the following: in the classical theory the experimental design stands in the numerator of the matrix elements and a convex functional of this matrix is minimized, while in the extremal problem (8.2.1) the design stands in the denominator of the elements. From the theoretical point of view, the problem (8.2.1) is a little more complicated than the regression optimal design problem; the main additional complexity concerns the existence of the optimal design.
The task of selecting the optimality criterion Φ is analogous to the corresponding task in classical regression design theory. The main difference between them lies in imposing stronger conditions on the optimality criterion in the problem (8.2.1) than the convexity and monotonicity required in regression design theory. Subsection 8.2.2 will describe these conditions.
Subsection 8.2.3 covers the existence and uniqueness problems; their solution is based on general convex analysis.
Subsection 8.2.4 presents the necessary and sufficient conditions for densities to be optimal. The basis for the results of this subsection is the equivalence theory for optimal regression design developed by J. Kiefer, J. Wolfowitz, V. Fedorov and others; its statements are also analogous to the equivalence theorems.
Subsection 8.2.5 describes algorithms for constructing optimal densities and the structure of optimal densities. Nondifferentiable MV- and E-optimality criteria are treated in Subsection 8.2.6.
Subsection 8.2.7 highlights the differences and similarities between classical regression design theory and the results presented here.
8.2.2 Assumptions
Below we shall suppose that the functional Φ: N → R is nonnegative, continuous, convex and increasing. The increase of Φ is understood as follows: if B, C ∈ N and B > C (or B ≥ C), then Φ(B) > Φ(C) or, respectively, Φ(B) ≥ Φ(C).
Let N̄ be the closure of the set N − A = {C − A, C ∈ N}, containing some matrices with infinite elements. Extend the functional Φ from N onto N̄, preserving its continuity and convexity. Suppose that this extension Φ: N̄ → R ∪ {+∞} has the property:
a) Φ(B) < ∞ for B ∈ N̄ if and only if all elements of B are finite.
We also suppose that the following simple condition is satisfied:
b) there exists a density p in P for which the matrix D(p) consists of finite elements only.
The above suppositions are required to hold everywhere. Many widely used criteria satisfy them, for instance the linear criterion

Φ(B) = tr LB,    (8.2.8)

where L is a fixed positive definite m×m matrix, the E-criterion

Φ(B) = λ_max(B),    (8.2.9)

where λ_max(B) is the maximal eigenvalue of the matrix B, and the so-called Φ_r-criterion

Φ(B) = (m⁻¹ tr B^r)^{1/r}    (8.2.10)

for 1 ≤ r < ∞. If r = 1, then (8.2.10) and (8.2.8) with L = I_m coincide (up to the constant factor m⁻¹). If r → ∞, then (8.2.10) converges to (8.2.9). On the contrary, if −∞ < r < 1, then according to Pukelsheim (1987) the criterion (8.2.10) is increasing and concave, which case is unsuitable here. The same is true for the well-known D-criterion

Φ(B) = (det B)^{1/m}.    (8.2.11)
In the sequel we shall also use the matrix of derivatives, consisting of the partial derivatives of Φ(B) with respect to the elements b_ij of the matrix B ∈ N.
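For a symmetric matrix B the Φ_r value depends only on the eigenvalues of B, which makes the limiting behaviour r → ∞ easy to see numerically (the normalization m⁻¹ is as in (8.2.10) reconstructed above; the eigenvalues are an arbitrary example):

```python
def phi_r(eigvals, r):
    """Phi_r criterion (8.2.10) evaluated through the eigenvalues of B:
    (m^{-1} sum_i lambda_i^r)^{1/r}."""
    m = len(eigvals)
    return (sum(lam ** r for lam in eigvals) / m) ** (1.0 / r)

lams = [3.0, 1.0, 0.5]          # eigenvalues of some B in N
print(phi_r(lams, 1))           # linear criterion: trace/m
print(phi_r(lams, 64))          # approaches lambda_max = 3 as r grows
```

Since Φ_r is a power mean of the eigenvalues, it is nondecreasing in r and converges to λ_max(B), i.e. to the E-criterion (8.2.9).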
All the above suppositions are related to the functional Φ. Besides them, we need two further assumptions concerning the functions g₁,...,g_m. It will be supposed throughout that they are piecewise continuous linearly independent functions from L₂(X,ν). Sometimes we shall also use the following condition of their ν-regularity:
c) the functions g₁,...,g_m are linearly independent on any measurable subset Z of the set X with ν(Z) > 0.
In a slightly different form, the ν-regularity condition c) was used by Ermakov (1975) when investigating random quadrature formulas.
Note also that we do not require anything more of the set X than its measurability.
Proposition 8.2.2. For all p, q from P and any 0 < t < 1 we have

t D(p) + (1−t) D(q) − D(tp + (1−t)q) = D(t,p,q),

where

D(t,p,q) = ∫_X u(x) g(x)g'(x) ν(dx)    (8.2.12)

and

u(x) = t(1−t) (p(x) − q(x))² / ( p(x) q(x) (t p(x) + (1−t) q(x)) ).    (8.2.13)

Proposition 8.2.3. If condition c) holds, then the matrix D(t,p,q) is positive definite for any t from (0,1) and all p, q from P with p ≠ q (modulo ν).

Proof. Rewrite (8.2.12) as

D(t,p,q) = ∫_Z u(x) g(x)g'(x) ν(dx),    (8.2.14)

where Z = {x ∈ X: u(x) > 0}. Since p ≠ q (modulo ν) and t ∈ (0,1), we have ν(Z) > 0. Using the supposition c), we obtain that the functions g₁,...,g_m are linearly independent on Z. Evidently, the functions

u^{1/2}(x) g_i(x),   i = 1, ..., m,

have the same property. The positive definiteness of the matrix (8.2.14) now follows directly from the definition of positive definiteness; the proof is completed.
Proposition 8.2.4. If the functional Φ is convex and increasing and the conditions a), b) are fulfilled, then the functional φ(p) = Φ(D(p)) is convex on L₁(X,ν). Besides, if either the functional Φ is strictly convex on N or the condition c) holds, then φ is a strictly convex functional on P.

Proof. Let p, q ∈ L₁(X,ν), p ≠ q (modulo ν), and t ∈ (0,1). We must verify the inequality

φ(tp + (1−t)q) ≤ t φ(p) + (1−t) φ(q).    (8.2.15)

By Proposition 8.2.2,

D(tp + (1−t)q) ≤ t D(p) + (1−t) D(q).    (8.2.16)

The increase of Φ then gives

Φ(D(tp + (1−t)q)) ≤ Φ(t D(p) + (1−t) D(q)),    (8.2.17)

and the convexity of Φ gives

Φ(t D(p) + (1−t) D(q)) ≤ t Φ(D(p)) + (1−t) Φ(D(q)).    (8.2.18)

Coupling the inequalities (8.2.17) and (8.2.18), we obtain (8.2.15), i.e. the convexity of φ.
The strict convexity of φ on P is equivalent to the strict inequality in (8.2.15) for p and q from P. This inequality follows from the strict inequality in either (8.2.17) or (8.2.18), or both. The strict inequality in (8.2.18) follows from the strict convexity of Φ. On the other hand, if the condition c) holds then, by virtue of Proposition 8.2.3, the strict inequality in (8.2.16) takes place; this way, by virtue of the increase of Φ, the strict inequality in (8.2.17) is valid. The proof is completed.
Now we are able to investigate the existence and uniqueness of the optimal density.
Theorem 8.2.1. If the functional Φ is continuous, convex, increasing and the conditions a), b) are fulfilled, then the optimal density p* in P exists.
Proof. The set

S = {p ∈ L1(X,v): p ≥ 0, ∫ p(x) v(dx) ≤ 1}

contains P, is bounded in norm and (due to the Fatou theorem) closed in measure.
Let us show that the functional φ is lower semicontinuous in measure, i.e. that the inequality

φ(p) ≤ lim inf_{i→∞} φ(p_i)   (8.2.19)

holds for any sequence p1, p2,... of elements of L1(X,v) which converges in measure v to an element p of L1(X,v).
Using the Fatou theorem, we have for any vector b ∈ R^m:

b' ( lim inf_{i→∞} D((p_i)_+) ) b = lim inf_{i→∞} ∫ [ b'g(x)g'(x)b / (p_i(x))_+ ] v(dx) − b'Ab ≥

≥ ∫ [ b'g(x)g'(x)b / (p(x))_+ ] v(dx) − b'Ab = b' D(p_+) b.

From this, together with the continuity and monotonicity of the functional Φ, we obtain the inequality (8.2.19).
Theorem 6 from §5, Chapter 10 of Kantorovich and Akilov (1977) states a generalized Weierstrass-type theorem, viz. that a convex functional on L1(X,v), lower semicontinuous in measure, attains its minimum value on any subset of L1(X,v) that is closed in measure and bounded in norm. Using Proposition 8.2.4 and the above results we get that the functional
300 Chapter 8
φ attains its minimum value in S at a certain point p* ∈ S. By a) and b) we have Φ(D(p*)) < ∞ and ||D(p*)|| < ∞. Let us show that ∫ p*(x) v(dx) = 1: this yields the inclusion p* ∈ P and the statement of the theorem. Assuming the contrary, let

r = ∫ p*(x) v(dx) < 1

and put q*(x) = p*(x)/r. We have q* ∈ P.
Since ||D(p*)|| < ∞ and the functions g1,...,gm are linearly independent on X, the functions

(p*(x))^{−1/2} g_i(x),   i = 1,...,m,

have the same property. Analogously to Proposition 8.2.3, we can see that the matrix D(p*) + A is positive definite. From the monotonicity of the functional Φ we get Φ(D(p*)) > Φ(D(q*)), but this inequality contradicts the optimality of p*. Consequently, r = 1 and p* ∈ P: the theorem is proved.
Let us turn now to the uniqueness of the optimal density.
Proposition 8.2.5. Let the conditions of Theorem 8.2.1 be satisfied and assume that either the functional Φ is strictly convex or the condition c) holds. Then the optimal density p* ∈ P exists and is unique modulo v.
This statement is a simple corollary of Theorem 8.2.1 and Proposition 8.2.4.
Theorem 8.2.2. Let the assumptions of Theorem 8.2.1 be fulfilled and the functional Φ be differentiable. Then a necessary and sufficient optimality condition, for a density p*, is the fulfilment of the equality

(8.2.20)

Proof. We shall use the necessary and sufficient condition of optimality for a convex functional, differentiable along all admissible directions, on a convex set (see e.g. Ermakov (1983), p.55 or Ermakov and Zhigljavsky (1987), p.105). Let us compute the derivative

(8.2.22)

and find the density p* in the set P such that Π(p*,h) ≥ 0 for all densities h on X. Simple calculations (see Zhigljavsky (1985)) yield

Π(p,h) = c(p) − ∫ ψ(x,p) h(x) v(dx).   (8.2.21)

So Π(p,h) ≥ 0 for all h, if and only if the inequality ψ(x,p) ≤ c(p) holds for v-almost all x in X. The statement of the theorem follows from the equality

∫ ψ(x,p) p(x) v(dx) = c(p),

which is easily verified and is valid for all densities p ∈ P. The theorem is proved.
Suppose now that the optimality criterion is nondifferentiable and of minimax type, i.e. it is expressed by

Φ(B) = max_{v∈V} Φ_v(B),   (8.2.23)

where V is a compact set and all functionals Φ_v are convex and differentiable.
Theorem 8.2.3. Let the functional Φ have the form (8.2.23), where all functionals Φ_v (v ∈ V) satisfy the assumptions of Theorem 8.2.2 and Φ_v(B) is continuous in v for any fixed matrix B from N. Then the fulfilment of the inequality

is a necessary and sufficient condition for the optimality of a density p*; here ψ_v and c_v are defined by (8.2.21), with the substitution of Φ_v for Φ.
The proof is analogous to the proof of Theorem 8.2.2 and uses the formula
where the derivative Π_v is defined in (8.2.22), with the substitution of Φ_v for Φ, and equals

Π_v(p,q) = c_v(p) − ∫ ψ_v(x,p) q(x) v(dx).
Note finally that Theorem 8.2.3 is also analogous to the corresponding statement in regression design theory.
Consider first the structure of optimal densities for the linear criterion (8.2.8). In this case

Φ'(B) = d tr(LB)/dB = L

and the optimality condition (8.2.20) is equivalent to

p*(x) = (g'(x) L g(x))^{1/2} / ∫ (g'(z) L g(z))^{1/2} v(dz).   (8.2.24)
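Once L and the functions g_i are fixed, the linearly optimal density (8.2.24) can be evaluated directly. A minimal numerical sketch on a discretized X = [0,1], where the g_i, the matrix L and the uniform measure v are all illustrative assumptions:

```python
import numpy as np

# Sketch of the linearly optimal density (8.2.24), p*(x) ~ (g'(x) L g(x))^{1/2},
# for the criterion Phi(B) = tr(LB).  The g_i and L below are illustrative.
x = np.linspace(0.0, 1.0, 1001)            # grid approximating X = [0, 1]
w = np.full_like(x, 1.0 / x.size)          # v(dx): uniform probability measure
g = np.vstack([np.ones_like(x), x, x**2])  # g_1, g_2, g_3 (linearly independent)
L = np.diag([1.0, 2.0, 0.5])               # positive definite weighting matrix

quad = np.einsum('ik,ij,jk->k', g, L, g)   # g(x)' L g(x) at each grid point
p_star = np.sqrt(quad)
p_star /= np.sum(p_star * w)               # normalize: \int p* dv = 1

assert abs(np.sum(p_star * w) - 1.0) < 1e-12   # p* is a density w.r.t. v
assert p_star.min() > 0.0
```

Since g'(x)Lg(x) > 0 whenever L is positive definite and g does not vanish, the resulting p* is positive everywhere, as the uniqueness claim below requires.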
It should be noted that the expression (8.2.24) can be obtained by simpler tools, using the Cauchy-Schwarz inequality, and that the linearly optimal density is always unique, modulo v. The following example presents a case when the linearly optimal density does not belong to the set P.
Example. Let the functional Φ have the form (8.2.5), m = 2, α1 = 1, α2 = 0. The condition a) does not hold and the optimal density has the form p* = |g1| / ∫ |g1| dv. If the function g1 vanishes on a subset Z of X with measure v(Z) > 0, but the function g2 does not vanish on this subset, then p* does not belong to P. At the same time, if g1 does not vanish on X and g2²/|g1| ∈ L1(X,v), then p* ∈ P.
Consider now the structure of the optimal density, and algorithms for its construction, for an arbitrary differentiable criterion.
If the functional Φ is differentiable, but cannot be represented as (8.2.8), then the optimal density has the form (8.2.24) again, although now the matrix

L = Φ'(D(p*))

is unknown and depends upon the optimal density p*. This is a simple corollary of Theorem 8.2.2.
Hence, the problem of optimal density construction may be considered as the problem of constructing the optimal matrix L. It can be solved by general global optimization
techniques. If the values Φ(B) depend on the diagonal elements of the matrices B only, then the matrix L is diagonal and the extremal problem is not very complicated.
If the number m is large and the set X is either discrete or has a small dimension, then the above algorithms may be less efficient than those described below. The latter use the features of the problem and are analogous to the construction methods of optimal regression design. They are pseudogradient-based algorithms in P, using the expression for the derivative Π.
The general form of these methods is

p_{k+1} = p_k + α_k (h_k − p_k),   k = 1, 2,....

Here p1 ∈ P is an initial density and α1, α2,... is a numerical sequence, the choice of which may be the same as in classical regression design. If α_k > 0, then the density h_k has to satisfy the inequality Π(p_k, h_k) ≤ 0: for example, one such density is proportional to the positivity indicator of the function

(8.2.26)

If α_k < 0, then the relations Π(p_k, h_k) ≥ 0 and p_{k+1} ≥ 0 are to be satisfied. The former is satisfied, for instance, for the density h_k proportional to the negativity indicator of the function (8.2.26) with ε_k ≤ 0; the latter relation is equivalent to the nonnegativity of the density p_{k+1}(x).
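The iteration above can be sketched numerically for the linear criterion, for which ψ(x,p) = g'(x)Lg(x)/p²(x) and c(p) = ∫ ψ(x,p) p(x) v(dx), with h_k proportional to the positivity indicator of ψ(·,p_k) − c(p_k). The functions g_i, the matrix L and the step sizes α_k = 1/(k+1) are illustrative assumptions:

```python
import numpy as np

x = np.linspace(0.01, 0.99, 500)
w = np.full_like(x, 1.0 / x.size)            # uniform measure v on the grid
g = np.vstack([np.ones_like(x), x])          # g_1, g_2 (illustrative)
L = np.diag([0.05, 5.0])                     # illustrative weighting matrix
q = np.einsum('ik,ij,jk->k', g, L, g)        # g'(x) L g(x) on the grid

def phi(p):                                  # tr(L D(p)) up to an additive constant
    return np.sum(q / p * w)

p = np.ones_like(x)                          # p_1: uniform initial density
phi_init = phi(p)
for k in range(1, 2001):
    psi = q / p**2                           # derivative kernel psi(x, p_k)
    c = np.sum(psi * p * w)                  # c(p_k)
    h = (psi > c).astype(float)              # positivity indicator of psi - c
    h /= np.sum(h * w)                       # normalize h_k to a density
    p = p + (1.0 / (k + 1)) * (h - p)        # step alpha_k = 1/(k+1)

assert abs(np.sum(p * w) - 1.0) < 1e-9       # iterates remain densities
assert p.min() > 0.0
assert phi(p) < phi_init                     # criterion improved toward (8.2.24)
```

Each update is a convex combination of densities, so the iterates stay in P automatically; this mirrors the vertex-direction algorithms of classical regression design.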
Unlike the case of differentiable criteria, the necessary and sufficient optimality condition for nondifferentiable criteria (see Theorem 8.2.3) is nonconstructive and cannot be used for the construction of optimal densities, but only for verifying their optimality. Nevertheless, for a large class of nondifferentiable criteria, we are able to show that the structure of optimal densities is the same as above (cf. (8.2.24)).
Theorem 8.2.4. Suppose that there exist a functional Φ and a sequence {Φ_i} of functionals on N which are convex, increasing, continuous, and for which the conditions a), b) are valid. Let also the functionals Φ_i be differentiable, the condition c) hold and

Φ_{i+1}(B) ≤ Φ_i(B),   i = 1, 2,...,   lim_{i→∞} Φ_i(B) = Φ(B)
for each B ∈ N. Then the Φ-optimal density p* exists, it is unique modulo v, and has the form (8.2.24).
Proof. Denote the Φ_i-optimal density by p_i*. The existence and uniqueness of the densities p* and p_i* follow from Theorem 8.2.1 and Proposition 8.2.5. Theorem 8.2.2 gives

(8.2.27)

where L1, L2,... are matrices from N. Without loss of generality, we can assume that ||L_i|| = 1 for all i = 1, 2,.... Let us now choose a subsequence {L_{i_j}} of the sequence {L_i} converging to a certain matrix L; notice that L ∈ N and ||L|| = 1.
Define the density p by formula (8.2.24) and note that the pointwise limit of the sequence {p_i*} exists.
Analogously to the proof of Lemma 1.11 in Fedorov (1979), one is able to show that if a limit point of the sequence {p_i*} exists, then this point is the Φ-optimal density p*. By virtue of the uniqueness of the limit, we have p* = p. Finally, from the uniqueness of the Φ-optimal density p* we obtain that the limit of the sequence {L_i} exists and equals L. The theorem is proved.
It follows from the theorem that, for many nondifferentiable criteria, the problem of optimal density construction is reduced again to the optimal choice of the matrix L in the representation (8.2.24).
Two nondifferentiable criteria of special importance are considered below.
Let Φ have the form (8.2.9), i.e. let it be the E-optimality criterion. Determine the functionals Φ_i by the formula Φ_i(B) = (tr B^i)^{1/i}. These functionals satisfy the inequality

(Σ_{k=1}^m a_k^i)^{1/i} ≥ (Σ_{k=1}^m a_k^j)^{1/j},   1 ≤ i ≤ j ≤ ∞   (8.2.28)
for any nonnegative numbers a1,...,am. Now we obtain the monotonicity of the convergence of Φ_i(B) to Φ(B) from the inequality (8.2.28) and the representation

tr B^i = Σ_{k=1}^m λ_k^i,
where λ1,...,λm are the eigenvalues of the matrix B. Hence, the analogue of Theorem 8.2.4 is applicable and the E-optimal density has the form (8.2.24).
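This monotone approximation of the E-criterion is easy to observe numerically: with Φ_i(B) = (tr B^i)^{1/i} as the approximating functionals, the values decrease in i toward the largest eigenvalue of B. A small check with a matrix whose spectrum is fixed by construction (the concrete spectrum is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
B = Q @ np.diag([4.0, 2.0, 1.0, 0.5]) @ Q.T   # positive definite, lambda_max = 4

lam_max = 4.0
# Phi_i(B) = (tr B^i)^(1/i) = (sum_k lambda_k^i)^(1/i), decreasing in i
phis = [np.trace(np.linalg.matrix_power(B, i)) ** (1.0 / i) for i in range(1, 40)]

assert all(a >= b - 1e-9 for a, b in zip(phis, phis[1:]))  # monotone decrease
assert abs(phis[-1] - lam_max) < 1e-6                      # limit is lambda_max
```

The monotone decrease is exactly the inequality (8.2.28) applied to the eigenvalues of B.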
Let us turn now to the MV-optimality criterion (8.2.6). First, let us simplify the statement of Theorem 8.2.2 for the MV-criterion. In the representation (8.2.23) we have

We shall show now that Theorem 8.2.4 can be applied to the MV-criterion. Determine the sequence {Φ_i} by

The convexity of these criteria follows from the Minkowski inequality and their monotone convergence to Φ follows by (8.2.28). Theorem 8.2.4 says that if the condition c) holds, then the MV-optimal density p* exists, it is unique modulo v, and has the form (8.2.24), where the matrix L is diagonal. This means that if the condition c) holds, then the MV-optimal density exists and has the form

(8.2.29)
where
Theorem 8.2.5. The MV-optimal density exists and can be represented in the form (8.2.29), where

v* = arg max_{v∈U} G(v),

G(v) = [ ∫ ( Σ_{k=1}^m v_k g_k²(x) )^{1/2} v(dx) ]² − Σ_{k=1}^m v_k a_kk

(here a_kk is the k-th diagonal element of the matrix A).
Proof. By virtue of Theorem 8.2.1, the MV-optimal density exists. Consider the value

I = min_{p∈S} max_{1≤i≤m} d_ii(p) = min_{p∈S} max_{v∈U} Σ_{i=1}^m v_i d_ii(p),

where d_ii(p) are the diagonal elements of the matrix D(p). The proof of Theorem 8.2.1 indicates that the set S is closed in measure and bounded in norm and the functional φ is convex and lower semicontinuous in measure. Using now the minimax theorem of Levin (1985), p.293, we have

I = max_{v∈U} I(v),   I(v) = min_{p∈S} Σ_{i=1}^m v_i d_ii(p).

By virtue of Section 8.2.5, the minimum I(v) is attained at the density p_v and equals G(v). This fact completes the proof.
Let us remark that under some additional restrictions on the functions g1,...,gm, Mikhailov (1984) proved a statement similar to Theorem 8.2.5; the latter is due to Mikhailov and Zhigljavsky (1988).
First we shall point out some general differences between the above exposition and classical regression design.
The first difference concerns the existence of an optimal design. Two conditions a) and b) on the optimality criterion are required to ensure the existence of the optimal density. These conditions do not figure in regression design theory; however, they are not restrictive.
The second is that in the present case the discreteness of the optimal measures is not desired, and is in fact not even admissible.
The third is that if the functions g1,...,gm are v-regular (see condition c)), then according to Proposition 8.2.4 a convex functional Φ on N induces a strictly convex functional φ on the set of densities P. This allows us to guarantee the uniqueness of optimal densities, e.g. for the E- and MV-criteria.
The fourth is that the structure of optimal densities is the same: for a large class of criteria, the optimal densities have the form (8.2.24).
Finally, the fifth difference is that there are no conditions concerning the set X except its measurability, and the functions g1,...,gm are not necessarily continuous.
In some cases the matrix A depends on unknown integrals and, hence, the optimal densities depend on them, too. In these cases the extremal problem (8.2.1) is analogous to the nonlinear regression design problem and the optimal density corresponds to a locally optimal design. Sequential, Bayesian or minimax approaches can be used to determine the optimal densities.
The optimality results (Theorems 8.2.2 and 8.2.3) are analogous to the equivalence theorems in classical regression design; apparently, theorems analogous to the duality theorems of Pukelsheim (1980) can also be proved.
(8.3.1)

and the evaluation number N tends to infinity (or is sufficiently large). The projection estimation problem consists of choosing the dimensions m = m(N) of the spaces L_m, the sequence of passive designs {x1,...,xN}, and a parametric regression estimation method under the supposition f ∈ L_m, in such a way that the obtained estimate

(8.3.2)

computed using the x_j and y_j (j = 1,...,N), is as accurate as possible. We shall assume that the estimation method of the θ_i is linear with respect to y1,...,yN.
Let the number N of regression function evaluations as well as the estimation algorithm be fixed. We have the following decomposition of the inaccuracy f − f̂ corresponding to an estimate f̂ of f:

(8.3.4)

The minimax quantity

sup_{f∈F} ρ(f − E f̂)   (8.3.5)

is often considered.
If a measure λ may be determined on the set F, reflecting additional information about the unknown function f ∈ F, then, instead of (8.3.5), the first summand in the right-hand side of (8.3.4) is usually replaced by the Bayesian criterion
As for ρ, the quadratic metric is considered below: for this, as easily seen, instead of the inequality (8.3.4) the equality

E ∫ (f(x) − f̂(x))² v(dx) = ∫ (f(x) − E f̂(x))² v(dx) + E ∫ (E f̂(x) − f̂(x))² v(dx)   (8.3.6)

takes place, where v(dx) is a given probability measure on (X,B). Consequently, the mean square summed inaccuracy

B² + V² = sup_{f∈F} ∫ (f − E f̂)² dv + E ∫ (E f̂ − f̂)² dv   (8.3.7)

completely characterizes the error of the method. The first term in (8.3.7), i.e. the quantity

B² = sup_{f∈F} ∫ (f − E f̂)² dv,

is the square of the so-called bias inaccuracy, while the second term

V² = E ∫ (E f̂ − f̂)² dv

is the mean square of the random inaccuracy. Since for all x ∈ X the variance of the estimate f̂(x) is equal to
Let θ1,...,θm be the Fourier coefficients of f with respect to the functions f_i (i = 1,...,m), θ = (θ1,...,θm)', and let

f̂(x) = θ̂'F(x),

where θ̂ is the vector of linear estimates of θ. Assume that σ²(x) = σ² = const for all x ∈ X.
By virtue of (8.3.2) and (8.3.8), V² is representable as

According to the classical Gauss-Markov theorem of regression analysis, the best linear unbiased estimates are the least square estimates for which, in particular, the quantity tr cov θ̂ is minimal. The main object of this subsection is the lower estimation of V; therefore we shall consider only the least square estimates, for which

(8.3.9)

where
is the normalized information matrix of the experimental design, which can be written as follows:

Proposition 8.3.1. Let the functions f1,...,fm be v-orthonormal and uniformly bounded by a constant K (f_i²(x) ≤ K for all i and x), let σ²(x) = σ² < ∞, let the estimates θ̂_i be linear statistics, and let the matrix M(ε_N) be nondegenerate. Then the inequality

(8.3.11)

holds.
The proof is based on the following statement.
Lemma 8.3.1. If A is a positive definite matrix of order m×m, then the inequality

tr A^{−1} ≥ m²/tr A   (8.3.12)

holds.
Proof. Let B be a positive definite matrix of order m×m and λ1,...,λm be its eigenvalues. By virtue of

tr B = Σ_{i=1}^m λ_i,   det B = Π_{i=1}^m λ_i

and the inequality between the geometric and arithmetic means,

(Π_{i=1}^m λ_i)^{1/m} ≤ m^{−1} Σ_{i=1}^m λ_i,

we obtain

tr B ≥ m (det B)^{1/m}.   (8.3.13)
Applying the latter inequality to the matrices B = A^{−1} and B = A, we obtain

tr A^{−1} ≥ m (det A^{−1})^{1/m} = m (det A)^{−1/m} ≥ m ((tr A)/m)^{−1} = m²/tr A.

The lemma is proved.
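The inequality (8.3.12) can also be checked numerically on randomly generated positive definite matrices (the sampling scheme below is, of course, only an illustration):

```python
import numpy as np

# Numerical check of Lemma 8.3.1: for positive definite A of order m x m,
# tr(A^{-1}) >= m^2 / tr(A), with equality iff A is a multiple of the identity.
rng = np.random.default_rng(1)
for _ in range(100):
    m = int(rng.integers(2, 6))
    M = rng.standard_normal((m, m))
    A = M @ M.T + 0.1 * np.eye(m)        # random positive definite matrix
    assert np.trace(np.linalg.inv(A)) >= m**2 / np.trace(A) - 1e-9
```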
Proof of Proposition 8.3.1. Let ε(dx) be any approximate design, i.e. any probability measure on (X,B), and

M(ε) = ∫ F(x)F'(x) ε(dx).

By Lemma 8.3.1,

tr (M(ε))^{−1} ≥ m²/tr M(ε) = m² / Σ_{i=1}^m ∫ f_i²(x) ε(dx) ≥ m/K.
Theorem 8.3.1. Let σ²(x) = σ² = const, N → ∞, m = m(N) → ∞, m/N → 0. Then the convergence order of the mean square inaccuracy (8.3.7) cannot be less than

(8.3.14)

for any projection estimate of the form (8.3.2), any method of linear parametric estimation and any sequence of passive designs (x1,...,xN), where
Omitting the proofs, which are rather tedious, we formulate here two examples of asymptotically optimal projection procedures involving deterministic designs (i.e. rules for the choice of the points x1,...,xN).
Let Z be the set of nonnegative integers, K = Z^n be the multi-index set, k = (k1,...,kn) ∈ K be a multi-index, {f_k(x), k ∈ K} be a complete v-orthonormal set of the functions f_k(x) = exp{−2πi(k,x)} on X = [0,1]^n, and L_m be the set of linear combinations of the functions f_k(x) corresponding to subsets K_q of K:

where

m = card K_q = Σ_{k∈K_q} 1,

||·|| is a given positive function on K, and (k,x) is the scalar product of k ∈ K and x ∈ X.
The functional set H_n^α for integer α ≥ 1 consists of real functions defined on X = [0,1]^n having continuous partial derivatives

Define first the functional class F = F(H_n^α) as the subset of H_n^α containing all periodical functions of H_n^α with period 1 with respect to each coordinate. For each f ∈ F(H_n^α) one has |θ_k| ≤ L ||k||^{−α}, where L is a constant.
Theorem 8.3.2. Set q = N^{1/(2α)}, select the lattice grids Ξ_N(5) of Section 2.2.1 as the experimental design, and estimate the Fourier coefficients θ_k by the least square algorithm. Then the summed mean square inaccuracy (8.3.7) of the projection estimate of a regression function f ∈ F(H_n^α) decreases with the rate

(8.3.16)

as N → ∞.
This is the optimal decrease rate of B² + V² on F(H_n^α) for any linear estimation method, any choice q = q(N), and any experimental design sequence; the proof of this statement can be found in Zhigljavsky (1985) and Ermakov and Zhigljavsky (1984).
Turn now to the functional set W_n^α consisting of functions f on X = [0,1]^n which have the derivatives

∂^β f / (∂x1^{β1} ··· ∂xn^{βn}),   0 ≤ β_i ≤ α,   β = Σ_{i=1}^n β_i ≤ αn;

define F = F(W_n^α) as the subset of W_n^α containing all periodical functions with period 1 with respect to each coordinate.
Zhigljavsky (1985) proved the analogue of Theorem 8.3.2 for the class F(W_n^α). Its formulation would coincide with that of Theorem 8.3.2 if we substitute F(W_n^α), the cube grids Ξ_N(1), and the rate N^{−1+n/(2α)} (N → ∞) for F(H_n^α), the lattice grids Ξ_N(5), and the relations (8.3.15) and (8.3.16), respectively.
We shall suppose below that the points x1, x2,... at which the regression function f is evaluated are random. In this case the most popular linear parametric estimation methods
are the least square and ordinary Monte Carlo. The least square method minimizes the
random inaccuracy V, but introduces a bias leading to a slight increase of the bias
inaccuracy B. Usually the increase of B is insignificant from the asymptotic point of view
and thus the least square method may usually be included into an asymptotically optimal
projection estimation procedure. The drawback of the least square estimation method is its
numerical complexity. The standard Monte Carlo method is much simpler than the least
square approach and produces unbiased estimates of the Fourier coefficients. This way,
its use gives the minimal value of the bias inaccuracy B, i.e.
(8.3.17)
On the other hand, the random inaccuracy V is larger than for the least square method. We
shall show that the increase of V is often insignificant from the asymptotic point of view
and thus the Monte Carlo estimation method may also be included into the asymptotically
optimal projection estimation procedure.
Let x1,...,xN be independent and identically distributed with a positive (on X) density p(x) and let {f1,...,fm} be a base of L_m. The standard Monte Carlo estimates of the Fourier coefficients

θ_i = ∫ f(x) f_i(x) v(dx),   i = 1,...,m,

are of the form

θ̂_i = N^{−1} Σ_{j=1}^N y(x_j) f_i(x_j) / p(x_j),   i = 1,...,m.   (8.3.18)
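A sketch of the estimates (8.3.18): v is taken uniform on X = [0,1], the base is trigonometric, and the regression function, noise level and sampling density p(x) = 0.5 + x are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 400_000

def f(x):                                    # regression function (illustrative)
    return 1.0 + 0.5 * np.cos(2 * np.pi * x)

def basis(x):                                # v-orthonormal base f_1, f_2, f_3
    return np.vstack([np.ones_like(x),
                      np.sqrt(2) * np.cos(2 * np.pi * x),
                      np.sqrt(2) * np.sin(2 * np.pi * x)])

# i.i.d. points with density p(x) = 0.5 + x on [0,1] (inverse transform sampling)
u = rng.random(N)
x = (-1.0 + np.sqrt(1.0 + 8.0 * u)) / 2.0
p = 0.5 + x

y = f(x) + 0.05 * rng.standard_normal(N)     # noisy evaluations of f

theta_hat = (basis(x) * (y / p)).mean(axis=1)     # the estimates (8.3.18)

# true Fourier coefficients: (1, 0.5/sqrt(2), 0)
assert np.max(np.abs(theta_hat - [1.0, 0.5 / np.sqrt(2), 0.0])) < 0.01
```

The reweighting by 1/p(x_j) is what makes the estimates unbiased for an arbitrary positive sampling density, as Lemma 8.3.2 below states.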
Set

g_i(x) = (f²(x) + σ²(x))^{1/2} f_i(x),   i = 1,...,m.
Lemma 8.3.2. Let x1,...,xN be independent and distributed according to a positive probability density p on X, let the evaluation errors ξ(x_j) = y(x_j) − f(x_j) be uncorrelated, and let their variance var ξ(x) = σ²(x) be bounded. Then the estimates (8.3.18) are unbiased and

The unbiasedness of the estimates yields (8.3.17) for the bias inaccuracy B. Consider now the random error V:
V² = V²(p) = E ∫ ( Σ_{i=1}^m θ̂_i f_i(x) − Σ_{i=1}^m θ_i f_i(x) )² v(dx) =

= Σ_{i,l=1}^m E (θ̂_i − θ_i)(θ̂_l − θ_l) ∫ f_i(x) f_l(x) v(dx) =

= N^{−1} tr [ D(p) ∫ F(x)F'(x) v(dx) ],
where D(p) = N cov θ̂ is the normalized covariance matrix, representable in the form (8.2.3) with the substitution of θ for τ. Hence, it follows that the optimal choice criterion for the density p is linear and has the form (8.2.8) with

L = ∫ F(x)F'(x) v(dx),
and the optimal density has the form (8.2.24). The minimal value of the squared random inaccuracy is attained at this density and equals

(8.3.20)

Due to Theorem 8.3.1, (8.3.20) is not less than cm/N, where c is a positive number. At the same time, in a rather general case, there exist a density p and a constant C ≥ 1 such that

(8.3.21)
The inequality (8.3.21) is valid, for instance, for the cases considered in Section 8.3.3 and in the case, common in optimization theory, when v(X) < ∞, the functions f(x) and σ²(x) are bounded and the base functions f_i (i = 1,...,m) are v-orthonormal and uniformly bounded with respect to i. In the latter case, the uniform (on X) density can be chosen as p. It follows that for orthonormal functions f1,...,fm the equality

V²(p) = N^{−1} ( ∫ (f²(x) + σ²(x)) Σ_{i=1}^m f_i²(x) / p(x) v(dx) − Σ_{i=1}^m θ_i² )

is valid, i.e. the random error attains the optimal order of decrease for m, N → ∞. Taking into account that the bias inaccuracy is minimal, we get the following statement.
Theorem 8.3.3. Let (8.3.21) be fulfilled for some constant C ≥ 1 and positive density p on X. Then the independent random choice of the points x1,...,xN in accordance with the density p, the method of parametric estimation (8.3.18), and a suitable choice of m = m(N) (the latter leading to equal decrease orders of Σ²(F,L_m) and (8.3.22)) form an asymptotically optimal procedure of projection estimation of a regression function f ∈ F.
As indicated above, for the functional classes F and sequences of spaces {L_m} considered in Section 8.3.3, the condition (8.3.21) holds, in particular, for the uniform density on X. So the projection estimation procedure of Theorem 8.3.3 with the uniform density is asymptotically optimal. Note also that the above random designs have advantages compared with the deterministic ones given in Section 8.3.3, viz., they are simpler to construct and they possess the composite property described in Section 2.2.1.
Let us turn now to the construction and study of the least square parametric estimation through regression function evaluations at independent random points.
Let m be fixed, the base functions f1,...,fm be v-orthonormal, the points x1,...,xN at which a regression function is being evaluated be random, independent and identically distributed with a density p (with respect to the measure v) which is v-almost everywhere positive (p > 0 (mod v), ∫ p dv = 1) on X, the evaluations of f be uncorrelated, and their variance σ²(x) = Eξ²(x) be uniformly bounded.
Denote by θ = (θ1,...,θm)' the vector of Fourier coefficients of f with respect to the base functions f1,...,fm and set
The estimation of the unknown parameters (i.e. the Fourier coefficients θ_i) by the least square method is as follows. Supposing that f is a linear combination of the functions f1,...,fm, let us derive the simultaneous equations

Σ_{i=1}^m θ_i f_i(x_j) = y(x_j),   j = 1,...,N,

multiply the j-th equation by (p(x_j))^{−1/2} and obtain Aθ = Y, where
Suppose now that the matrix A'A is nondegenerate. Then the least square estimate of θ has the form

θ̂ = (A'A)^{−1} A'Y.   (8.3.23)

The matrix N^{−1}A'A converges for N → ∞ to the unit matrix I_m; the order of the convergence rate equals O(N^{−1/2}). It follows that, in the case of existence of the inverse matrix (A'A)^{−1}, the asymptotic relation (8.3.24) (N → ∞) is valid.
Consider now the case of a v-regular system of functions f1,...,fm with respect to the measure v, i.e. such a system that for all N ≥ m the measure of the set of points {x1,...,xN} for which the matrix A'A is degenerate equals zero.
Theorem 8.3.4. Let f, f_i ∈ L2(X,v), i = 1,...,m, let m be fixed, N → ∞, and let the collection of v-orthonormal functions f1,...,fm be regular. Then for the estimate (8.3.23) the asymptotic representations
Let us comment on the above assertion. First, if for the inverse matrix (A'A)^{−1} in (8.3.23) its initial approximation N^{−1}I_m is substituted, then the standard Monte Carlo estimate (8.3.18) is obtained. Second, if p is chosen as

p(x) = m^{−1} Σ_{i=1}^m f_i²(x),   (8.3.27)

then

E θ̂ = θ + O(N^{−2}),   N → ∞.

It is not difficult to show that for the density (8.3.27) the exact unbiasedness E θ̂ = θ also takes place. Third, analogously with the case when the Fourier coefficients are estimated by the standard Monte Carlo method, the minimization problem of a convex functional of the matrix

∫ [ (f²(x) + σ²(x)) / p(x) ] F(x)F'(x) v(dx) = N cov θ̂ + O(N^{−1}),   N → ∞,

with respect to p can be stated. Ignoring the biasedness of the estimates obtained, the results will be similar to those stated in Section 8.2.
Note finally that if the function collection {f1,...,fm} is v-regular, N = km, the so-called Ermakov-Zolotukhin points having the joint density (see Ermakov (1975))

are used for k groups of m points, and the least square estimates of the Fourier coefficients of f with respect to f1,...,fm are used, then by virtue of the results in Ermakov (1975) one has

(N → ∞, m → ∞)

for the summed inaccuracy (8.3.7); for details see Zhigljavsky (1985). Comparing this estimate with (8.3.14), we conclude that the Ermakov-Zolotukhin points cannot, in general, be included into an asymptotically optimal projection estimation procedure.
REFERENCES
Aluffi-Pentini F., Parisi V. and Zirilli F. (1985) Global optimization and stochastic
differential equations. J. Optimiz. Theory and Applic., 47, No.1, 1-16.
Anderson A. and Walsh G.R. (1986) A graphical method for a class of Branin trajectories. J. Optimiz. Theory and Applic., 49, No.3, 367-374.
Anderssen R.S. and Bloomfield P. (1975) Properties of the random search in global
optimization. J. Optimiz. Theory and Applic., 16, No.5/6, 383-398.
Anily S. and Federgruen A. (1987) Simulated annealing methods with general acceptance
probabilities. J. Appl. Probab., 24, No.3, 657-667.
Archetti F. and Betro B. (1979) A probabilistic algorithm for global optimization. Calcolo,
16, No.3, 335-343.
Archetti F. and Betro B. (1980) Stochastic models and optimization. Bollettino della Unione Matematica Italiana, 17-A, No.5, 225-301.
Archetti F. and Schoen F. (1984) A survey on the global optimization problem: general theory and computational approaches. Ann. Oper. Res., 1, 87-110.
Ariyawansa K.A. and Templeton J.G.C. (1983) On statistical control of optimization.
Optimization, 14, No.2, 393-410.
Avriel M. (1976) Nonlinear Programming: Analysis and Methods, Prentice-Hall,
Englewood Cliffs, New Jersey e.a.
Baba N. (1981) Convergence of a random optimization method for constrained optimization problems. J. Optimiz. Theory and Applic., 33, No.4, 451-461.
Basso P. (1982) Iterative methods for the localization of the global maximum. SIAM J.
Numer. Anal., 19, No.4, 781-792.
Bates D. (1983) The derivative of X'X and its uses. Technometrics, 25, No.4, 373-376.
Batishev D.I. (1975) Search Methods of Optimal Construction. Soviet Radio, Moscow, 216 p. (in Russian).
Batishev D.I. and Lyubomirov A.M. (1985) Application of pattern recognition methods to searching for a global minimum of a multivariate function. Problems of Cybernetics, Vol.122 (ed. V.V. Fedorov), 46-60 (in Russian).
Beale E.M.L. and Forrest J.J.H. (1978) Global optimization as an extension of integer
programming. In: Towards Global Optimization Vol.2. North Holland, Amsterdam e.a.,
131-149.
Bekey G.A. and Ung M.T. (1974) A comparative evaluation of two global search
algorithms. IEEE Trans. on Systems, Man and Cybernetics, No.1, 112-116.
Betro B. and Schoen F. (1987) Sequential stopping rules for the multistart algorithm in global optimization. Mathem. Programming, 38, No.2, 271-286.
Betro B. and Vercellis C. (1986) Bayesian nonparametric inference and Monte Carlo
optimization. Optimization, 17, No.5, 681-694.
Betro B. and Zielinski R. (1987) A Monte Carlo study of a Bayesian decision rule
concerning the number of different values of a discrete random variable. Commun. Statist.:
Simulation, 16, No.4, 925-938.
Boender C.G.E., and Zielinski R. (1985) A sequential Bayesian approach to estimating the
dimension of a multinomial distribution. In: Sequential Methods in Statistics. Banach
Center Publications, Vo1.16, Polish Scientific Publishers, Warsaw, 37-42.
Boender G., Rinnooy Kan A., Stougie L., and Timmer G. (1982) A stochastic method for global optimization. Mathem. Programming, 22, No.1, 125-140.
Boender C.G.E. (1984) The generalized multinomial distribution: A Bayesian analysis and applications. Ph.D. Thesis, Erasmus University, Rotterdam.
Boender C.G.E. and Rinnooy Kan A.H.G. (1987) Bayesian stopping rules for multistart global optimization methods. Mathem. Programming, 37, No.1, 59-80.
Bohachevsky I.O., Johnson M.E., and Stein M.L. (1986) Generalized simulated annealing
for function optimization. Technometrics, 28, No.3, 209-217.
Branin F.H. (1972) A widely convergent method for finding multiple solutions of simultaneous non-linear equations. IBM J. Res. Develop., 16, 504-522.
Branin F.H. and Hoo S.K. (1972) A method for finding multiple extrema of a function of n variables. In: Numerical Methods for Nonlinear Optimization. Academic Press, London e.a., 231-237.
Brooks S.H. (1958) Discussion of random methods for locating surface maxima. Oper.
Res., 6, 244-251.
Chen J. and Rubin H. (1986) Drawing a random sample from a density selected at random. Comput. Statist. and Data Analys., 4, No.4, 219-227.
Chernousko F.L. (1970) On the optimal search of extremum for unimodal functions. USSR Comput. Mathem. and Mathem. Phys., 10, No.4, 922-933.
Chichinadze V.K. (1967) Random search to determine the extremum of a function of
several variables. Engineering Cybernetics, No.5, 115-123.
Chichinadze V.K. (1969) The Ψ-transform for solving linear and nonlinear programming problems. Automatica, 5, No.3, 347-355.
Chuyan O.R. (1986) Optimal one-step maximization of twice differentiable functions. USSR Comput. Mathem. and Mathem. Phys., 26, No.3, 381-397.
Clough D.J. (1969) An asymptotic extreme value sampling theory for the estimation of a global maximum. Canad. Oper. Res. Soc. J., 7, No.1, 105-115.
Cohen J.P. (1986) Large sample theory for fitting an approximate Gumbel model to
maxima. Sankhya, A48, 372-392.
Cook P. (1979) Statistical inference for bounds of random variables. Biometrika, 66,
No.2, 367-374.
Cook P. (1980) Optimal linear estimation of bounds of random variables. Biometrika, 67, No.1, 257-258.
Corana A., Marchesi M., Martini C., and Ridella S. (1987) Minimizing multimodal functions of continuous variables with the simulated annealing algorithm. ACM Trans. on Mathem. Software, 13, No.3, 262-280.
Corles C. (1975) The use of regions of attraction to identify global minima. In: Towards Global Optimization, Vol.1, North Holland, Amsterdam e.a., 55-95.
Crowder H.P., Dembo R.S. and Mulvey J.M. (1978) Reporting computational experiments in mathematical programming. Mathem. Programming, 15, 316-329.
Csendes T. (1985) A simple but hard-to-solve global optimization test problem. Presented
at the IIASA Workshop on Global Optimization (held in Sopron, Hungary).
Csorgo S. and Mason D.M. (1989) Simple estimators of the endpoint of a distribution. In:
Extreme Value Theory, Oberwolfach, 1987 (eds. Hüsler J. and Reiss R.-D.), Lecture Notes
in Statistics, Springer, Berlin e.a.
Csorgo S., Deheuvels P. and Mason D.M. (1985) Kernel estimates of the tail index of a
distribution. Ann. Statist., 13, No.3, 1050-1077.
Dannenbring D.G. (1977) Procedures for estimating optimal solution values for large
combinatorial problems. Management Science, 23, 1273-1283.
de Biase L. and Frontini F. (1978) A stochastic method for global optimization: its
structure and numerical performance. In: Towards Global Optimization, Vol.2. North
Holland, Amsterdam e.a., 85-102.
de Haan L. (1970) On Regular Variation and its Application to the Weak Convergence of
Sample Extremes. North Holland, Amsterdam e.a., 104 p.
de Haan L. (1981) Estimation of the minimum of a function using order statistics, J. Amer.
Statist. Assoc., 76, No.374, 467-475.
Dekkers A.L.M. and de Haan L. (1987) On a consistent estimate of the index of an extreme
value distribution. Rept. Cent. Math. and Comput. Sci., No. MS-R8710, 1-15.
Demidenko E.Z. (1989) Optimization and Regression. Nauka, Moscow (in Russian).
Dennis J.E. and Schnabel R.B. (1983) Numerical Methods for Unconstrained Optimization
and Nonlinear Equations. Prentice-Hall, Englewood Cliffs, New Jersey e.a.
Devroye L. (1986) Nonuniform Random Variate Generation. Springer, Berlin e.a., 843 p.
Devroye L. and Gyorfi L. (1985) Nonparametric Density Estimation: the Ll View. Wiley,
New York e.a.
Dixon L.C.W. and Szego G.P., eds (1975) Towards Global Optimization, VoLl. North
Holland, Amsterdam e.a., 472 p.
Dixon L.C.W. and Szego G.P., eds (1978) Towards Global Optimization, Vol.2. North
Holland, Amsterdam e.a., 364 p.
Dorea C.C.Y. (1986) Limiting distribution for random optimization methods. SIAM J.
Control and Optimization, 24, No.1, 76-82.
Dorea C.C.Y. (1987) Estimation of the extreme value and the extreme points. Ann. Inst.
Statist. Mathem., 39, No.1, 37-48.
Dunford N. and Schwartz J.T. (1958) Linear Operators, Part 1.: General Theory,
Interscience Publishers, New York e.a.
Duran B.S., and Odell P.L. (1977) Cluster Analysis: A Survey. Springer, Berlin e.a.
Ermakov S.M. (1975) Monte Carlo Methods and Related Problems. Nauka, Moscow, 472
p. (in Russian).
Evtushenko Yu.G. (1971) Algorithm for finding the global extremum of a function (case of
a non-uniform mesh). USSR Comput. Mathem. and Mathem. Phys., 11, No.6, 1390-
1403.
Falk M. (1983) Rates of uniform convergence of extreme order statistics. Ann. Inst.
Statist. Mathem., 38, part A, No.2, 245-265.
Fedorov V.V. (1979) Numerical Maximin Methods. Nauka, Moscow, 280 p. (in Russian).
Fedorov V.V., ed. (1985) Problems of Cybernetics, No. 122: Models and Methods of
Global Optimization. USSR Academy of Sciences, Moscow (in Russian).
Galambos J. (1978) The Asymptotic Theory of Extreme Order Statistics. Wiley, New York
e.a.
Galperin E.A. (1988) Precision, complexity and computational schemes of the cubic
algorithm. J. Optimiz. Theory and Applic., 57, No.2, 223-238.
Galperin E.A. and Zheng Q. (1987) Nonlinear observation via global optimization
methods: measure theory approach. J. Optimiz. Theory and Applic., 54, No.1, 63-92.
Ganshin G.S. (1976) Calculation of the maximal value of functions. USSR Comput.
Mathem. and Mathem. Phys., 16, No.1, 30-39.
Ganshin G.S. (1977) Optimal algorithms of calculation of the highest value of functions.
USSR Comput. Mathem. and Mathem. Phys., 17, No.3, 562-571.
Ganshin G.S. (1979) The optimization of algorithms on classes of nets. USSR Comput.
Mathem. and Mathem. Phys., 19, No.14, 811-821.
Gaviano M. (1975) Some general results on the convergence of random search algorithms
in minimization problems. In: Towards Global Optimization, Vol.1, North Holland,
Amsterdam e.a., 149-157.
Ge R.P. (1983) A filled function method for finding a global minimizer. Presented at the
Dundee Biennial Conference on Numerical Analysis.
Ge R.P. and Qin Y.F. (1987) A class of filled functions for finding global minimizers of a
function of several variables. J. Optimiz. Theory and Applic., 54, No.2, 241-252.
Geman S. and Hwang C.-R. (1986) Diffusions for global optimization. SIAM J. Control
and Optimization, 24, No.5, 1031-1043.
Golden B.L. and Alt F.B. (1979) Interval estimation of a global optimum for large
combinatorial problems. Naval Res. Logistics Quarterly, 26, No.1, 69-77.
Griewank A.O. (1981) Generalized descent for global optimization. J. Optimiz. Theory
and Applic., 34, No.1, 11-39.
Haines L.M. (1987) The application of the annealing algorithm to the construction of exact
optimal design for linear regression models. Technometrics, 29, No.4, 439-448.
Hall P. (1982) On estimating the endpoint of a distribution. Ann. Statist., 10, No.2, 556-
568.
Hansen E. (1979) Global optimization using interval analysis: the one-dimensional case. J.
Optimiz.Theory and Applic., 29, No.3, 331-334.
Hansen E. (1980) Global optimization using interval analysis: the multidimensional case.
Numer. Math., 34, 247-270.
Hansen E. (1984) Global optimization with data perturbations. Comput. Oper. Res., 11,
No.2, 97-104.
Harris T.E. (1963) The Theory of Branching Processes. Springer, Berlin e.a.
Hartley H.O. and Ruud P.G. (1969) Computer optimization of second order response
surface designs. In: Statistical Computations, Proceedings of a Conference held at the
University of Wisconsin, Madison, Wisconsin, April 29-30, 1969, 441-462.
Hock W. and Schittkowski K. (1981) Test Examples for Nonlinear Programming Codes.
Lecture Notes in Economics and Mathematical Systems, No.187, Springer, Berlin e.a.,
177 p.
Hua L.K. and Wang Y. (1981) Applications of Number Theory to Numerical Analysis.
Springer, Berlin e.a.
Ichida K. and Fujii Y. (1979) An interval arithmetic method for global optimization.
Computing, 23, 85-97.
Incerti S., Parisi V. and Zirilli F. (1979) A new method for solving nonlinear simultaneous
equations. SIAM J. on Numerical Analysis, 16, 779-789.
Isaac R. (1988) A limit theorem for probabilities related to the random bombardment of a
square. Acta Mathematica Hungarica, 51, No.1-2, 85-97.
Ivanov V.V. (1972) On optimal algorithms of minimization in the class of functions with
the Lipschitz condition. Information Processing, 2, 1324-1327.
Ivanov V.V., Girlin S.K. and Ludvichenko V.A. (1985) Problems and results of global
optimization for smoothing functions. Problems of Cybernetics, Vol.122 (ed. Fedorov
V.V.), 3-13 (in Russian).
Jacobsen S. and Torabi M. (1978) A global minimization algorithm for a class of one-
dimensional functions. J. of Mathem. Analysis and Applic., 62, 310-324.
Janson S. (1987) Maximal spacings in several dimensions. Ann. Probab., 15, No.1, 274-
280.
Kantorovich L.V. and Akilov G.P. (1977) Functional Analysis. Nauka, Moscow, 744 p.
(in Russian).
Kashtanov, Yu. N. (1987) The rate of convergence towards its own distribution of an
integral operator in the generation method. Vestnik of Leningrad University, No.1, 17-21.
Katkovnik V.Ya. (1976) Linear Estimators and Stochastic Optimization Problems. Nauka,
Moscow, 488 p. (in Russian).
Khairullin R.H. (1980) On the estimation of the critical parameter of the branching
processes of a special kind. Izvestiya Vuzov, Matematika. No.8, 77-84 (in Russian).
Khasminsky R.Z. (1965) Application of random noise in optimization and recognition
problems. Problems of Information Transfer, 1, No.3, 113-117 (in Russian).
Kiefer J. (1953) Sequential minimax search for a maximum. Proc. Amer. Math. Soc., 4,
No.3, 502-506.
Kirkpatrick S., Gelatt C.D. and Vecchi M.P. (1983) Optimization by simulated annealing.
Science, 220, 671-680.
Kolesnikova C.N., Kornikov V.V., Rozhkov N.N. and Khovanov N.V. (1987)
Stochastic processes with equiprobable monotone realizations as models of information
deficiency. Vestnik of Leningrad University, No.1, 21-26.
Korjakin A.I. (1983) The estimation of a function from randomized observations. USSR
Comput. Mathem. and Mathem. Phys., 23, No.1, 21-28.
Kushner H.J. (1964) A new method of locating the maximum point of an arbitrary
multipeak curve in the presence of noise. Trans.ASME, ser. D, 86, No.1, 97-105.
Kushner H.J. (1987) Asymptotic global behavior of stochastic approximation and
diffusion with slowly decreasing noise effects: global minimization via Monte Carlo.
SIAM J. Appl. Math., 47, No.1, 169-184.
Lbov G.S. (1972) Training for extremum determination of a function of variables
measured on nominal names scale. In: Second Intern. Conf. on Artificial Intelligence.
London, 418-423.
Levin V.L. (1985) Convex Analysis in Spaces of Measurable Functions and Applications
in Mathematics and Economics. Nauka, Moscow, 316 p. (in Russian).
Levy A.V. and Montalvo A. (1985) The tunneling algorithm for the global minimization of
functions. SIAM J. Sci. Statist. Comput., 6, No.1, 15-29.
Lindgren G. and Rootzen H. (1987) Extreme values: Theory and technical applications.
Scand. J. Statist., 14, No.4, 241-279.
Loeve M. (1963) Probability Theory. D. Van Nostrand, New York e.a.
Loh W.-Y. (1984) Estimating an endpoint of a distribution with resampling methods. Ann.
Statist., 12, No.4, 1543-1550.
Makarov I.M., Radashevich Yu. B. (1981) Statistical estimation of accuracy in constrained
extremal problems. Automation and Remote Control, No.3, 41-48.
Mancini L.J. and McCormick G.P. (1979) Bounding global minima with interval
arithmetic. Oper. Res. 27, No.4, 743-754.
Mancini L.J. and McCormick G.P. (1976) Bounding global minima. Mathem. of Oper.
Res., 1, No.1, 50-53.
Mandal N.K. (1981) D-optimal designs for estimating the optimum point in a multifactor
experiment. Calcutta Statist. Assoc. Bull., 30, No.119-120, 145-169.
Mandal N.K. (1989) D-optimal designs for estimating the optimum point in a quadratic
response surface - rectangular region. J. Statistical Planning and Inference, 23, 243-252.
McMurtry G.J. and Fu K.S. (1966) A variable structure automaton used as a multimodal
searching technique. IEEE Trans. on Automatic Control, 11, No.3, 379-387.
Meewella C.C., and Mayne D.Q. (1988) An algorithm for global optimization of Lipschitz
continuous functions. J. Optimiz. Theory and Applic., 57, No.2, 307-322.
Metropolis N., Rosenbluth A.W., Rosenbluth M.N. and Teller A.H. (1953) Equations of
state calculations by fast computing machines. J. Chem. Phys., 21, 1087-1091.
Mikhailov G.A. (1966) Calculation of critical systems by the Monte Carlo method. USSR
Comput. Mathem. and Mathem. Phys., 6, No.1, 71-80.
Mikhailov G.A. (1984) Minimax theory of weighted Monte Carlo methods, USSR
Comput. Mathem. and Mathem. Phys., 24, No.9, 1294-1302.
Mikhailov G.A. (1987) Optimizing Weighted Monte Carlo Methods. Nauka, Moscow, 240
p. (in Russian).
Mikhailov G.A., Zhigljavsky A.A. (1988) Uniform optimization of weighted Monte Carlo
estimates. Dokl. Akad. Nauk SSSR, 303, No.2, 290-293.
Mikhalevich V.S., Gupal A.M. and Norkin V.I. (1987) Methods of Nonsmooth
Optimization. Nauka, Moscow (in Russian).
Moore R.E. (1966) Interval Analysis. Prentice-Hall, Englewood Cliffs, New Jersey e.a.
Myers R.H. and Khuri A.I. (1979) A new procedure for steepest ascent. Commun.
Statist., A8, No.14, 1359-1376.
Nefedov V.N. (1987) Searching the global maximum of a function with several arguments
on a set given by inequalities. USSR Comput. Mathem. and Mathem. Phys., 27, No.1,
35-51.
Nelder J.A. and Mead R. (1965) A simplex method for function minimization. Computer
Journal, 7, 308-313.
Neveu J. (1964) Bases Mathématiques du Calcul des Probabilités. Masson et Cie, Paris.
Niederreiter H. (1978) Quasi-Monte-Carlo methods and pseudorandom numbers. Bull.
Amer. Mathem. Soc., 84, No.6, 957-1041.
Niederreiter H. (1986) Low-discrepancy point sets. Monatsh. Mathem., 102, No.2, 155-
167.
Niederreiter H. and McCurley M. (1979) Optimization of functions by quasi-random
search methods. Computing, 22, 119-123.
Niederreiter H. and Peart P. (1982) A comparative study of quasi-Monte Carlo methods for
optimization of functions of several variables, Caribbean J. Math., 1, 27-44.
Pardalos P.M. and Rosen J.B. (1987) Constrained Global Optimization: Algorithms and
Applications. Lecture Notes in Computer Science 268, Springer-Verlag, Berlin Heidelberg
New York.
Pickands J. III. (1975) Statistical inference using extreme order statistics. Ann. Statist., 3,
No.1, 119-131.
Pinkus M. (1968) A closed form solution of certain programming problems. Oper. Res.
16, 690-694.
Pinter J. (1988) Branch-and-bound methods for solving global optimization problems with
Lipschitzian structure. Optimization 19, No.1, 101-110.
Piyavskii, S.A. (1967) An algorithm for finding the absolute minimum of a function.
Theory of Optimal Solutions, No.2. Kiev, IK AN USSR, pp. 13-24. (In Russian).
Piyavsky, S.A. (1972) An algorithm for finding the absolute extremum of a function.
USSR Comput. Mathem. and Mathem. Phys., 12, No.4, 888-896.
Price W.L. (1983) Global optimization by controlled random search. J. Optimiz. Theory
and Applic., 40, No.2, 333-348.
Price W.L. (1987) Global optimization algorithms for a CAD workstation. J. Optimiz.
Theory and Applic., 55, No.1, 133-146.
Pronzato L., Walter E., Venot A. and Lebrichec J.P. (1984) A general purpose global
optimizer: Implementation and applications. Mathem. Comput. Simul., 26, 412-422.
Pshenichny B.N. and Marchenko D.L. (1967) On an approach to evaluation of the global
minimum. Theory of Optimal Decisions, Vol. 2. Kiev, 3-12 (in Russian).
Rastrigin L.A. (1968) Statistical Search Methods. Nauka, Moscow, 376 p. (in Russian).
Ratschek H. and Rokne J. (1984) Computer Methods for the Range of Functions. Ellis
Horwood-Wiley, New York.
Resnick S.I. (1987) Extreme Values, Regular Variation and Point Processes. Springer,
Berlin e.a.
Revuz D. (1975) Markov Chains. North Holland, Amsterdam e.a., 334 p.
Rinnooy Kan A.H.G. and Timmer G.T. (1987a) Stochastic global optimization methods. I.
Clustering methods. Mathem. Programming, 39, No.1, 27-56.
Rinnooy Kan A.H.G. and Timmer G.T. (1985) A stochastic approach to global
optimization. In: Numerical Optimization, 1984. SIAM, Philadelphia, Pa., 245-262.
Rinnooy Kan A.H.G. and Timmer G.T. (1987b) Stochastic global optimization methods.
II. Multilevel methods. Mathem. Programming, 39, No.1, 57-78.
Robson D.S. and Whitlock J.H. (1964) Estimation of a truncation point. Biometrika, 51,
No.1, 33-39.
Rosen J.B. (1983) Global minimization of a linearly constrained concave function by
partition of the feasible domain. Mathem. Res., 8, 215-230.
Sager T. (1983) Estimating modes and isopleths. Commun. Statist.: Theory and Meth., 12,
No.5, 529-557.
Saltenis (1989) Analysis of optimization problem structure. Mokslas, Vilnius (in Russian).
Schoen F. (1982) On a sequential search strategy in global optimization problems. Calcolo,
19, No.3, 321-334.
Schwartz L. (1967) Analyse Mathematique, V.1, Hermann, Paris.
Seneta E. (1976) Regularly-varying functions. Lecture Notes in Mathem., 508, Springer,
Berlin e.a.
Shen Z. and Zhu Y. (1987) An interval version of Shubert's iterative method for the
localization of the global maximum. Computing, 38, 275-280.
Shubert B.O. (1972) A sequential method seeking the global maximum of a function.
SIAM J. on Numer. Analysis, 9, No.3, 379-388.
Silverman B.W. (1983) Some properties of a test for multimodality based on kernel density
estimates. London Mathem. Soc. Lecture Note Ser., No.79, 248-259.
Smith R.L. (1982) Uniform rates of convergence in extreme value theory. Adv.Appl.
Probab., 14, 600-622.
Smith R.L. (1987) Estimating tails of probability distributions. Ann. Statist., 15, No.3,
1174-1207.
Snyman J.A. and Fatti L.P. (1987) A multistart global minimization algorithm with
dynamic search trajectories. J. Optimiz. Theory and Applic., 54, No.1, 121-141.
Sobol I.M. (1969) Multivariate Quadrature Formulas and Haar Functions. Nauka,
Moscow, 288 p. (in Russian).
Sobol I.M. (1982) On an estimate of the accuracy of a simple multidimensional search.
Dokl. Akad. Nauk SSSR, 266, 569-572 (in Russian).
Sobol I.M. and Statnikov R.B. (1981) Optimal Choice of Parameters in Multicriteria
Problems. Nauka, Moscow, 110 p. (in Russian).
Solis F.J. and Wets R.J.-B. (1981) Minimization by random search techniques. Mathem. Oper.
Res., 6, No.1, 19-30.
Spircu L. (1978) Cluster analysis in global optimization. Economic Computation and
Economic Cybernetic Studies and Research, 13, 43-50.
Strongin R.G. (1978) Numerical Methods in Multiextremal Optimization. Nauka,
Moscow, 240 p. (in Russian).
Sukharev A.G. (1971) On optimal strategies for an extremum search. USSR
Comput. Mathem. and Mathem. Phys., 11, No.4, 910-924.
Sukharev A.G. (1975) Optimal Search of the Extremum. Moscow University Press. 100
p. (in Russian).
Torn A.A. and Zilinskas A. (1989) Global Optimization. Lecture Notes in Computer
Science 350, Springer-Verlag, Berlin Heidelberg New York.
Turchin V.F. (1971) On the calculation of multivariate integrals by the Monte Carlo
method, Theory of Probab. and Applic., 16, No.4, 738-741.
van Laarhoven P.J.M. and Aarts E.H.L. (1987) Simulated Annealing: Theory and
Applications, Kluwer Academic Publishers, Dordrecht e.a., 198 p.
Vasiliev F.P. (1981) Methods for Solving Extremal Problems. Nauka, Moscow, 400 p. (in
Russian).
Vilkov A.V., Zhidkov N.P. and Shchedrin B.M. (1975) A method for finding the global
minimum of a function of one variable. USSR Comput. Mathem. and Mathem. Phys., 15,
No.4, 1040-1042.
Walster G.W., Hansen E.R. and Sengupta S. (1985) Test results for a global optimization
algorithm. In: Numerical Optimization (eds. Boggs P.T. and Byrd R.H.), SIAM, New
York, 272-287.
Watt V.P. (1980) A note on estimation of bounds of random variables. Biometrika, 67,
No.3, 712-714.
Weisman I. (1981) Confidence intervals for the threshold parameter. Commun. Statist.,
A10, No.6, 549-557.
Weisman I. (1982) Confidence intervals for the threshold parameter II: unknown shape
parameter. Commun. Statist.: Theory and Meth., 21, 2451-2474.
Weiss L. (1971) Asymptotic inference about a density function at the end of its range.
Naval Res. Logistics Quarterly, 18, No.1, 111-115.
Wood G.R. (1985) Multidimensional bisection and global minimization. Presented at the
IIASA Workshop on Global Optimization. (Held in Sopron, Hungary).
Yakowitz S.J. and Fisher L. (1973) On sequential search for the maximum of an unknown
function. J. Math. Anal. and Appl., 41, No.1, 234-259.
Yamashita H. (1979) A continuous path method of optimization and its application to
global optimization. In: Survey of Mathematical Programming, Vol.1/A, Budapest, 539-
546.
Zaliznyak N.F. and Ligun A.A. (1978) On optimum strategy in search of global maximum
of function. USSR Comput. Mathem. and Mathem. Phys., 18, No.2, 314-321.
Zanakis S.H. and Evans J.R. (1981) Heuristic optimization: why, when and how to use it.
Interfaces, 11, No.5, 84-91.
Zang I. and Avriel M. (1975) On functions whose local minima are global. J. Optimiz.
Theory and Applic., 16, No.3/4, 183-190.
Zhidkov N.P. and Shchedrin B.M. (1968) A certain method of search for the minimum of a
function of several variables. Computing Methods and Programming, Moscow University
Press, Vol.10, 203-210 (in Russian).
Zhigljavsky A.A. (1985) Mathematical Theory of Global Random Search. Leningrad
University Press, 296 p. (in Russian).
Zhigljavsky A.A. (1987) Monte Carlo methods for estimating functionals of the supremum
type. Doctoral thesis. Leningrad University (in Russian).
Zhigljavsky A.A. and Terentyeva M.V. (1985) Statistical methods in global random
search. Testing of statistical hypotheses. Vestnik of Leningrad University, No.15, 89-91.
Zhigljavsky A.A. (1981) Investigation of probabilistic global optimization methods.
Candidate's Thesis. Leningrad University (in Russian).
Zhigljavsky A.A. (1988) Optimal designs for estimating several integrals. In: Optimal Design
and Analysis of Experiments (eds. Y. Dodge, V.V. Fedorov and H.P. Wynn). North
Holland, Amsterdam e.a., 81-95.
Zielinski R. (1980) Global stochastic approximation: a review of results and some open
problems. In: Numerical Techniques for Stochastic Systems (eds. Archetti F. and Cugiani
M.). North Holland, Amsterdam e.a., 379-386.
Zielinski R. (1981) A statistical estimate of the structure of multiextremal problems.
Mathern. Programming, 21, 348-356.
Zielinski R., and Neumann P. (1983) Stochastische Verfahren zur Suche nach dem
Minimum einer Funktion. Akademie-Verlag, Berlin, 136 p.
Zilinskas A. (1978) Optimization of one-dimensional multimodal functions. Algorithm AS
133. Applied Statistics, 23, 367-375.
Zilinskas A. (1981) Two algorithms for one-dimensional multimodal minimization.
Optimization, 12, No.1, 53-63.
Zilinskas A. (1982) Axiomatic approach to statistical models and their use in multimodal
optimization theory. Mathern. Programming, 22, No.1, 104-116.
Zilinskas A. (1984) On justification of use of stochastic functions for multimodal
optimization models. Ann. Oper. Res. 1, 129-134.
Zilinskas A.G. (1986) Global Optimization: Axiomatics of Statistical Models, Algorithms,
Application. Mokslas, Vilnius, 166 p. (in Russian).
SUBJECT INDEX
criterion
prospectiveness 147
cubic grid 40
cumulative distribution function neutral to the right 179
cyclic coordinate-wise descent 22
decreasing randomness 172
dependent sampling procedures 119
descent algorithm 21
deterministic algorithm 8
direct search algorithms
first order 22
second order 22
discrepancy 39
discrete optimization 233
discrete problems 3
dispersion 38
distribution sampling 219
distribution
extreme value 126
Gibbs 96
quasi-uniform 230
domain convergence 16
domination 168
E-optimal 72
estimate
optimal consistent 247
optimal consistent unbiased 247
estimator
asymptotically optimal linear 127
Evtushenko algorithm 46
experimental design 18
extremal experiment algorithm 285
extreme order statistics 240
extreme value distribution 126
feasible region 1
filled function 27
function
cumulative distribution 179
filled 27
homogeneous 124
tunneling 27
functional class 3
general statement of the optimization problem 4
generalized gradient algorithm 22
Gibbs distribution 96
global minimization method 1
global minimization problem 1
global minimizer 2
global stochastic approximation 99
gradient 4
gradient method 22
grid 35
algorithm 35
composite 40
cubic 40
Hammersley-Halton 41
Lattice 41
Πτ 41
quasirandom 41
random 40
rectangular 40
stratified sample 42
guaranteed accuracy 49
Hammersley-Halton grid 41
heavy ball 31
Hessian 4
Halton sequence 41
homogeneous function 124
inaccuracy 16
index
tail 126
infinite dimensional 3
initial point 284
interval method 66
interval variables 67
Lattice grid 41
level surfaces 118
linear estimator 127
local minimizer 21
local optimization problem 21
local optimization theory 4
Markovian algorithm 93
Markovian property 93
mathematical programming 21
maximization problem 1
method of generations 190
with constant number of particles 204
method
random multi start 174
branch and bound 148
branch and probability bound 148
Branin's 32
candidate points 24
covering 35
interval 66
Metropolis 94
multi-level single linkage 26
nearest neighbour 25
polygonal line 45
Metropolis method 94
mode 118
random search
uniform 77
with learning 22
with uniform trial 22
randomized algorithm 51
rectangular grid 40
regression experiment 308
regression function 284
regression
nonparametric function estimation 308
Renyi representation 241
sample
stratified 156
screening experiment 61
search direction 21,284
search
coordinate-wise 22
random coordinate-wise 22
separable function 6
sequential covering 36
sequentially best algorithm 50
simulated annealing 95
simulation models 18
speed of convergence 16
steepest descent method 22
step length 21,284
stochastic
differential equation 98
global approximation 99
multiextremal approximation 99
stochastic approximation 284
stochastic programming 88
stratified sample 156
stratified sample grid 42
Strongin's algorithm 57
tail index 126,241
tunneling function 27
unbiasedness requirement 247
uniform random covering 83
uniform random search 77
upper bound 239
upper bound random variable 123
variable
upper bound random 123
variable-metric method 22
yearly maximum approach 239