
Theory of Global Random Search

Mathematics and Its Applications (Soviet Series)

Managing Editor:

M. HAZEWINKEL
Centre for Mathematics and Computer Science, Amsterdam, The Netherlands

Editorial Board:

A. A. KIRILLOV, MGU, Moscow, U.S.S.R.
Yu. I. MANIN, Steklov Institute of Mathematics, Moscow, U.S.S.R.
N. N. MOISEEV, Computing Centre, Academy of Sciences, Moscow, U.S.S.R.
S. P. NOVIKOV, Landau Institute of Theoretical Physics, Moscow, U.S.S.R.
M. C. POLYVANOV, Steklov Institute of Mathematics, Moscow, U.S.S.R.
Yu. A. ROZANOV, Steklov Institute of Mathematics, Moscow, U.S.S.R.

Volume 65
Theory of
Global Random Search
by

Anatoly A. Zhigljavsky
Leningrad University, U.S.S.R.

Edited by

J. Pinter

SPRINGER-SCIENCE+BUSINESS MEDIA, B.V.


Library of Congress Cataloging-in-Publication Data

Zhigliavskii, A. A. (Anatolii Aleksandrovich)
[Matematicheskaia teoriia global'nogo sluchainogo poiska. English]
Theory of global random search / by Anatoly A. Zhigljavsky ; edited by J. Pinter.
p. cm. -- (Mathematics and its applications. Soviet series ; v. 65)
Revised translation of: Matematicheskaia teoriia global'nogo sluchainogo poiska.
Includes bibliographical references and index.
ISBN 978-94-010-5519-2 ISBN 978-94-011-3436-1 (eBook)
DOI 10.1007/978-94-011-3436-1
1. Stochastic processes. 2. Search theory. 3. Mathematical optimization.
I. Pinter, J. II. Title. III. Series: Mathematics and its applications (Kluwer Academic Publishers). Soviet series ; v. 65.
QA274.Z4813 1991
519.2--dc20 90-26821

ISBN 978-94-010-5519-2

Printed on acid-free paper

All Rights Reserved

© 1991 Springer Science+Business Media Dordrecht
Originally published by Kluwer Academic Publishers in 1991
Softcover reprint of the hardcover 1st edition 1991
No part of the material protected by this copyright notice may be reproduced or
utilized in any form or by any means, electronic or mechanical,
including photocopying, recording or by any information storage and
retrieval system, without written permission from the copyright owner.
SERIES EDITOR'S PREFACE

'Et moi, ..., si j'avait su comment en revenir, je n'y serais point allé.'
Jules Verne

The series is divergent; therefore we may be able to do something with it.
O. Heaviside

One service mathematics has rendered the human race. It has put common sense back where it belongs, on the topmost shelf next to the dusty canister labelled 'discarded nonsense'.
Eric T. Bell

Mathematics is a tool for thought. A highly necessary tool in a world where both feedback and non-
linearities abound. Similarly, all kinds of parts of mathematics serve as tools for other parts and for
other sciences.
Applying a simple rewriting rule to E. T. Bell's quote above, one finds such statements as: 'One service topology has rendered mathematical physics ...'; 'One service logic has rendered computer science ...'; 'One service category theory has rendered mathematics ...'. All arguably true. And all statements obtainable this way form part of the raison d'être of this series.
This series, Mathematics and Its Applications, started in 1977. Now that over one hundred
volumes have appeared it seems opportune to reexamine its scope. At the time I wrote
"Growing specialization and diversification have brought a host of monographs and
textbooks on increasingly specialized topics. However, the 'tree' of knowledge of
mathematics and related fields does not grow only by putting forth new branches. It
also happens, quite often in fact, that branches which were thought to be completely
disparate are suddenly seen to be related. Further, the kind and level of sophistication
of mathematics applied in various sciences has changed drastically in recent years:
measure theory is used (non-trivially) in regional and theoretical economics; algebraic
geometry interacts with physics; the Minkowski lemma, coding theory and the structure
of water meet one another in packing and covering theory; quantum fields, crystal
defects and mathematical programming profit from homotopy theory; Lie algebras are
relevant to filtering; and prediction and electrical engineering can use Stein spaces. And
in addition to this there are such new emerging subdisciplines as 'experimental
mathematics', 'CFD', 'completely integrable systems', 'chaos, synergetics and large-scale
order', which are almost impossible to fit into the existing classification schemes. They
draw upon widely different sections of mathematics."
By and large, all this still applies today. It is still true that at first sight mathematics seems rather
fragmented and that to find, see, and exploit the deeper underlying interrelations more effort is
needed and so are books that can help mathematicians and scientists do so. Accordingly MIA will
continue to try to make such books available.
If anything, the description I gave in 1977 is now an understatement. To the examples of
interaction areas one should add string theory where Riemann surfaces, algebraic geometry, modu-
lar functions, knots, quantum field theory, Kac-Moody algebras, monstrous moonshine (and more)
all come together. And to the examples of things which can be usefully applied let me add the topic
'finite geometry'; a combination of words which sounds like it might not even exist, let alone be
applicable. And yet it is being applied: to statistics via designs, to radar/sonar detection arrays (via
finite projective planes), and to bus connections of VLSI chips (via difference sets). There seems to
be no part of (so-called pure) mathematics that is not in immediate danger of being applied. And,
accordingly, the applied mathematician needs to be aware of much more. Besides analysis and
numerics, the traditional workhorses, he may need all kinds of combinatorics, algebra, probability,
and so on.
In addition, the applied scientist needs to cope increasingly with the nonlinear world and the
extra mathematical sophistication that this requires. For that is where the rewards are. Linear
models are honest and a bit sad and depressing: proportional efforts and results. It is in the nonlinear world that infinitesimal inputs may result in macroscopic outputs (or vice versa). To appreciate what I am hinting at: if electronics were linear we would have no fun with transistors and computers; we would have no TV; in fact you would not be reading these lines.
There is also no safety in ignoring such outlandish things as nonstandard analysis, superspace
and anticommuting integration, p-adic and ultrametric space. All three have applications in both
electrical engineering and physics. Once, complex numbers were equally outlandish, but they fre-
quently proved the shortest path between 'real' results. Similarly, the first two topics named have
already provided a number of 'wormhole' paths. There is no telling where all this is leading -
fortunately.
Thus the original scope of the series, which for various (sound) reasons now comprises five sub-
series: white (Japan), yellow (China), red (USSR), blue (Eastern Europe), and green (everything
else), still applies. It has been enlarged a bit to include books treating of the tools from one subdis-
cipline which are used in others. Thus the series still aims at books dealing with:
- a central concept which plays an important role in several different mathematical and I or
scientific specialization areas;
- new applications of the results and ideas from one area of scientific endeavour into another;
- influences which the results, problems and concepts of one field of enquiry have, and have had,
on the development of another.

A very large part of mathematics has to do with optimization in one form or another. There are
good theoretical reasons for that because to understand a phenomenon it is usually a good idea to
start with extremal cases, but in addition - if not predominantly - all kinds of optimization prob-
lems come directly from practical situations: How to pack a maximal number of spare parts in a
box? How to operate a hydroelectric plant optimally? How to travel most economically from one
place to another? How to minimize fuel consumption of an aeroplane? How to assign departure
gates in an airport optimally? etc., etc. This is perhaps also the area of mathematics which is most
visibly applicable. And it is, in fact, astonishing how much can be earned (or saved) by a mathemat-
ical analysis in many cases.
In complicated situations - and many practical ones are very complicated - there tend to be
many local extrema. Finding such a local extremum is a basically well-understood affair. It is a totally different
matter to find a global extremum. The first part of this book surveys and analyses the known
methods for doing this. The second and main part is concerned with the powerful technique of ran-
dom search methods for global extrema. This phrase describes a group of methods that have many
advantages - whence their popularity - such as simple implementation (also on parallel processor
machines) and stability (both with respect to perturbations and uncertainties) and some disadvan-
tages: principally relatively low speed and not nearly enough theoretical background results. In this
last direction the author has made fundamental and wide ranging contributions. Many of these
appear here for the first time in a larger integrated context.
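To fix ideas, the simplest member of this group, pure (uniform) random search, can be sketched in a few lines of Python. This is only an illustration of the general idea, not an algorithm reproduced from the book; the Rastrigin test function and all parameter choices below are assumptions made for the example.

```python
import math
import random

def pure_random_search(f, bounds, n_samples=10_000, seed=0):
    """Minimize f over a box by uniform random sampling:
    draw points uniformly at random and keep the best value seen."""
    rng = random.Random(seed)
    best_x, best_val = None, float("inf")
    for _ in range(n_samples):
        x = [rng.uniform(lo, hi) for lo, hi in bounds]
        val = f(x)
        if val < best_val:
            best_x, best_val = x, val
    return best_x, best_val

# An illustrative multiextremal test function (not from the book):
# the Rastrigin function has many local minima and global minimum 0 at the origin.
def rastrigin(x):
    return 10 * len(x) + sum(xi * xi - 10 * math.cos(2 * math.pi * xi) for xi in x)

best_x, best_val = pure_random_search(rastrigin, [(-5.12, 5.12)] * 2, n_samples=50_000)
```

Because every point is drawn independently, the record value converges to the global minimum with probability one, but typically slowly; the more sophisticated methods studied in this book refine this basic scheme.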
The book addresses itself both to practitioners who want to use and implement random search
methods (and it explains when it may be wise to consider these methods), and to specialists
who need an up-to-date, authoritative survey of the field.
The shortest path between two truths in the real domain passes through the complex domain.
J. Hadamard

Never lend books, for no one ever returns them; the only books I have in my library are books that other folk have lent me.
Anatole France

La physique ne nous donne pas seulement l'occasion de résoudre des problèmes ... elle nous fait pressentir la solution.
H. Poincaré

The function of an expert is not to be more right than other people, but to be wrong for more sophisticated reasons.
David Butler

Bussum, 18 February 1991 Michiel Hazewinkel


CONTENTS

FOREWORD ................................................................................. xiii

PREFACE ..................................................................................... xv

LIST OF BASIC NOTATIONS ............................................................ xvii

PART 1. GLOBAL OPTIMIZATION: AN OVERVIEW .............................. 1

CHAPTER 1. GLOBAL OPTIMIZATION THEORY: GENERAL CONCEPTS ..... 1

1.1. Statements of the global optimization problem ............................... 1
1.2. Types of prior information about the objective function and a
classification of methods ......................................................... 5
1.2.1. Types of prior information .............................................. 5
1.2.2. Classification of principal approaches and methods of
global optimization ....................................................... 8
1.2.3. General properties of multiextremal functions ........................ 9
1.3. Comparison and practical use of global optimization
algorithms ......................................................................... 15
1.3.1. Numerical comparison ................................................. 15
1.3.2. Theoretical comparison criteria ........................................ 16
1.3.3. Practical optimization problems ....................................... 18

CHAPTER 2. GLOBAL OPTIMIZATION METHODS ................................ 20

2.1. Global optimization algorithms based on the use of local
search techniques ................................................................ 20
2.1.1. Local optimization algorithms ......................................... 21
2.1.2. Use of local algorithms in constructing global
optimization strategies .................................................. 23
2.1.3. Multistart ................................................................. 24
2.1.4. Tunneling algorithms ................................................... 26
2.1.5. Methods of transition from one local minimizer
into another .............................................................. 30
2.1.6. Algorithms based on smoothing the objective function ............ 34
2.2. Set covering methods ........................................................... 35
2.2.1. Grid algorithms (passive coverings) .................................. 36
2.2.2. Sequential covering methods .......................................... 43
2.2.3. Optimality of global minimization algorithms ....................... 48
2.3. One-dimensional optimization, reduction and partition
techniques ......................................................................... 56
2.3.1. One-dimensional global optimization ................................. 56
2.3.2. Dimension reduction in multiextremal problems .................... 58
2.3.3. Reducing global optimization to other problems in
computational mathematics ............................................ 61
2.3.4. Branch and bound methods ............................................ 64
2.4. An approach based on stochastic and axiomatic models
of the objective function ........................................................ 70
2.4.1. Stochastic models ....................................................... 70
2.4.2. Global optimization methods based on stochastic models ......... 71
2.4.3. The Wiener process case ............................................... 72
2.4.4. Axiomatic approach ..................................................... 75
2.4.5. Information-statistical approach ....................................... 76

PART 2. GLOBAL RANDOM SEARCH ......................................... 77

CHAPTER 3. MAIN CONCEPTS AND APPROACHES OF GLOBAL
RANDOM SEARCH .................................................. 77

3.1. Construction of global random search algorithms:
Basic approaches ................................................................ 77
3.1.1. Uniform random sampling ............................................. 77
3.1.2. General (nonuniform) random sampling ............................. 80
3.1.3. Ways of improving the efficiency of random
sampling algorithms ..................................................... 81
3.1.4. Random coverings ...................................................... 83
3.1.5. Formal scheme of global random search ............................. 85
3.1.6. Local behaviour of global random search algorithms ............... 86
3.2. General results on the convergence of global random
search algorithms ................................................................ 88
3.3. Markovian algorithms ........................................................... 93
3.3.1. General scheme of Markovian algorithms ............................ 93
3.3.2. Simulated annealing ..................................................... 94
3.3.3. Methods based on solving stochastic differential equations ....... 98
3.3.4. Global stochastic approximation: Zielinski's method ............... 99
3.3.5. Convergence rate of Baba's algorithm ............................... 100
3.3.6. The case of high dimension ........................................... 105
CHAPTER 4. STATISTICAL INFERENCE IN GLOBAL RANDOM SEARCH . 114

4.1. Some ways of applying statistical procedures to construct global
random search algorithms ..................................................... 114
4.1.1. Regression analysis and design ...................................... 115
4.1.2. Cluster analysis and pattern recognition ............................ 116
4.1.3. Estimation of the cumulative distribution function, its density,
mode and level surfaces ............................................... 116
4.1.4. Statistical modelling (Monte Carlo method) ........................ 119
4.1.5. Design of experiments ................................................ 121
4.2. Statistical inference concerning the maximum of a function .............. 122
4.2.1. Statement of the problem and a survey of the approaches
for its solution .......................................................... 122
4.2.2. Statistical inference construction for estimating M ................ 125
4.2.3. Statistical inference for M, when the value of the tail index
α is known .............................................................. 127
4.2.4. Statistical inference, when the value of the tail index
α is unknown ........................................................... 133
4.2.5. Estimation of F(t) ..................................................... 135
4.2.6. Prior determination of the value of the tail index α ................ 137
4.2.7. Exponential complexity of the uniform random sampling
algorithm ................................................................ 145
4.3. Branch and probability bound methods ..................................... 147
4.3.1. Prospectiveness criteria ............................................... 147
4.3.2. The essence of branch and bound procedures ...................... 148
4.3.3. Principal construction of branch and probability bound
methods ................................................................. 148
4.3.4. Typical variants of the branch and probability bound method ... 150
4.4. Stratified sampling ............................................................. 156
4.4.1. Organization of stratified sampling .................................. 156
4.4.2. Statistical inference for the maximum of a function based
on its values at the points of a stratified sample ................... 158
4.4.3. Dominance of stratified over independent sampling ............... 167
4.5. Statistical inference in random multistart ................................... 174
4.5.1. Problem statement ..................................................... 174
4.5.2. Bounded number of local maximizers ............................... 175
4.5.3. Bayesian approach ..................................................... 176
4.6. An approach based on distributions neutral to the right ................... 179
4.6.1. Random distributions neutral to the right and their properties ... 179
4.6.2. Bayesian testing about quantiles of random distributions ......... 182
4.6.3. Application of distributions neutral to the right to construct
global random search algorithms .................................... 182

CHAPTER 5. METHODS OF GENERATIONS ........................................ 186

5.1. Description of algorithms and formulation of the basic
probabilistic model ............................................................. 186
5.1.1. Algorithms ............................................................. 186
5.1.2. The basic probabilistic model ........................................ 190
5.2. Convergence of probability measure sequences generated
by the basic model ............................................................. 192
5.2.1. Assumptions ........................................................... 192
5.2.2. Auxiliary statements .................................................. 195
5.2.3. Convergence of the sequences (5.2.7) and (5.2.8)
to e*(dx) ................................................................ 200
5.3. Methods of generations for eigen-measure functional estimation
of linear integral operators .................................................... 203
5.3.1. Eigen-measures of linear integral operators ........................ 203
5.3.2. Closeness of eigen-measures to e*(dx) ............................. 205
5.3.3. Description of the generation methods .............................. 207
5.3.4. Convergence and rate of convergence of the generation
methods ................................................................. 209
5.4. Sequential analogues of the methods of generations ....................... 215
5.4.1. Functionals of eigen-measures ....................................... 215
5.4.2. Sequential maximization algorithms ................................. 217
5.4.3. Narrowing the search area ........................................... 218
CHAPTER 6. RANDOM SEARCH ALGORITHMS FOR SOLVING
SPECIFIC PROBLEMS ............................................. 219

6.1. Distribution sampling in random search algorithms for solving
constrained optimization problems ........................................... 219
6.1.1. Basic concepts ......................................................... 219
6.1.2. Properties of D(x) ..................................................... 220
6.1.3. General remarks on sampling ........................................ 222
6.1.4. Manifold defined by linear constraints .............................. 223
6.1.5. Uniform distribution on an ellipsoid ................................ 223
6.1.6. Sampling on a hyperboloid ........................................... 225
6.1.7. Sampling on a paraboloid ............................................ 227
6.1.8. Sampling on a cone ................................................... 227
6.2. Random search algorithm construction for optimization in
functional spaces, in discrete and in multicriterial problems .............. 229
6.2.1. Optimization in functional spaces ................................... 229
6.2.2. Random search in multicriterial optimization problems ........... 231
6.2.3. Discrete optimization ................................................. 233
6.2.4. Relative efficiency of discrete random search ...................... 235
PART 3. AUXILIARY RESULTS ................................................. 239

CHAPTER 7. STATISTICAL INFERENCE FOR THE BOUNDS OF
RANDOM VARIABLES ............................................ 239

7.1. Statistical inference when the tail index of the extreme value
distribution is known .......................................................... 239
7.1.1. Motivation and problem statement .................................. 239
7.1.2. Auxiliary statements .................................................. 241
7.1.3. Estimation of M ....................................................... 246
7.1.4. Confidence intervals for M ........................................... 250
7.1.5. Testing statistical hypotheses about M .............................. 254
7.2. Statistical inference when the tail index is unknown ...................... 259
7.2.1. Statistical inference for M ............................................ 259
7.2.2. Estimation of α ........................................................ 262
7.2.3. Construction of confidence intervals and statistical hypothesis
test for α ................................................................ 265
7.3. Asymptotic properties of optimal linear estimates ......................... 268
7.3.1. Results and consequences ............................................ 268
7.3.2. Auxiliary statements and proofs of Theorem 7.3.2 and
Proposition 7.1.3 ....................................................... 269
7.3.3. Proof of Theorem 7.3.1 ............................................... 276

CHAPTER 8. SEVERAL PROBLEMS CONNECTED WITH
GLOBAL RANDOM SEARCH .................................... 284

8.1. Optimal design in extremal experiments .................................... 284
8.1.1. Extremal experiment design .......................................... 284
8.1.2. Optimal selection of the search direction ........................... 285
8.1.3. Experimental design applying the search direction (8.1.15) ..... 289
8.2. Optimal simultaneous estimation of several integrals by the Monte
Carlo method .................................................................... 292
8.2.1. Problem statement ..................................................... 292
8.2.2. Assumptions ........................................................... 296
8.2.3. Existence and uniqueness of optimal densities ..................... 297
8.2.4. Necessary and sufficient optimality conditions .................... 300
8.2.5. Construction and structure of optimal densities .................... 302
8.2.6. Structure of optimal densities for nondifferentiable criteria ...... 303
8.2.7. Connection with the regression design theory ..................... 306
8.3. Projection estimation of multivariate regression ........................... 308
8.3.1. Problem statement ..................................................... 308
8.3.2. Bias and random inaccuracies of nonparametric estimates ....... 309
8.3.3. Examples of asymptotically optimal projection procedures
with deterministic designs ............................................. 314
8.3.4. Projection estimation via evaluations at random points ........... 315

REFERENCES ........................................................................... 321

SUBJECT INDEX ........................................................................ 337


FOREWORD

The investigation of any computational method in the exact form in which it is realized on a
particular computer is extremely complicated. As a rule, a simplified model of the method
(neglecting, in particular, the effects of rounding and the errors of some approximations) is
studied instead. Stochastic approaches, based on a probabilistic analysis of
computational processes, are frequently very efficient. They are natural, for instance, for
investigating high-dimensional problems, where deterministic solution techniques are
often inefficient. Among others, the global optimization problem can be cited as an
example of a problem where the probabilistic approach proves very fruitful.
The English version of the book by Professor Anatoly A. Zhigljavsky offered to the
reader is devoted primarily to the development and study of probabilistic global
optimization algorithms, in which the evolution of the probability distributions corresponding
to the computational process is studied. It seems to be the first time in the literature that
rigorous results justifying and optimizing a wide range of global random search
algorithms are treated in a unified manner. A thorough survey and analysis of the results
of other authors, together with the clarity of the presentation, are great merits of the
book.
A. Zhigljavsky is a representative of the Leningrad school of theoretical
probability studying Monte Carlo methods and their applications. Despite his
youth, he has also taken part in writing well-known monographs on experimental
design theory and on the detection of abrupt changes in random processes.
Certainly, the book will be interesting and useful to many
mathematicians dealing with optimization theory, as well as to users employing optimization
methods for solving various applied problems.

Professor Sergei M. Ermakov


Leningrad University
USSR
PREFACE

Optimization is one of the most important fields in contemporary computational and
applied mathematics: significant efforts of theoreticians as well as practitioners are
devoted to its investigation.
If discrete and multicriteria optimization problems are set aside, then optimization
theory can be divided into local and global optimization. The former may be regarded as
almost complete. The latter, on the contrary, is now at the height of its development,
as the dynamics and contents of the related publications confirm.
Part 1 of the present book attempts to elucidate the current state of the art in global
optimization theory. The author admits that it is impossible to describe all known solution
approaches, but believes that the present review is rather complete and may be of
interest, since most recent books (viz., Dixon and Szegö, eds. (1975, 1978), Strongin
(1978), Evtushenko (1985), Fedorov, ed. (1985), Zhigljavsky (1985), Zilinskas (1986),
Pardalos and Rosen (1987), Törn and Zilinskas (1989)) as well as surveys (e.g. Sukharev
(1981), Archetti and Schoen (1984), Rinnooy Kan and Timmer (1985)) are devoted
mainly to particular directions in global optimization.
The variety of global optimization problem statements is rather great: it has been
justified heuristically, numerically, and theoretically that different methods may be suitable
for different classes of problems. Global random search methods occupy a special place
among them, as sometimes they offer the only way of solving complicated problems.
A random choice of solutions is traditional under uncertainty: recall everyday situations
in which a choice was settled by tossing a coin. Applying random decisions in the
construction of global optimization algorithms leads to global random search algorithms, which are
popular among users as well as theoreticians. Part 2 of the book thoroughly describes and
investigates such algorithms.
Apparently, the main reasons for the popularity of global random search methods among
users are the following attractive features, which many of them possess: the
structure of these methods is rather simple and they can easily be realized as subroutines
(on multiprocessor computers as well); they are rather insensitive to irregularity of the
objective function behaviour and of the feasible region structure, as well as to the presence of
noise in the objective function evaluations and to the growth of dimensionality.
Besides, it is easy to construct methods that guarantee global convergence, or to realize
various adaptation ideas. As for theoreticians, global random search methods represent
a rich and not overly complicated subject for investigation, with natural openings for
statistical concepts and procedures.
Of course, global random search methods also have their drawbacks. First, their
standard convergence rate is slow and can generally be increased only if
the probability of failing to reach the extremum is allowed to increase. Second,
practically efficient methods usually involve heuristic elements and thus carry the
corresponding disadvantages. Third, their efficiency can often be increased by
increasing their complexity and decreasing their randomness. Thus, the use of global random
search is beneficial primarily when the optimization problem is complicated
enough but the objective function evaluations are not very time-consuming.
The global random search theory is connected (occasionally, rather intensively) with
several other branches of mathematical statistics and computational mathematics: some of
them are considered in Part 3 of the book.
The author's own research results compose nearly half of the mathematical contents of
the book: they are accompanied by proofs unlike the results of other authors whose proofs
can be found in the corresponding references.

Many sections of the book can be read almost independently. A large part of its
exposition is intended not only for theoreticians, but also for users of the optimization
algorithms.
The Russian variant of the book published in 1985 was considerably revised. Thus,
Chapters 1-3 and Section 4.4 were completely rewritten; various recent results of other
authors were reconsidered. On the other hand, the exposition of a number of original
results of minor importance for global random search theory was reduced.
I am indebted to Professor S.M. Ermakov for his valuable influence on my scientific interests. I am grateful to Professor A. Zilinskas for his help in reviewing the current state of global optimization theory and to Dr. Pinter for a careful reading of the manuscript that led to its substantial improvement. I also wish to thank Professor G.A. Mikhailov and Professor F. Pukelsheim for many helpful discussions on the subject of Section 8.2, and E. P. Andreeva for her help in translation and for the careful typing of the manuscript.

Anatoly A. Zhigljavsky
Leningrad University
USSR
LIST OF BASIC NOTATIONS

R^n is the n-dimensional real Euclidean space; R = R^1
X is the feasible region, usually a compact subset of R^n of nonzero Lebesgue measure, having a sufficiently simple structure (X is considered as a measurable space)
B is the σ-algebra of Borel subsets of X
ρ is a metric on X
μ_n is the Lebesgue measure on R^n or on X ⊂ R^n
f is the objective function given on X
F is the set of possible objective functions f
Lip(X,L) is the set of functions f on X satisfying the Lipschitz condition

    |f(x) − f(z)| ≤ L ‖x − z‖   for all x, z ∈ X,

where ‖·‖ is the Euclidean norm
Lip(X,L,ρ) is the set of functions f satisfying

    |f(x) − f(z)| ≤ L ρ(x,z)   for all x, z ∈ X

min f(x) is a short-hand notation for min_{x∈X} f(x); the same concerns the operations inf, max, sup
∫ g(x)dx is a short-hand notation for ∫_X g(x)dx
z* = arg min_{z∈Z} g(z) is an arbitrary global minimizer of a function g on the set Z, i.e. a point z* ∈ Z such that g(z*) = min_{z∈Z} g(z); the same concerns the operation max
x* = arg min f(x) or x* = arg max f(x) (depending on the context)
B(x,ε,ρ) = {z ∈ X : ρ(x,z) ≤ ε}
B(x,ε) = {z ∈ X : ‖x − z‖ ≤ ε}
B(ε) = B(x*, ε)
A(ε) = {x ∈ X : |f(x*) − f(x)| ≤ ε}
1_A(x) = 1{x ∈ A} is the indicator of a set A (1{x ∈ A} is the indicator of the event {x ∈ A}), i.e. 1_A(x) = 1 if x ∈ A and 1_A(x) = 0 if x ∉ A
C_i^j = i!/(j!(i − j)!) for i ≥ j
Γ(u) is the gamma function: Γ(u) = ∫_0^∞ t^{u−1} e^{−t} dt
⌈a⌉ is the smallest integer not less than a
vrai sup η is the essential supremum of a random variable η, i.e. vrai sup η = inf{a : Pr{η ≤ a} = 1}
the notation a_N ~ b_N (N → ∞) expresses the fact that the limit lim_{N→∞} a_N/b_N exists and equals 1 (the relation ~ is called asymptotic equivalence)
the notation a_N ≍ b_N (N → ∞) expresses the fact that 0 < lim inf_{N→∞} a_N/b_N ≤ lim sup_{N→∞} a_N/b_N < ∞ (the relation ≍ is called weak equivalence)
∅ is the empty set
I_m is the unit matrix of order m × m
an independent (or repeated) sample is a sample consisting of independent realizations of a random variable (vector)
random variables (vectors) and their realizations are denoted by the same symbol.
PART 1. GLOBAL OPTIMIZATION: AN OVERVIEW

CHAPTER 1. GLOBAL OPTIMIZATION THEORY: GENERAL CONCEPTS

This chapter is of an introductory character: it considers various statements of the global optimization problem, the most commonly used types of prior information concerning the objective function and the feasible region, the main solution approaches, several classes of practical problems, and the algorithm comparison dilemma.

1.1 Statements of the global optimization problem

The most common form of the global optimization problem is the following. Let X be a set called the feasible region and f: X → R^1 be a given function called the objective function; it is then required to approximate the value

    f* = inf_{x∈X} f(x)                              (1.1.1)

and usually also a point in X at which f takes a value close to f*.


This problem is called the global minimization problem. The maximization problem, in which the value

    M = sup_{x∈X} f(x)

is to be approximated, can be treated analogously and can obviously be derived from the minimization problem by substituting −f for f.
A procedure of constructing a sequence {x_k} of points in X converging to a point at which the objective function value equals or approximates the value f* is called a global minimization method (or algorithm). The types of convergence may differ, e.g. convergence with respect to the values of f or convergence with some probability. As the amount of computational effort is always restricted in practice, only a finite subsequence of {x_k} can be generated. In constructing it, one usually tries to reach a desirable (or optimal) accuracy, spending the smallest possible (or a bounded) computational effort.
Depending on the statement of the optimization problem (1.1.1), prior information on f and X, as well as the values of f (and perhaps its derivatives) at previous points of the sequence (and occasionally also at some auxiliary points), may be used for constructing the search point sequence {x_k}. Sometimes it is supposed that f is a regression function: in this case, the evaluations of f may be subject to random noise.

As a rule, the global optimization problem is stated in such a way that a point x* ∈ X exists at which the minimal value f* is attained. We shall call an arbitrary point x* ∈ X with the property f(x*) = f* a global minimizer and denote it by

    x* = arg min_{x∈X} f(x)                          (1.1.2)

or, more simply, by x* = arg min f. In general, this point is not necessarily unique.
As a rule, the initial minimization problem is stated as the problem of approximating a point x* and the value f*. Sometimes only the value f* is required, but a point x* is not: naturally, such problems are somewhat simpler.
Approximation of a point x* = arg min f and the value f* is usually interpreted as finding a point in the set

    A(δ) = {x ∈ X : |f(x) − f(x*)| ≤ δ}              (1.1.3)

or in the set

    B(ε) = B(x*, ε, ρ) = {x ∈ X : ρ(x, x*) ≤ ε}      (1.1.4)

where ρ is a given metric on X, and δ and ε determine the accuracy with respect to function (or argument) values.
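To make these accuracy notions concrete, the following minimal sketch approximates x* and f* on X = [0,1]^n by keeping the record value over uniform random samples, and then checks membership in A(δ). It is an illustration only; the test function, the sample size, and δ are arbitrary choices made for the example.

```python
import random

def uniform_random_search(f, n, num_evals, seed=1):
    """Keep the record (best) value of f over uniform random points
    in the unit cube X = [0, 1]^n; returns an approximation of (x*, f*)."""
    rng = random.Random(seed)
    best_x, best_f = None, float("inf")
    for _ in range(num_evals):
        x = [rng.random() for _ in range(n)]
        fx = f(x)
        if fx < best_f:
            best_x, best_f = x, fx
    return best_x, best_f

# Simple test function with known minimizer x* = (0.3, 0.7) and f* = 0
f = lambda x: (x[0] - 0.3) ** 2 + (x[1] - 0.7) ** 2

x_hat, f_hat = uniform_random_search(f, n=2, num_evals=10000)
# Is x_hat in the accuracy set A(delta) for delta = 0.01?
in_A_delta = (f_hat - 0.0) <= 0.01
```

With 10000 uniform samples in the plane, the record point lands in A(0.01) with overwhelming probability; the slow decrease of this accuracy with the dimension is one of the drawbacks of pure random sampling discussed later in the book.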
The complexity of the optimization problem is determined mainly by the properties of the feasible region and the objective function. There is a duality concerning the properties of X and f. If explicit forms of f and X are known and f is complicated, then the optimization problem can be reformulated in such a way that the objective function transforms into a simple one (for instance, a linear one) but the feasible region becomes complicated. The opposite type of reformulation is possible, too. Usually, global optimization problems with relatively simple-structured sets X are considered (as they are, in general, easier to solve even if the objective function is complicated).
Unlike local optimization problems, a global one cannot, in general, be solved if X is not bounded. The boundaries of X correspond to the prior information concerning the location of x*. The wider the boundaries of X are, the larger is the uncertainty in the location of x* (i.e. the more complicated is the problem). Supposing the closedness of X and the continuity of f in a neighbourhood of a global minimizer, we ensure that x* ∈ X, i.e. the global minimum of f is attained in X.
Typically, X can be endowed with a metric ρ, but usually there are many different metrics and there is no natural way of choosing a representative amongst them. The ambiguity of the metric choice is connected, for instance, with the scaling of the variables. The properties of the selected metric influence the features of many optimization algorithms: therefore its selection must be performed carefully. In the case X ⊂ R^n it is supposed that ρ is the Euclidean metric unless otherwise stated.

If X ⊂ R^n, then the dimensionality n of X determines the complexity of an extremal problem to a large extent. One-dimensional problems (the case X ⊂ R) are thoroughly investigated (see Section 2.3.1); multidimensional problems (X ⊂ R^n, n > 1) are, however, of the main interest. Optimization problems in infinite-dimensional (functional) spaces are usually reduced to finite-dimensional problems (see Section 6.2.1). Discrete problems, in which X is a discrete set, are rather specific: special methods are needed to solve some classes of such problems. Nevertheless, general discrete optimization methods are constructed in the same way as the commonly used finite-dimensional ones (see Section 6.2.3). To omit superfluous technical details, we shall suppose in Chapters 1-4 that X is finite-dimensional and compact.
As was mentioned above, the structure of X is assumed to be relatively simple. Different algorithms require various structural simplicity features of X. In a considerable number of algorithms (e.g. see Section 2.2) X is supposed to be the unit cube

    X = [0, 1]^n.                                    (1.1.5)

Some of these algorithms may be reformulated for a wider class of sets. But even in the case of a hyperrectangle

    X = {x ∈ R^n : a_i ≤ x_i ≤ b_i, i = 1, …, n}     (1.1.6)

such a reformulation may be ambiguous and may require a lot of care (as the metric on the cube differs from the metric induced by the corresponding metric on a hyperrectangle after transforming it into the cube). There are classes of algorithms (in particular, random search algorithms) which require only weak structural restrictions on the set X. Below, either the type of X is explicitly indicated or its structure is supposed to be simple enough. In many cases it suffices that X is compact, connected, and is the closure of its interior (the last property guarantees that μ_n(X) > 0 and μ_n(B(x,ε)) > 0 for all x ∈ X, ε > 0).
Sometimes the set X is defined by constraints and has a complicated structure. In such cases the initial optimization problem is usually reduced to a problem on a set of simple structure by means of standard techniques worked out in local optimization theory (namely, penalty function, projection, convex approximation and conditional direction techniques). Special algorithms for such problems can also be created, see Sections 6.1 and 2.3.4.
The degree of complexity of the main optimization problem is partially determined by the properties of the objective function. Typically, one should select a functional class F to which f belongs before selecting or constructing the optimization algorithm. In practice, F is determined by prior information concerning f. In theory, the setting of F corresponds to the choice of a model of f. The wider F is, the wider is the class of allowable practical problems and the less efficient are the corresponding algorithms.
The widest reasonable functional class F is the class of all measurable functions. It is too wide, however, and thus unsuitable for modelling global optimization problems. In a sense, the same is true for the class F = C(X) of all continuous functions (and for the classes of continuously differentiable functions C1(X), C2(X), etc. as well): this follows from the existence of two continuous (or differentiable) functions whose values coincide at any fixed collection of points but whose minima may differ by an arbitrary magnitude. On the other hand, the class of uniextremal functions is too narrow because the corresponding extremal problems can be successfully treated within the frames of local optimization theory.
Unlike in local optimization, the efficiency of many global optimization algorithms is not much influenced by the possibility and computational demand of evaluating/estimating the gradient ∇f or the Hessian ∇²f of f, since the main aim of a global optimization strategy is to find out the global features of f (while smoothness characterizes the local features only).
Naturally, the computational demand of evaluating the derivatives of f influences the efficiency of a global optimization strategy only insofar as a local descent routine is a part of the strategy, see Section 2.1.
The computational demand of evaluating f is of great significance in constructing global optimization algorithms. If this demand is high, then it is worthwhile to apply various sophisticated procedures at each iteration for receiving and using information about f. If this demand is small or moderate, then simplicity of programming and the absence of complicated auxiliary computations are characteristics of great importance.
Global optimization problems in the presence of random noise in the evaluations of f are even more complicated. Only a small portion of the known global optimization algorithms may be applied in this case. Such algorithms will be pointed out individually; elsewhere it will usually be supposed that f is evaluated without noise.
The selection of a global optimization algorithm must also be based on the desired accuracy of the solution. If the required accuracy is high, then a local descent routine has to be used in the last stage of a global optimization algorithm, see Section 2.1.1. There exist methods (see Section 2.2) in which the accuracy is an algorithm parameter. It should be noted in this connection that, under a fixed structure of an algorithm, the selection of its parameters must be influenced by the desired accuracy. For example, Section 3.3 describes an algorithm whose order of convergence rate is improved twofold if the parameters of the algorithm are chosen depending on the accuracy.
Concluding this section, let us formulate a general statement of the optimization problem which will be assumed automatically unless opposite or additional details are inserted. Let the compact set X ⊂ R^n (n ≥ 1) have a sufficiently simple structure; furthermore, let a bounded (from below) function f: X → R^1 belonging to a certain functional class F be given. It is then required to approximate (perhaps, with a given accuracy) a point x* and the value f* by using a finite number of evaluations of f (without noise).

1.2 Types of prior information about the objective function and a classification of methods

This section describes various types of prior information concerning the objective function used in the construction and investigation of global optimization methods, as well as a classification of the principal methods and approaches in global optimization, represented in a tabular form.

1.2.1 Types of prior information

The role of prior information - concerning the objective function f and the feasible region X ⊂ R^n - in choosing a global optimization algorithm is hard to overestimate. Having more information, more efficient algorithms can be constructed. Section 1.1 described the information that is considered as always available. But it does not suffice to construct an algorithm solving an extremal problem with a given accuracy over a finite time. Therefore it is worth taking into account the more specific properties of the objective function while constructing global optimization algorithms.
There exist various types of prior information about f determining the functional class F. Usually F is determined by some conditions on f: a number of typical cases are listed below.

a) F ⊂ C(X).

a') F ⊂ C1(X).

a'') F ⊂ C1(X) and the gradient ∇f of f ∈ F can be evaluated.

a''') F ⊂ C2(X).

a^IV) F ⊂ C2(X) and the Hessian ∇²f of f ∈ F can be evaluated.

b) F ⊂ Lip(X,L,ρ) where L is a constant and ρ is a metric on X.

b') F ⊂ Lip(X,L) where L is a constant.

c) F ⊂ {f ∈ C1(X): ∇f ∈ Lip(X,L)} for some constant L.

c') F ⊂ {f ∈ C2(X): ‖∇²f‖ ≤ M} for some constant M.

d) F ⊂ {f(x,θ), θ ∈ Θ} where Θ is a finite dimensional set: that is, f is given in a parametric form.

d') f is a polynomial of degree not higher than p.

d'') f is a (not positive definite) quadratic function.

e) f is a rational function.

e') f is given in an algebraic form.

f) F is a set of functions which can be approximated by functions from a certain class F_0.

g) f is concave.

g') f = f_1 − f_2 where f_1 and f_2 are concave.

h) f has exactly ℓ local minimizers.

h') f has not more than ℓ local minimizers.

i) f = f_1 + f_2 where f_1 is a uniextremal function and ‖f_2‖ is much less than ‖f_1‖.

i') f = f_1 + f_2 where f_1 depends on no more than s < n variables and ‖f_2‖ is much less than ‖f_1‖.

i'') f = f_1 + f_2 where f_1 is linear and ‖f_2‖ is much less than ‖f_1‖.

j) f is a separable function, i.e.

    f(x) = f_1(x_1) + … + f_n(x_n).                  (1.2.1)

j') For appropriately defined two-dimensional functions f_2, …, f_n the representation (1.2.2) is valid.

k) F = {f(x,ω), ω ∈ Ω} where (Ω, Ξ, P) is an underlying probability space, i.e. F is the set of realizations of a random process or field.

k') The prior distribution of a global minimizer x* is given.

l) F is axiomatically given, i.e. it is supposed that the objective function satisfies some axioms.

m) The measure of the domains of attraction of all local minimizers is not less than a fixed value γ > 0.

m') The measure of the domain of attraction of a global minimizer is not less than a fixed value γ > 0.

m'') μ_n{A(δ)}/μ_n{X} ≥ γ where δ and γ are given positive values, μ_n is the Lebesgue measure, and A(δ) is defined by (1.1.3).

n) The global minimizer x* of f is unique.

n') There exists δ_0 > 0 such that for any δ ∈ (0, δ_0) the set A(δ) is connected and μ_n{A(δ)} > 0.

o) The smoothness conditions of the type a) - c) are fulfilled in a vicinity of the global minimizer.

p) There exist positive numbers ε, β, c_1, c_2 such that the inequalities

    c_1 ρ^β(x, x*) ≤ f(x) − f(x*) ≤ c_2 ρ^β(x, x*)

are valid for all x ∈ B(ε).

p') There is a homogeneous function H(z) of degree β > 0 such that

    f(x) − f(x*) = H(x − x*) + o(‖x − x*‖^β)

for ‖x − x*‖ → 0.

q) f is evaluated with random noise.

Let us comment shortly on these conditions.
The most familiar condition is b). Optimal and nearly optimal algorithms (in a well-defined sense) are known for the case when b) holds. An unavoidable drawback of the algorithms corresponding to this case is that the choice of the metric ρ may be quite arbitrary; further, the exact (smallest) Lipschitz constant L is typically unknown, while the numerical characteristics of the algorithms are greatly influenced by the choice of ρ and L, see Section 2.2.
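To illustrate why the constant L matters under condition b'): from evaluations f(x_1), …, f(x_N) of an L-Lipschitz function one obtains the pointwise bound f(x) ≥ max_i {f(x_i) − L‖x − x_i‖}, on which covering-type methods rely. A one-dimensional sketch follows; the test function, the evaluation points and the grid resolution are arbitrary choices made for the example.

```python
def lipschitz_lower_bound(points, values, L):
    """Best provable lower bound on min f over a fine grid of [0, 1],
    given evaluations (points[i], values[i]) of an L-Lipschitz f."""
    grid = [k / 1000 for k in range(1001)]
    # At each x, f(x) >= max_i (values[i] - L * |x - points[i]|)
    return min(max(v - L * abs(x - p) for p, v in zip(points, values))
               for x in grid)

# Example: f(x) = |x - 0.4| is Lipschitz on [0, 1] with constant L = 1
points = [0.0, 0.25, 0.5, 0.75, 1.0]
values = [abs(p - 0.4) for p in points]
bound = lipschitz_lower_bound(points, values, L=1.0)
# bound never exceeds the true minimum (here 0); overstating L only
# loosens the bound, while understating L can make it invalid
```

This dependence on L is exactly the drawback noted above: the bound is only as trustworthy as the Lipschitz constant fed into it.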
Assumptions a) - j) are often supposed to hold when constructing and investigating deterministic global optimization algorithms, which do not contain random elements. These algorithms are shortly described in Sections 2.1, 2.2, and 2.3.
No fewer suppositions are encountered in constructing and investigating probabilistic algorithms, based on diverse statistical models of the objective function or of the search procedure. Probabilistic models of f ∈ F are the basis of many Bayesian, information-statistical and axiomatically constructed algorithms, described mainly in Section 2.4 and using the conditions b), f), k) - l) and some others. Probabilistic models of the search process correspond to global random search algorithms: their investigation may be closely related to conditions m) - q), k'), d), f), h), h') and some others.
Information concerning these suppositions and the cases of using them is contained in Table 1 of the next section and in the following chapters of the book.

1.2.2 Classification of principal approaches and methods of global optimization

This section contains a table classifying the principal approaches and methods of global optimization, as well as some additional explanation.
Certainly, it is impossible to enumerate all global optimization methods in an unambiguous fashion: Table 1 provides one possible classification. Not all the notations used are common: alternative terms for these and related methods may be found in the corresponding sections and references.
The first two columns of Table 1 give the names of approaches and methods (or groups of methods). The first five approaches mostly include deterministic methods. The family of probabilistic methods includes the methods of the last two approaches, together with methods based on smoothing the objective function and on screening of variables, random direction methods, many commonly used versions of multistart and the candidate points method, as well as some versions of a number of other methods. Let us note again that the classification of approaches and methods in Table 1 is somewhat arbitrary, since some methods (in particular, random covering, random multistart, polygonal line methods and the method based on smoothing the objective function) could be placed in various groups.
The third column of Table 1 gives typical conditions (from among those described in Section 1.2.1) imposed on f for the construction and investigation of the methods. It should be noted that the condition collections for the majority of the methods are neither full nor accurate, since different versions of some methods require slightly different suppositions, and the precise conditions ensuring convergence are not known for all methods.
According to Section 1.1, the feasible region X is mostly assumed to be a compact subset of R^n, n ≥ 1, having a nonzero Lebesgue measure and a relatively simple structure, unless additional details are given. The fourth column of Table 1 contains such details required for the realization of the corresponding method.
The fifth column provides the number of the section describing or studying the method. Many corresponding references can be found in these sections. Of course, it is impossible to mention all works devoted to global optimization; as intended, those works are referred to that contain much information related to the corresponding subject.
The sixth and seventh columns reflect the state of the theoretical and numerical foundation of the methods. By the theoretical basis of a method we mean the existence of theoretical results on convergence, rate of convergence, optimality, and decision accuracy depending on the dimension of X, as well as recommendations on the choice of the method parameters. Needless to say, there is an element of subjectivity in our notes (since, e.g., new results have become known since the time of finalizing this book).

1.2.3 General properties of multiextremal functions


Not all the works connected with global optimization theory deal with the construction, investigation or application of optimization methods. For example, some recent investigations are devoted to asynchronous or parallel global optimization algorithms, but the author does not know of impressive results in this field. Another group of works studies general properties of multiextremal functions: these works investigate integral representations (see Section 2.3.3) and derive conditions on f ensuring that its every local minimum is also global. Let us outline the latter topic.
Let y be a real number and

    L_f(y) = {x ∈ X : f(x) ≤ y}

be a level set of f. Consider the set

    G = {y ∈ R : L_f(y) ≠ ∅}.

Zang and Avriel (1975) state that every local minimum of f is a global minimum of f if and only if the point-to-set mapping L_f: R^1 → 2^X is lower semicontinuous at every point y ∈ G. The lower semicontinuity of L_f at a point y ∈ G means that if x ∈ L_f(y), {y_i} ⊂ G, y_i → y (i → ∞), then there exist a natural number K and a sequence {x_i} such that x_i ∈ L_f(y_i), i ≥ K, and x_i → x (i → ∞).
The next result, of Gabrielsen (1986), is closely connected with the preceding one.

Proposition 1.2.1. Let f: R^n → R^1 be a function and X = {x ∈ R^n : f(x) < c} where c is a real number. Suppose that X is open, connected, and not empty; f is continuously differentiable in X; for every ε > 0 there exists a compact subset D_ε of X such that f(x) ≥ c − ε for all x ∉ D_ε; every stationary point of f is a strict local minimizer of f. Then there exists only one point x* ∈ X such that ∇f(x*) = 0 and f attains the global minimum at this point.

The main condition in Proposition 1.2.1 is that every stationary point of f be a strict local minimizer. If f is twice continuously differentiable in X, then this condition takes the following form: if x ∈ X and ∇f(x) = 0, then the Hessian ∇²f(x) is a positive definite matrix.
Proposition 1.2.1 presents a sufficient criterion for the uniextremality of a function f. Gabrielsen (1986) used it to investigate the unimodality of the likelihood function, which plays a prominent role in mathematical statistics. Demidenko (1988) used almost the same
Table 1. Principal Approaches and Methods of Global Optimization

(For each method the columns are: conditions on f; conditions on X; section; theoretical ground; amount of numerical results; notes.)

Approach: based on the use of local search techniques
- multistart: a) - a^IV), h) - h'), m) - m''); Section 2.1.3; theory: elementary; numerical results: different for various versions; in the pure form it is used very seldom.
- candidate points (using a cluster analysis technique): a) - a^IV), h) - h'), m) - m''); Section 2.1.3; theory: elementary; numerical results: sufficient; numerical results indicate high efficiency.
- tunneling: a') - a^IV), h'); Section 2.1.4; theory: different for various versions but not sufficient on the whole; numerical results: sufficient; the method is prospective but it is hard to judge its efficiency.
- Branin: a^IV); Section 2.1.5; theory: not sufficient; numerical results: sufficient; generally not sufficiently reliable and efficient.
- heavy ball: a''); Section 2.1.5; theory: poor; numerical results: sufficient; has poor efficiency and reliability.
- based on solution of differential equations: a''), a'''); Section 2.1.5; theory and numerical results: different for various versions; efficient versions exist for some classes of problems.
- based on smoothing the objective function: o), i), probably q); Section 2.1.6; theory: poor; numerical results: few; no results showing high efficiency.

Approach: covering methods
- grid search: b), b'), c'), a^IV); mainly X = [0,1]^n; Section 2.2.1; theory: high; numerical results: sufficient; optimal with respect to some criteria, but practically inefficient.
- polygonal line: b), b'); Section 2.2.2; theory: high; numerical results: sufficient; efficient for the case of small dimensions, but requires a considerable amount of auxiliary computations.
- Evtushenko: b), b'), c'); X = [0,1]^n; Section 2.2.2; theory: relatively high; numerical results: sufficient; efficient enough for the case n = 1 and sometimes also for small dimensions.
- Strongin: b), b'); X ⊂ [0,1]^n; Section 2.3.1; theory: relatively high; numerical results: sufficient; efficient for small n.
- sequentially best (Sukharev): b), b'); Section 2.2.3; theory: high; numerical results: none; has a theoretical significance only.

Approach: branch and bound methods
- based on the use of convex minorants and concave majorants: e'); hyperrectangle; Section 2.3.4; theory: relatively high; numerical results: small; complicated in realization, requires a separate construction for each objective function.
- interval: e'), probably q); hyperrectangle; Section 2.3.4; theory: relatively high; numerical results: different for various versions; complicated realization in the multidimensional case.
- concave minimization on a convex set: d''), g), g'), i''); convex polyhedron; Section 2.3.4; theory: relatively high; numerical results: different for various versions; the feasible region may have a high dimension.

Approach: based on dimension reduction
- coordinate-wise minimization: a), b); small dimension; Section 2.3.2; theory: poor; analogous to the corresponding local minimization algorithm.
- random directions: b), d'); Section 2.3.2; theory: poor; numerical results: sufficient; inefficient except for the case d') with p ≤ 4.
- multistep dimension reduction: b), j), j'); Section 2.3.2; theory: relatively high; numerical results: different for various versions; efficient for some particular cases (for instance, case j)).
- based on the use of Peano curve type mappings: b); X ⊂ [0,1]^n; Section 2.3.2; theory: relatively high; numerical results: different for various versions; no results showing high efficiency in the case n ≥ 3.
- based on screening of variables: i'), a); hyperrectangle; Section 2.3.2; theory: poor; numerical results: small; there are examples of efficient solution of complicated problems.

Approach: based on approximation and integral representations
- based on approximation of the objective function: f), a^IV); Section 2.3.3; theory: relatively high; numerical results: sufficient; some versions are optimal, but numerical results indicate poor efficiency.
- based on integral representation: m), n), o), p); Section 2.3.3; theory: different for various versions; numerical results: small; no results showing high efficiency.

Approach: based on statistical and axiomatic models
- Bayesian: k); mainly X ⊂ R^1; Sections 2.4.2, 2.4.3; theory: relatively high (especially for the case n = 1); numerical results: sufficient; efficient, but can become cumbersome (especially in multidimensional cases).
- information-statistical: b), k'); mainly X ⊂ R^1; Section 2.4.5; theory: relatively high; numerical results: none; has no practical significance.
- axiomatic: l); Section 2.4.4; theory: relatively high; numerical results: sufficient; there are many examples of efficient solution for some classes of optimization problems.

Approach: global random search
- random sampling: m'), o); Sections 3.1.1, 3.1.2; theory: high; numerical results: sufficient; poor efficiency, but the method serves as a component and a basic guideline for some other global random search methods.
- random covering: b), m'); Section 3.1.4; theory: relatively high; numerical results: few; more efficient than the preceding method.
- simulated annealing: n), o), possibly q); Section 3.3.2; theory: relatively high; numerical results: not much; mostly of theoretical importance.
- based on solution of stochastic differential equations: a'), a'''), possibly q); Section 3.3.3; theory: relatively high; numerical results: few; mostly of theoretical importance, the perspective of practical use is not yet clear enough.
- Markovian: o), m''), possibly q); Section 3.3; theory: relatively high; numerical results: not much; mostly of theoretical importance.
- random multistart: a) - a^IV), h) - h'), m) - m''); Sections 2.1.3, 4.5; theory: relatively high; numerical results: sufficient; the main version of multistart.
- branch and probability bound: o), p), p'), sometimes d), k), k'), n), m''), q); Section 4.3; theory: relatively high; numerical results: sufficient; there are a number of variants, some of which proved to be highly efficient.
- generation methods: n'), o), q) and others; Chapter 5; theory: different for various versions; numerical results: sufficient; inefficient for simple problems, but promising for complicated multidimensional problems.

criterion for verifying the uniextremality of the sum of squared deviations (which determines the least squares estimators) and of other functions arising in the identification of nonlinear stochastic models. The last mentioned work also generalized the above criterion to the case where the set X = {x ∈ R^n : f(x) < c} is not connected, and investigated the property of local uniextremality, i.e. the uniextremality of f on the connected parts of X.

1.3 Comparison and practical use of global optimization algorithms


Let us suppose that a user looks for a computer program or a package of computer programs that would solve optimization problems on a particular computer with an acceptable accuracy in a suitable time. If the problems are complicated, then he looks for efficient programs or algorithms and needs the results of algorithm comparisons.
This section discusses the problem of comparing global optimization algorithms and partly answers the question of how the user should choose a suitable algorithm or program. Numerical comparison is considered first.

1.3.1 Numerical comparison

Numerical comparison of global optimization algorithms consists of accomplishing numerical experiments on a computer by solving some (test or practical) multiextremal problems. Numerical comparison is not a particular feature of global optimization, but an important part of investigations in various fields of applied mathematics. In local optimization, classes of test functions, efficiency criteria, and conditions of numerical experiments are standardized, see Crowder et al. (1978), Hock and Schittkowski (1981) or Jackson and Mulvey (1978). Although some recommendations of the above works are suitable also for global optimization, an analogous standardization does not exist at present.
Unfortunately (but naturally), it is very hard to draw accurate conclusions on the
efficiency of (global) optimization algorithms from the results of numerical experiments.
The reasons for this are diverse: (i) the results are unavoidably subjective, (ii) the classes of
optimization problems on which a given algorithm (or program) is efficient are scarcely
formalized, and (iii) there are many different efficiency criteria. Let us briefly comment on
these reasons.
An algorithm and the corresponding computer program are of a different nature. Two
programs realizing the same algorithm but written by different programmers may differ in
efficiency: this is the first ground for subjectivity in the numerical comparison of
results.
Most of the global optimization algorithms depend on parameters whose choice is
ambiguous and hardly formalizable. This is the second basis for subjectivity and the
explanation of the fact that the author of an algorithm may usually exploit it much more
efficiently than an ordinary user.
Poor reporting accuracy and deficiency of numerical results represent the third ground
for subjectivity. These cause difficulties in estimating the influence on algorithm
efficiency of optimization problem features such as dimension, number of local
minimizers, size of the domain of attraction of the global minimizer, etc.
In order to overcome another difficulty, connected with the diversity of computers on
which the numerical experiments are performed, it is universally adopted to express the
computing time in standard time units which correspond to 1000 evaluations of the so-
called Shekel function

    f(x) = − Σ_{i=1}^{m} [ (x − a_i)ᵀ(x − a_i) + c_i ]⁻¹        (1.3.1)

for n = 4 and m = 5, where a_i and c_i (i = 1, ..., m) are fixed vectors and numbers.
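For illustration, the Shekel function is easily evaluated in a few lines; the following Python sketch uses the standard Dixon–Szegő test values of a_i and c_i (an assumption, since the text leaves them unspecified):

```python
# Shekel function (1.3.1): f(x) = -sum_i 1/((x - a_i)'(x - a_i) + c_i)
# for n = 4, m = 5; A and C below are the standard Dixon-Szego test values
# (illustrative -- the text does not fix them).

def shekel(x, a, c):
    return -sum(
        1.0 / (sum((xj - aij) ** 2 for xj, aij in zip(x, ai)) + ci)
        for ai, ci in zip(a, c)
    )

A = [[4, 4, 4, 4], [1, 1, 1, 1], [8, 8, 8, 8], [6, 6, 6, 6], [3, 7, 3, 7]]
C = [0.1, 0.2, 0.2, 0.4, 0.4]
```

With these particular values the global minimum lies close to the point (4, 4, 4, 4), where f ≈ −10.15.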
16 Chapter 1

It is very hard to determine a class of optimization problems on which a given
algorithm is numerically efficient. To do this, one experiments on various test (or
practical) multiextremal functions that hypothetically reflect several specific function
features. It is a difficult problem to choose a collection of multivariate test functions
reflecting required function features. Standard collections of test functions (see e.g. Dixon
and Szego, eds. (1978), Walster et al. (1985) and some other works) include functions
with a rather simple structure and thus are hardly able to reflect delicate features. Let us
mention only the test function of Csendes (1985)

    f(x) = Σ_{i=1}^{n} x_i⁶ (2 + sin(1/x_i)),    x_i ∈ [−1, 1]\{0},        (1.3.2)

which has a countable infinity of local minimizers. Work on standardizing classes of
global optimization test functions is being carried out by other researchers, so we shall
not discuss the topic further here.
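A minimal Python sketch of the Csendes function, assuming its usual form f(x) = Σ x_i⁶ (2 + sin(1/x_i)):

```python
import math

def csendes(x):
    # f(x) = sum_i x_i^6 * (2 + sin(1/x_i)) on [-1, 1]^n with all x_i != 0
    return sum(xi ** 6 * (2.0 + math.sin(1.0 / xi)) for xi in x)
```

Since 1 ≤ 2 + sin(1/x_i) ≤ 3, the function is squeezed between Σ x_i⁶ and 3 Σ x_i⁶, while the oscillation of sin(1/x_i) produces local minimizers accumulating at the origin.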
The third reason for doubtfulness of numerical comparison is due to the existence of
various efficiency criteria used in global optimization. The most common efficiency
criterion is the time (expressed in standard units) required to reach a given solution
accuracy. Sometimes the number of objective function evaluations is used instead of time.
Reliability is the second important criterion that is very difficult to estimate in complicated
setups. The simplicity of programming and the computer memory required are some other
quality criteria of global optimization algorithms.
It follows from the above that numerical comparison studies per se are hardly able to
measure the efficiency of global optimization algorithms appropriately and are not able to
satisfy all requirements of the users. In many cases, a rigorous theoretical study gives
more information on the efficiency. Let us indicate below several efficiency criteria for the
theoretical comparison of global optimization algorithms.

1.3.2 Theoretical comparison criteria


(i) Domain of algorithm convergence. This is the set of optimization problems for which a
point sequence generated by the algorithm converges to a global minimizer. The wider the
domain of convergence is, the more universal the algorithm. As will become clear later, the
convergence domain can be determined for a great number of algorithms. The difficulty
of using the domain of convergence as a quality criterion arises from the variety of
convergence types, especially so for probabilistic algorithms.

(ii) Speed of convergence. It is studied only for some groups of algorithms and serves
mainly for the comparison of algorithms within these groups, since the variety of types of
convergence speed and methods of their estimation is great (much greater than that of the
convergence domains).

(iii) Inaccuracy. Let d_N be a deterministic rule (i.e. algorithm) for the sequential
generation of N points x_1, ..., x_N from X.

The inaccuracy (error) of an algorithm d_N for a function f is usually defined as

    ε(f, d_N) = min_{1≤i≤N} f(x_i) − min_{x∈X} f(x),        (1.3.3)

and the same for a functional class F can be defined as

    ε(d_N) = sup_{f∈F} ε(f, d_N).        (1.3.4)

If it is possible to determine a measure λ(df) on the functional class F, then one can use
the inaccuracy

    ε(d_N) = ∫_F ε(f, d_N) λ(df)        (1.3.5)

instead of (1.3.4). For the probabilistic algorithms of Section 2.4, the mean value E{ε(d_N)}
usually replaces ε(d_N), defined by (1.3.4) or (1.3.5), as the measure of inaccuracy.
Inaccuracies of type (1.3.4) or (1.3.5) are important efficiency characteristics of global
optimization algorithms, but their utility range is very restricted, since they can be evaluated
only for a small part of the methods d_N and functional classes F.
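As a toy illustration of (1.3.3) and (1.3.4), the following Python sketch measures the minimax inaccuracy of a uniform-grid rule d_N over a small hypothetical class F of quadratics (the grid size, the class, and all numbers are illustrative assumptions):

```python
def inaccuracy(f, points, dense):
    # eps(f, d_N): best value found at the N evaluated points minus the true
    # minimum (eq. 1.3.3); the true minimum is approximated on a dense grid.
    best_found = min(f(x) for x in points)
    true_min = min(f(x) for x in dense)
    return best_found - true_min

N = 11
grid = [i / (N - 1) for i in range(N)]        # deterministic rule d_N: uniform grid on [0, 1]
dense = [i / 10000 for i in range(10001)]     # reference grid for the true minimum

family = [lambda x, c=c: (x - c) ** 2 for c in (0.17, 0.5, 0.83)]  # toy class F
worst = max(inaccuracy(f, grid, dense) for f in family)            # eq. (1.3.4)
```

Here the supremum in (1.3.4) is attained by the class members whose minimizers fall farthest from the grid; the minimax inaccuracy is the squared half-spacing of the grid at worst.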

(iv) Optimality. According to the ordinary definition of optimality (in the minimax sense),
an algorithm is called optimal if it minimizes the inaccuracy (1.3.4) for a fixed N, or the
number N for a fixed value of ε(d_N). For some functional classes F (of the type
F = Lip(X, L, ρ)) optimal methods exist and can be constructed. Various numerical results
and theoretical studies show, however, that the above optimality property gives almost
nothing from the practical point of view. This is connected with the fact that the minimax-
optimal method gives a good result for the worst function from a given class F, but for an
ordinary one it may be much worse than some other algorithm.
The use of the inaccuracy measure (1.3.5) leads to the Bayesian concept of optimality.
Some other concepts of optimality (stepwise, Bayesian stepwise, asymptotic,
asymptotic by order, etc.) also exist, but the observation that theoretical optimality of a
particular method does not guarantee its practical efficiency is valid again.
Many works in global optimization theory are devoted to the problem of the optimal
choice of algorithm parameters or their components: in essence, these works consider the
optimality of parameters over narrow classes of algorithms.
All the above leads to the conclusion that even if a method may be theoretically
investigated, the comparison of its efficiency to the efficiency of a method belonging to
another approach is complicated and hardly formalizable because of the great variety of
algorithm characteristics and optimization circumstances. Further problems arise when
heuristic methods are investigated, see Zanakis and Evans (1981).
In conclusion, while studying the efficiency of a global optimization method, it is
worthwhile to use a composite approach including the determination of a class of
optimization problems that is being solved by the method, the investigation of theoretical
properties of the method, and its numerical study. We shall mainly be concerned with the
theoretical study in this book.

1.3.3 Practical optimization problems

It is hard to find branches of science or engineering that do not induce global optimization
problems. Many such problems arise e.g. in the following fields: optimal design,
construction, identification, location, pattern recognition, control, experimental design,
etc. Instead of detailed description of the corresponding classes of optimization problems
or particular ways of their solution we refer to a number of papers in the Journal of
Optimization Theory and Applications, and also to Dixon and Szegö, eds. (1975, 1978),
Batishev (1975), Mockus (1967), Zilinskas (1986), Zhigljavsky (1987) and make a few
additional comments.
Experimental design theory (see Ermakov et al. (1983), Ermakov and Zhigljavsky
(1987)) leads to a wide class of complicated multiextremal problems: some of them
occasionally regarded as test ones. For instance, Hartley and Rund (1969), Ermakov and
Mitioglova (1977), and Zhigljavsky (1985) used the function

    f(x) = det Σ_{i=1}^{6} g(x_i, x_{6+i}) gᵀ(x_i, x_{6+i}),
    where g(u, v) = (1, u, v, u², uv, v²)ᵀ,        (1.3.6)

that is to be maximized on the set

    X = {x ∈ R¹²: 1 ≤ x_i ≤ 6, i = 1, ..., 12}.        (1.3.7)

The function (1.3.6) is the determinant of a 6×6 matrix that depends on 12 variables;
we remark that its maximizer corresponds to the saturated D-optimal design for the second-
degree polynomial regression

    η(x, y) = θ₁ + θ₂x + θ₃y + θ₄x² + θ₅xy + θ₆y²

on the square (x, y) ∈ [1, 6]×[1, 6]. The function (1.3.6) is complicated to optimize, since it
has several thousand local maximizers, it has a great number of global maximizers,
and its relatively high dimension eliminates the possibility of using most of the
deterministic methods. Note that one can use a different feasible region instead of
(1.3.7); furthermore, test problems similar to (1.3.6)-(1.3.7) were treated
in Bates (1983), Bohachevsky et al. (1986), Haines (1987).
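The function (1.3.6) can be evaluated directly; the pure-Python sketch below follows the reading that f(x) is the determinant of the 6×6 information matrix built from the six points (x_i, x_{6+i}) — the helper `det` and the point layout are illustrative assumptions:

```python
def det(M):
    # determinant by Gaussian elimination with partial pivoting
    M = [row[:] for row in M]
    n, d = len(M), 1.0
    for j in range(n):
        p = max(range(j, n), key=lambda i: abs(M[i][j]))
        if abs(M[p][j]) < 1e-12:
            return 0.0
        if p != j:
            M[j], M[p] = M[p], M[j]
            d = -d
        d *= M[j][j]
        for i in range(j + 1, n):
            r = M[i][j] / M[j][j]
            for k in range(j, n):
                M[i][k] -= r * M[j][k]
    return d

def f_design(x):
    # x = (x_1, ..., x_12): six design points (x_i, x_{6+i}) in [1, 6]^2;
    # f(x) = det sum_i g g' with g = (1, u, v, u^2, uv, v^2)'
    m = [[0.0] * 6 for _ in range(6)]
    for i in range(6):
        u, v = x[i], x[6 + i]
        g = [1.0, u, v, u * u, u * v, v * v]
        for a in range(6):
            for b in range(6):
                m[a][b] += g[a] * g[b]
    return det(m)
```

The determinant vanishes for degenerate point sets (e.g. six coincident points), which is one source of the function's many stationary points.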
The greater the capacity of contemporary computers, the more topical optimization in
simulation models becomes. These are problems of optimizing functions subjected to a
random noise that can be controlled. The optimization problems of mathematical models,
in which evaluating the objective function requires solving a differential or integral
equation, are closely related to the above-mentioned ones. Their main peculiarity is that
the noise, which can be controlled as well, is nonrandom.

Formally, a great number of optimization problems arising in economics, optimal
design, and optimal construction are not represented in the standard form described above,
since (i) the feasible region can be either fully or partly discrete, and (ii) the problem can
be of multicriterial character. However, a number of methods described in the book may be
applied to such problems. Indeed, it was pointed out that the discreteness of X does not
applied to such problems. Indeed, it was pointed out that the discreteness of X does not
exclude the possibility of applying the majority of global optimization approaches.
Moreover, multicriterion optimization problems are usually reduced to a number of global
optimization ones for objective functions which are transformations of the initial criteria.
Since these functions are defined on the same feasible region, the grid methods of Section
2.2.1 are rather efficient for solving these problems. For further discussion see Section
6.2.
CHAPTER 2. GLOBAL OPTIMIZATION METHODS

According to the classification of Section 1.2.2, this chapter describes some principal
approaches and methods of global optimization, except the global random search methods
to which Part 2 is devoted. We shall consider the minimization of the objective function f
(belonging to a given functional set F and evaluated without noise) on the feasible region
X (a compact subset of Rⁿ having a sufficiently simple structure).
Section 2.1 deals with global optimization algorithms based on the use of local
methods. They are widespread in practice, but many of them have not been theoretically
investigated to a considerable extent.
Section 2.2 is devoted to the covering methods including passive grid searches. The
situation here is opposite to the above: the covering methods are thoroughly studied
theoretically, but their practical importance is not great.
Section 2.3 considers one-dimensional optimization algorithms, reduction and
partition techniques. The most attractive of them are the branch and bound methods.
Under various (generally substantial) prior information about f, they combine practical
efficiency with theoretical elegance.
Section 2.4 treats the approach based on stochastic and axiomatic models of the
objective function. Here the balance of the theoretical and practical significance of the
algorithms does not hold: most of them are cumbersome and time-consuming when
implemented.

2.1 Global optimization algorithms based on the use of local search techniques

The bulk of the theory and methodology of local optimization was actually developed
around the 1960s. By that time, global optimization theory as such did not exist, and
global optimization methodology had hardly been discussed. Thus it is not surprising
that most global optimization algorithms of the sixties and seventies having practical
importance were based on local optimization methodology. Over the past decade the
situation has changed, but global algorithms based on local methods have remained
popular both in theory and practice.
Section 2.1.1 presents some local algorithms. Section 2.1.2 describes general ways of
using local algorithms to construct global optimization methods. One of the most popular
global optimization algorithms is called multistart. Its basic form and its modifications are
discussed in Section 2.1.3. Section 2.1.4 is devoted to tunneling algorithms, consisting of
sequential local minimization of functions constructed by iteratively transforming the
objective function. Section 2.1.5 is concerned with algorithms of transition from some
local minima to others. The construction of such algorithms is closely connected with
solving certain kinds of differential equations or systems of differential equations.
Section 2.1.6 considers algorithms consisting of local minimization of a function obtained
by smoothing the objective function.

Global Optimization Methods 21

2.1.1 Local optimization algorithms

First, various setups of the local optimization problem are considered.
Let X be a measurable subset of Rⁿ, n ≥ 1, and f: X → R¹ be a lower bounded
measurable function (the objective function). A point x* is called a local minimizer of f if
there exists an ε > 0 such that f(x) ≥ f(x*) for all x ∈ B(x*, ε). If f has a unique local
minimizer, then it coincides with the global minimizer of f, i.e. x* = arg min f, and the
function f is then referred to as uniextremal. The local optimization problem is that of the
approximate determination of a local minimizer of f, whose values (and sometimes
derivatives as well) can be evaluated at any point x ∈ X.
If X = Rⁿ, then the problem is called the unconstrained local optimization problem; in
the opposite case it is constrained and is also called the mathematical programming
problem. In the particular case when f and X are convex, it is called the convex
programming problem.
In local optimization algorithms, a sequence of points x₁, x₂, ... from X converging to
x* under certain assumptions is constructed. The way of constructing this sequence
depends on the properties of f and X, the information on f and X used at each iteration,
and the degree of simplicity desired in the construction.
The point sequence {x_k} in unconstrained local optimization algorithms is constructed
by a recursive relation

    x_{k+1} = x_k + γ_k s_k,    k = 1, 2, ...,

where x₁ ∈ X is an initial point, s_k is the search direction, and γ_k ≥ 0 is the step-length.
Local minimization algorithms differ in the way of constructing {γ_k} and {s_k}; this
usually results in descent (or relaxation) algorithms, for which the inequalities
f(x_{k+1}) ≤ f(x_k) hold for all k = 1, 2, .... To this end, it is necessary to find {s_k} such
that s_kᵀ∇f(x_k) < 0 for each k = 1, 2, ... and to choose γ_k in a suitable way. Two ways of
selecting γ_k are the most well-known. The first one is to choose

    γ_k = arg min_{γ≥0} f_k(γ),        (2.1.1)

where f_k(γ) = f(x_k + γ s_k) is the one-dimensional function determined by f, x_k and s_k.
This way is primarily of theoretical significance.
The second way is as follows. Let γ₁ > 0 and β > 1 be some real values, γ_{k−1} be the
preceding step-length, and s_k be the search direction at x_k. If the inequality

    f(x_k + γ_{k−1} s_k) < f(x_k)        (2.1.2)

holds, then set γ_k = γ_{k−1} β^i, where

    i = max{j = 1, 2, ... : f(x_k + γ_{k−1} β^j s_k) < f(x_k)}

or

    i = max{j = 1, 2, ... : f(x_k + γ_{k−1} β^j s_k) < f(x_k + γ_{k−1} β^{j−1} s_k)}.

If (2.1.2) does not hold, then set γ_k = γ_{k−1} β^{−i}, where

    i = min{j = 1, 2, ... : f(x_k + γ_{k−1} β^{−j} s_k) < f(x_k)}.

In the case β = 2 this way is referred to as the bisection method.
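One plausible reading of this step-length rule (extrapolate by factors of β while the step keeps decreasing f, otherwise contract by factors of β until a decrease is found) is sketched below in Python; the exact success test and the iteration cap are assumptions made for illustration:

```python
def step_length(f, x, s, gamma_prev, beta=2.0, max_iter=50):
    # Adaptive step-length sketch: if the previous step-length still decreases f
    # (cf. (2.1.2)), multiply it by beta while f keeps decreasing; otherwise
    # divide it by beta until a decrease with respect to f(x) is obtained.
    phi = lambda g: f([xi + g * si for xi, si in zip(x, s)])
    f0 = f(x)
    if phi(gamma_prev) < f0:                     # (2.1.2) holds: extrapolate
        g = gamma_prev
        while phi(g * beta) < phi(g) and max_iter > 0:
            g *= beta
            max_iter -= 1
        return g
    g = gamma_prev                               # (2.1.2) fails: contract
    while phi(g) >= f0 and max_iter > 0:
        g /= beta
        max_iter -= 1
    return g
```

For f(x) = x² at x = 1 with the descent direction s = −1, a too-short previous step is doubled up to the exact minimizing step 1, and a too-long one is halved back down to it.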


Depending on the information used for constructing s_k, local optimization algorithms
can be divided into three groups, viz., direct search algorithms, which make use of values
of f only; first-order ones, also using the first derivatives of f; and second-order ones, also
using the second derivatives of f. (Of course, f is supposed to be smooth enough when
considering the latter two groups of methods.)
First and second-order algorithms are mostly special cases of the generalized gradient
algorithm, in which s_k = −A_k ∇f(x_k) and A_k (k = 1, 2, ...) are some positive definite
matrices. For the gradient method A_k = I_n for all k ≥ 1, where I_n is the unit (n×n)-matrix.
If, in addition, the γ_k are chosen as in (2.1.1), then the algorithm is called the steepest
descent method. In the Newton method A_k = [∇²f(x_k)]⁻¹. In variable-metric methods the
A_k are recursive approximations of [∇²f(x_k)]⁻¹. One of the most well-known variable-
metric methods is due to Davidon, Fletcher and Powell; for this method (2.1.1) holds and

    A_{k+1} = A_k − (A_k a_k a_kᵀ A_k)/(a_kᵀ A_k a_k) + (Δ_k Δ_kᵀ)/(Δ_kᵀ a_k),

where a_k = ∇f(x_{k+1}) − ∇f(x_k) and Δ_k = x_{k+1} − x_k.

A general class of first-order local minimization algorithms is formed by the methods of
conjugate directions. This class contains, e.g., a version of the Davidon-Fletcher-Powell
algorithm and the Fletcher-Reeves method; for the latter, (2.1.1) holds and

    s_{k+1} = −∇f(x_{k+1}) + (‖∇f(x_{k+1})‖² / ‖∇f(x_k)‖²) s_k.

The direct search algorithms usually apply the bisection method for the determination of
step-lengths. At each k-th iteration of many search algorithms, that one of the two
directions +e or −e is taken as s_k for which f decreases. In the coordinate-wise search
algorithms, e belongs to the collection {e₁, ..., e_n} of the unit coordinate vectors. The
cyclic coordinate-wise descent selects the e_i sequentially, but the random coordinate-wise
search does it at random.
The random search algorithm with uniform trial draws e as a realization of a random
vector uniformly distributed on the unit sphere S = {x ∈ Rⁿ: ‖x‖ = 1}. In random
searches with learning, e is a realization of a random vector whose distribution depends on
the previous evaluations of f. The random m-gradient method (1 < m < n) selects

    s_k = − Σ_{i=1}^{m} q_i [f(x_k + α q_i) − f(x_k)] / α,

where α is a small positive value and q₁, ..., q_m are orthonormal vectors constructed
by means of the orthogonalization procedure from m independent realizations of random
vectors uniformly distributed on S.
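The random search with uniform trial can be sketched as follows in Python; the acceptance rule (try +e, then −e, and move whenever f decreases) and all parameter values are illustrative assumptions:

```python
import math, random

def random_direction(n):
    # uniform direction on the unit sphere S = {x: ||x|| = 1}, via normalized Gaussians
    v = [random.gauss(0.0, 1.0) for _ in range(n)]
    r = math.sqrt(sum(vi * vi for vi in v))
    return [vi / r for vi in v]

def random_search(f, x, iters=200, gamma=0.1, seed=0):
    # random search with uniform trial: try +e and -e, accept a move if f decreases
    random.seed(seed)
    x = list(x)
    for _ in range(iters):
        e = random_direction(len(x))
        for s in (1.0, -1.0):
            y = [xi + s * gamma * ei for xi, ei in zip(x, e)]
            if f(y) < f(x):
                x = y
                break
    return x
```

With a fixed step-length γ the search stalls once it is within roughly γ/2 of a minimizer, which is why such rules are usually combined with step-length adaptation.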
A thorough description and investigation of local optimization techniques can be found
in many textbooks; see, for instance, Avriel (1976), Dennis and Schnabel (1983) or
Fletcher (1980). Among others, Fedorov (1979), Demyanov and Vasil'ev (1985), and
Mikhalevitch et al. (1987) treated the nondifferentiable case. It should be noted that the
local optimization routines available in contemporary software packages are suitable for
most practical needs.

2.1.2 Use of local algorithms in constructing global optimization strategies

If the objective function f is multiextremal, then the application of a local optimization
algorithm leads to a local extremum which is generally not a global one. Thus, in order to
find a global extremum, one has to use another technique ensuring global optimization.
The local optimization techniques, however, occupy an important place in the global
optimization methodology. This is mainly due to the usual construction of global
optimization strategies, consisting of two stages: global and local. At the global stage, the
objective function is evaluated at points located, so to speak, to cover X: its aim is to
reach a small neighbourhood of a global optimizer.
The local stage of global optimization algorithms may be expressed in explicit or
implicit form and used one or more times. An implicit form of the local stage is present in
a global algorithm when several points are chosen in neighbourhoods of some points
obtained previously. An explicit form of the local stage is present in the following cases: if
a crude approximation of a global optimizer is obtained and it has to be refined; or if one
simply needs to obtain values of the objective function as small as possible (in some
particular algorithms considered below).

Ordinarily, local algorithms are aimed at locating a global optimizer x*, making the
approximation of f* = f(x*) more precise. To do this, one should apply a local descent
routine starting from an initial point x(0) that is an approximation of x* obtained by a
global optimization method. Roughly speaking, in that case the aim of the global stage is
the localization of x* and the aim of the local one is finding the precise location of x*.
Applying the above approach, it is necessary to assume that the point x(0) belongs to a
neighbourhood of x* in which f is sufficiently smooth and has only one local minimum.
The efficiency of a considerable number of global minimization methods depends on the
closeness of the values f_k* and f* for all k ≥ 1. In these methods it is natural to perform
some iterations of a local descent immediately after obtaining a new record point.
Occasionally, if the iterations of a local descent are easily computable, then it is
advantageous to start them from several points obtained at the global stage. This problem
for the random optimization technique is discussed later in Section 3.2.
Note that if a local optimization algorithm is part of a global optimization strategy, then
it should be supposed that the objective function is subjected to some local conditions
besides the global ones. Moreover, it can be concluded from the cited properties of local
optimization algorithms (see Section 2.1.1) that it is especially advantageous to include
them into a global optimization strategy, if the derivatives of the objective function may be
evaluated without much effort.

2.1.3 Multistart

The technique under consideration is concerned with the method nowadays called
multistart, which has been historically the first and for a long time the only widely used
method of global optimization. This method consists of multiple (successive or
simultaneous) searches for local extrema starting at different initial points. The initial
points are frequently chosen as elements of a uniform grid (such grids are described later
in Section 2.2.1).
To use multistart, one should make assumptions about the number of local optimizers
of the objective function and also about its smoothness.
The main difficulty in the practical use of multistart is the following: to obtain a global
optimizer with high reliability, one should choose the number of initial points much greater
than the number of local optimizers (which is usually unknown). So the main part of the
computational effort would be spent on attaining local extrema repeatedly. If it is supposed
that the local optimizers of the objective function are rather far from each other (a heuristic
supposition), then the basic version of multistart is often modified in one of the two ways
described below.
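The basic version of multistart can be sketched as follows; the crude coordinate-wise descent routine below stands in for an arbitrary local method and is purely illustrative:

```python
import random

def local_descent(f, x, step=0.1, tol=1e-6):
    # crude coordinate-wise descent with step halving (stand-in for any local routine)
    x = list(x)
    while step > tol:
        improved = False
        for i in range(len(x)):
            for s in (step, -step):
                y = x[:]
                y[i] += s
                if f(y) < f(x):
                    x, improved = y, True
        if not improved:
            step /= 2.0
    return x

def multistart(f, bounds, n_starts=20, seed=1):
    # multistart: local descents from uniformly drawn initial points, keep the best
    random.seed(seed)
    best = None
    for _ in range(n_starts):
        x0 = [random.uniform(lo, hi) for lo, hi in bounds]
        x = local_descent(f, x0)
        if best is None or f(x) < f(best):
            best = x
    return best
```

On a one-dimensional test function such as f(x) = x⁴ − x² + 0.2x (two local minima, the global one on the negative half-line) a handful of starts suffices, but in general the number of starts must exceed the unknown number of local minimizers by a wide margin.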
The first version is to surround every evaluated local minimizer by a neighbourhood
and to attribute the following property to it: if a point of the search attains the
neighbourhood, then it moves into the local minimizer. The proper use of the
corresponding algorithm is possible only if one is able to choose neighbourhoods which
either are subsets of the corresponding domains of attraction of the local minimizers or
surely do not contain a global optimizer. In the latter case the global minimization
algorithm is a covering method and is studied in Section 2.2.
The second way is called the candidate points method and consists of simultaneous
local descents from many initial points and joining neighbouring points. The joining
action is equivalent to substituting one of them (that with the smallest objective function
value) for such points. The problem of joining belongs to cluster analysis; thus any
clustering method may be used for this purpose. One of the most simple and popular
methods of joining is the so-called nearest neighbour method, based on the heuristic
supposition that the distance ρ_ij = ρ(z_i, z_j) (in a metric ρ) between points z_i and z_j
belonging to a neighbourhood of one local minimizer is less than the distance between
points belonging to the neighbourhoods of different minimizers.

Algorithm 2.1.1 (nearest neighbour method).

1. Let K points z₁, ..., z_K be given. Assume that all of them belong to different
clusters, i.e. the number of clusters equals K (here the word cluster is associated
with a neighbourhood of a local minimizer).
2. Find the nearest pair of points (z_i, z_j), i.e. such a pair for which

    ρ_ij = min_{k,l = 1, ..., K; k ≠ l} ρ_kl.

3. If the distance ρ_ij between the nearest neighbours z_i and z_j does not exceed a
certain small δ > 0, then the points z_i and z_j are joined and the corresponding
clusters are unified; hence the number of clusters K decreases by 1 (i.e. K−1 is
substituted for K).
4. If the distance between the nearest neighbours exceeds δ, or K = 1 (only one
cluster remains), then the algorithm stops. Otherwise go to Step 2.

In Algorithm 2.1.1 the metric ρ is usually chosen to be Euclidean. The performance of
Algorithm 2.1.1 depends on the choice of δ as well. This number must be small enough,
at any rate less than any distance between a pair of local minimizers.
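Algorithm 2.1.1 admits a direct implementation; the Python sketch below uses the Euclidean metric and single-linkage distances between clusters (representing clusters as point lists is an implementation choice):

```python
def nearest_neighbour_clusters(points, delta):
    # Algorithm 2.1.1: start with singleton clusters, repeatedly merge the
    # closest pair of clusters while their distance does not exceed delta
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > delta:
            break
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters
```

Two well-separated pairs of points are merged into two clusters for small δ and into a single cluster once δ exceeds the gap between the pairs.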
The typical candidate points method is as follows.
Algorithm 2.1.2 (candidate points method).

1. Obtain N points x₁, ..., x_N, generating them uniformly on X (i.e. x₁, ..., x_N are
the elements of a uniform random sample).
2. By performing several iterations of a local descent routine from the initial points
x₁, ..., x_N, obtain points z₁, ..., z_N (the number of local descent iterations from the
points x_i may depend on the values of the objective function f(x_i), i = 1, ..., N).
3. Apply a cluster analysis method (for instance, Algorithm 2.1.1) to the points
z₁, ..., z_N. Let m be the number of clusters obtained.
4. Select representatives x₁, ..., x_m from the clusters (a natural selection criterion is
the objective function value). If m = 1, or the number of local descent iterations
exceeds a fixed number, then go to Step 5. Otherwise put N = m and go to Step 2.
5. Suppose we are in the neighbourhood of a global minimizer. In case of
necessity, apply a local optimization routine which has a high speed of convergence
in the neighbourhood of an extremal point.

Algorithm 2.1.2 is heuristic; its efficiency depends on the choice of its parameters and
auxiliary methods. The number N must be significantly greater than the expected number
of local minima of the objective function. The usual choice is Algorithm 2.1.1 as the
cluster analysis method, together with a local optimization routine available in computer
software. Note that quasirandom grids have almost the same advantages as random ones
for the selection of the initial points in candidate points methods.
Algorithm 2.1.2 and some similar methods (see Törn (1978), Spircu (1978), Batishev
and Lubomirov (1985)) are appealing in the case of some complicated multiextremal
problems. The well-known method of Boender et al. (1982) is also based on the above
principles and proved to be efficient for solving a number of multivariate global
optimization problems. Its recent version goes under the name of multi-level single
linkage method and is summarized as follows.
Algorithm 2.1.3 (multi-level single linkage method).

1. Set k = 1, Z = ∅.
2. Generate N random points uniformly distributed on X, evaluate f at each one,
and add them to the point collection Z.
3. Select points from Z as initial ones for local descent.
4. Perform local minimization from all initial points selected.
5. Check a stopping rule. If the algorithm is not terminated, then return to Step 2,
substituting k+1 for k.

At each k-th iteration the following selection procedure is applied at Step 3 of
Algorithm 2.1.3. First, cut off the sample Z and choose the γkN points of Z with the
lowest function values, where γ is a fixed number in (0, 1]. Each chosen point x is
selected as an initial point for a local descent if it has not been used as an initial point at a
previous iteration, and if there is no neighbouring sample point z ∈ Z ∩ B(x, η_k) with a
lower function value, where the critical distance η_k is given by

    η_k = π^{−1/2} [Γ(1 + n/2) μ(X) δ log(kN)/(kN)]^{1/n},

where δ is a positive constant and μ(X) is the volume of X.
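The critical distance in the published versions of the method (Rinnooy Kan and Timmer, 1987) shrinks logarithmically with the accumulated sample size; the Python sketch below assumes their formula, with `volume` standing for the measure of X and `sigma` for the positive constant:

```python
import math

def critical_distance(k, N, n, volume, sigma):
    # eta_k = pi^{-1/2} * [Gamma(1 + n/2) * volume * sigma * log(kN) / (kN)]^{1/n}
    # (the multi-level single linkage critical distance, assumed form)
    kN = k * N
    core = math.gamma(1.0 + n / 2.0) * volume * sigma * math.log(kN) / kN
    return core ** (1.0 / n) / math.sqrt(math.pi)
```

Because η_k decreases as k grows, fewer and fewer sample points are blocked by better neighbours, and in the limit every local minimizer found once is found only once.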


The works of Rinnooy Kan (1987) and Rinnooy Kan and Timmer (1985, 1987a,b)
contain many more details on the theme of using cluster analysis techniques in global
optimization algorithms. Note that statistical inference in multistart, in which the initial
points are the elements of a random grid, will be studied in Section 4.5 under the term
random multistart.

2.1.4 Tunneling algorithms


Over the past decade the so-called tunneling algorithms have gained popularity. Their
essence consists in the successive search for a new point x(0) in the set

    U(x*) = {x ∈ X: f(x) < f(x*)}        (2.1.3)

after obtaining a record point x*, and then descending from the point x(0).
A tunneling algorithm consists of two stages: a minimization stage and a tunneling
one. These stages are used sequentially. In the minimization stage a local descent routine
is used for finding a local minimizer x* of the objective function. In the tunneling stage,
the so-called tunneling function (it has also been called a penalty or filled function) T(x) is
determined. This function attains a maximum (possibly a local maximum) at x*, has
continuous first derivatives at all points (except possibly x*), and depends on f, x*, and a
number of parameters that are automatically chosen by the algorithm. After constructing
the tunneling function, a point x(0) from the set (2.1.3) is sought at the tunneling stage.
Then one proceeds to the minimization stage and finds a local minimizer of f starting at
x(0). The local minimizer obtained is a new record point, and the above iteration may be
repeated. The search stops if it does not succeed in finding a point x(0) ∈ U(x*) in a given
number of algorithmic steps.
Vilkov et al. (1975) and Levy and Montalvo (1985) determined the tunneling function by

    T(x) = (f(x) − f(x*)) / ‖x − x*‖^{2α},        (2.1.4)

where α > 0 is a fixed parameter. If there are some record points x*(1), ..., x*(r) with the
same objective function value, then Levy and Montalvo (1985) proposed to construct the
tunneling function by

    T(x) = (f(x) − f(x*(1))) / Π_{i=1}^{r} ‖x − x*(i)‖^{2α_i}.
Ge (1983) determined the tunneling function (calling it a filled function) via

    T(x) = [η + f(x)]⁻¹ exp{−‖x − x*‖²/β²},        (2.1.5)

where η and β are parameters, β > 0, and

    η + f(x) > 0  for all x ∈ X.        (2.1.6)

Ge and Qin (1987) modified (2.1.5) in different ways and presented five new tunneling
functions, viz.,

    [η + f(x)]⁻¹ exp{−‖x − x*‖/β²},

    −{β² log[η + f(x)] + ‖x − x*‖^p},    p = 1, 2,

    −(f(x) − f(x*)) exp{β²‖x − x*‖^p},    p = 1, 2.

To find a point x(0) ∈ U(x*) in the tunneling stage, one needs to perform a local
minimization of the tunneling function starting at an initial point x₀ belonging to a rather
small neighbourhood of the record point x*. If the local minimizer of T does not belong
to the set (2.1.3), or the search trajectory leaves X without attaining U(x*), then it is
advisable to change the values of the parameters (parameter α in the case of (2.1.4) and
parameters η, β in the case of (2.1.5)) and to descend into a local minimizer of the
modified function T. If after several changes of parameter values a point from the set
U(x*) is not found, then it is advisable to perform the same actions using another initial
point x₀. For example, it is natural that such points are sequentially selected from the
collection of points

    x₀ = x* ± γ e_i,    i = 1, ..., n,        (2.1.7)

where γ is a small number and the e_i are the coordinate vectors.
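The tunneling function (2.1.4) and the restart collection (2.1.7) can be sketched as follows in Python (assuming the Levy–Montalvo form of (2.1.4)):

```python
def tunneling(f, x, x_star, alpha=1.0):
    # T(x) = (f(x) - f(x*)) / ||x - x*||^(2*alpha), cf. (2.1.4)
    d2 = sum((a - b) ** 2 for a, b in zip(x, x_star))
    return (f(x) - f(x_star)) / d2 ** alpha

def restart_points(x_star, gamma):
    # candidate initial points x0 = x* +/- gamma * e_i, i = 1, ..., n, cf. (2.1.7)
    pts = []
    for i in range(len(x_star)):
        for s in (gamma, -gamma):
            p = list(x_star)
            p[i] += s
            pts.append(p)
    return pts
```

Note that T(x) is negative exactly on the set (2.1.3), so a local descent on T that attains a negative value has located the desired point x(0).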


Vilkov et al. (1975) showed that if X is an interval, the function f is continuously
differentiable and has a finite number of local minimizers, the tunneling function has the
form (2.1.4), and its minimization is performed for all parameter values α > 0, then the
above defined tunneling method converges to a global minimizer of the objective function
f. For the multidimensional case, a similar statement on the convergence of the tunneling
method using the function (2.1.4) is absent.
Let us consider the form (2.1.5) of the tunneling function more thoroughly. Suppose
that f is continuously differentiable; L is a bound for the Euclidean norm of its gradient,
i.e. a constant such that the inequality ‖∇f(x)‖ ≤ L holds for all x ∈ X; x* is a global
minimizer of f; x* is a local minimizer; B(x*) is the subset of the domain of attraction of
x* containing the points x such that

D(x*) is the radius of the set B(x*), i.e. the shortest distance from x* to the boundary of
B(x*); and D is the smallest radius of all the subsets B(x*), i.e.

    D = min D(x*),

where the minimum is taken over all local minimizers of f.


The next two statements are due to Ge (1983).

Proposition 2.1.1. If the inequalities (2.1.6) and

(2.1.8)

hold, then the function (2.1.5) cannot have any stationary point in the set

(2.1.9)

Proposition 2.1.2. If (2.1.6) holds and the ratio β²/(η + f(x*)) is small enough to
assure the inequality

(2.1. 10)

then the function (2.1.5) has no local minimizers in the interior of X.
Moreover, it is evident that if (2.1.6) holds, then x* is a local maximizer of the function
(2.1.5).
Proposition 2.1.1 shows that if the ratio β²/(η + f(x*)) is small enough (satisfies the
inequality (2.1.8)), then the function (2.1.5) has no stationary points in the set (2.1.9).
Hence, any local descent algorithm applied to the function (2.1.5) and starting from a
neighbourhood of x* should either arrive at the boundary of X or reach the set U(x*).
On the other hand, Proposition 2.1.2 yields that if this ratio is too small (satisfies the
inequality (2.1.10)), then any local descent algorithm applied to the tunneling function
(2.1.5) should arrive at the boundary of X. Hence, in this case the aim of the tunneling
stage cannot be reached, and further search for a global minimizer of f would be
impossible.
The main difficulty in constructing a suitable version of the tunneling algorithm is
connected with the problem of choosing its parameters η, β (or, respectively, α and α_i).
Under a favourable choice, the convergence of the algorithm may be ensured. Let us
now formulate the basic version of the Ge algorithm.
Algorithm 2.1.4 (Ge (1983)).

1. Find a local minimizer x* of the function f starting at an arbitrary point x₁; set k = 1.

2. Form the tunneling function T by (2.1.5) (for the modified tunneling functions
presented above the algorithm is analogous, see Ge and Qin (1987)).
30 Chapter 2

3. Sequentially starting at the initial points (2.1.7), descend to local minimizers of the
tunneling function T until either a local minimizer of T or a point from U(x*) is found. If
neither a local minimizer of T nor a point from U(x*) is found, then go to Step 8.

4. Use a local descent routine for the function f, starting at the point obtained at Step 3. Let
z* be the obtained local minimizer of f.

5. If f(z*) ≤ f(x*), then substitute z* for x* and go to Step 2.


6. If f(z*) > f(x*) and k < K (where K is a given number), then increase the values of η and
β so that the ratio β²/(η + f(x*)) decreases. Go to Step 3, but use the initial point

instead of (2.1.7) and z* instead of x* in formula (2.1.5).


7. If f(z*) > f(x*) and k ≥ K, then the algorithm stops and gives the point x* as a global
minimizer of f.

8. Decrease η to increase the ratio β²/(η + f(x*)), replace k by k + 1 and go to Step 2.

In Algorithm 2.1.4 the choice of the number K (defining the stopping criterion) and the
rules for changing the values of the parameters η, β are heuristic, but they considerably
influence the efficiency of the algorithm.
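The role of the tunneling stage is easiest to see for the one-parameter function (2.1.4): away from x*, its denominator is positive, so T(x) is negative exactly where f(x) lies below the record value, and finding a non-positive value of T achieves the aim of the stage. A minimal sketch of this sign property (the standard Levy–Montalvo form T(x) = (f(x) − f(x*))/||x − x*||^{2α} is assumed for (2.1.4); the objective and the record point are illustrative):

```python
import numpy as np

def tunneling(f, x_star, alpha):
    """Tunneling function of type (2.1.4): its sign at x matches the sign
    of f(x) - f(x*), so non-positive values mark points no worse than the record."""
    f_star = f(x_star)
    def T(x):
        return (f(x) - f_star) / np.linalg.norm(np.atleast_1d(x - x_star)) ** (2 * alpha)
    return T

f = lambda x: (x ** 2 - 1) ** 2 + 0.3 * x   # two wells; the right one is worse
x_rec = 0.96                                 # approximate record point (right well)
T = tunneling(f, x_rec, alpha=1.0)
```

Here T(−1.0) is negative (the left well is deeper than the record), while T(0.5) is positive.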
Some more information about implementations of the tunneling method may be found in Levy
and Montalvo (1985) and Ge and Qin (1987). The analysis of numerical results shows that
the tunneling method cannot be regarded as a very efficient one (in particular, it does not
always succeed in finding the global minimizer). The main difficulties in realizing the
tunneling method are the following: for certain parameter values (α in (2.1.4); η, β in
(2.1.5)) the tunneling function is flat and close to zero in a considerable part of X; local
minimization of a tunneling function needs to be carefully performed in order not to pass
over a minimizer; trajectories of a local optimization of T often arrive at the boundary of X;
and the termination of the search is problematic (in essence, the stopping problem is
equivalent to the main optimization problem). Nevertheless, the tunneling methods have
been created only recently, and progress in increasing their numerical efficiency is quite
probable.

2.1 .5 Methods of transition from one local minimizer into another


As already noted, local optimization theory is better developed than its global counterpart.
Hence the attempts of some authors to modify local methods in such a way that a search
trajectory might pass over from one local minimizer to another, in order to find a global
minimizer, seem natural. This idea has served as a base for creating some heuristic
algorithms (including the algorithms of this and the preceding section). In this section we

shall suppose that the objective function f is sufficiently smooth (as a rule, twice
continuously differentiable).
A simple transition algorithm may be constructed by alternating descents to
local minimizers with ascents to local maximizers. As an initial direction of an ascent
(descent), it is natural to use the last direction of the preceding descent (respectively, ascent)
in this algorithm. Its disadvantage is the possibility of cycling over a collection of
local minimizers and failing to reach a global minimizer. Another disadvantage is
that many evaluations of the objective function and its derivatives are wasted on
investigating nonprospective subsets (in particular, on ascents to maximizers). The former
disadvantage may be removed by surrounding minimizers with ellipsoids in order
to prevent wasted descents (Treccani et al. (1972)). But already in the bivariate case the
above algorithm is rather complicated to realize (see Codes (1975)), and for
greater dimensions the possibility of its efficient realization is indeed problematic.
Many global minimization methods of this section originate from investigating
properties of solutions of various differential equations. One of the first such attempts is
the heavy ball algorithm (Pshenichnij and Marchenko (1967)), according to which the
search trajectory coincides with the trajectory of a ball moving on the surface generated by
the objective function. The globality of the heavy ball algorithm is connected with the fact
that a ball moving by inertia may pass over flat hollows (but may stop in one of them).
The search trajectories for a general class of algorithms including the heavy ball
algorithm and the algorithms of Zhidkov and Schedrin (1968), Incerti et al. (1979), and
Griewank (1981) are discrete approximations of solutions of second order ordinary
differential equations of the form

μ(t)x''(t) + ν(t)x'(t) = −∇f(x(t)),   (2.1.11)

subject to the initial conditions

x(t₀) = x₀,  x'(t₀) = z₀,

where μ(t) and ν(t) are functions of time t. According to classical mechanics, the equation
(2.1.11) represents Newton's law for a particle of mass μ(t) in a potential f subject to a
dissipative force −ν(t)x'(t). At the initial time t = t₀, the particle is at the point x₀ and has
the motion direction z₀. Given suitable assumptions on the functions f, μ, ν, any trajectory
converges to a local minimizer of f, tending to pass over flat minima.
Let c be a certain upper bound for f* = min f (i.e. c ≥ f*). Griewank (1981) showed
that the search trajectory obtained from (2.1.11) with

μ(t) = [f(x(t)) − c]/ε,  ν(t) = −[∇f(x(t))]'

cannot converge to a local minimizer with objective function value greater than c; further,
a point x⁽⁰⁾ ∈ X will be reached with the value f(x⁽⁰⁾) ≤ c. The latter algorithm seems
to be the most promising of this group.
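A crude explicit discretization of (2.1.11) can be sketched with constant mass μ and friction ν (the time-varying choices of Griewank's method are not attempted here; the objective, step size, and iteration count are illustrative):

```python
import numpy as np

def heavy_ball(grad, x0, mu=1.0, nu=0.5, dt=0.05, steps=2000):
    """Explicit Euler discretization of mu*x'' + nu*x' = -grad f(x), eq. (2.1.11),
    with constant mass mu and constant friction nu (a minimal sketch)."""
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)                 # initial motion direction z0 = 0
    for _ in range(steps):
        a = (-grad(x) - nu * v) / mu     # Newton's law with dissipative force
        v = v + dt * a
        x = x + dt * v
    return x

grad = lambda x: 2.0 * x                 # f(x) = ||x||^2, minimum at the origin
x_min = heavy_ball(grad, np.array([2.0, -1.5]))
```

With friction present, the trajectory spirals into the (here unique) minimizer; with small friction the "ball" overshoots shallow basins before settling.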
Let us consider another class of global optimization algorithms based on solving
differential equations. The main representative of it is the method developed by Branin
(1972) and Branin and Ho (1972).

Let f be twice continuously differentiable and assume that not only the values f(x) can
be evaluated, but also the values of the gradient g(x) = ∇f(x) and the Hessian H(x) = ∇²f(x)
for all x ∈ X. Consider the system of simultaneous differential equations

d/dt g(x(t)) = s g(x(t))   (2.1.12)

subject to the initial condition g(x(0)) = g₀, where s is a constant taking either the value +1
or −1. The solution of the system (2.1.12) has the form g(x(t)) = g₀e^{st}, and if s = 1, t → −∞,
or s = −1, t → ∞, it converges to g = 0. This means that the trajectory corresponding to a solution
tends to a stationary point of f. The Branin method consists of sequentially solving (2.1.12)
(in order to attain a stationary point of f) and alternating the sign of s (in order to pass over
from one stationary point to another). To solve the system (2.1.12), rewrite it as follows:

x'(t) = sH⁻¹(x(t)) g(x(t)).   (2.1.13)

Let A(x) be the adjoint matrix of H(x), which determines the inverse of the Hessian by the
formula H⁻¹(x) = A(x)/det H(x). Then (2.1.13) may be replaced by the system

x'(t) = sA(x(t)) g(x(t)).   (2.1.14)

The latter is determined for all x. It is obtained from (2.1.13) by a change of the time scale
when det H(x) ≠ 0, and by inverting the time when passing through points where
det H(x) vanishes. If (2.1.14) is used, then the sign of the constant s should be alternated
both when attaining a stationary point and at points where det H(x) vanishes.
An ordinary method of numerically solving these first order differential equations is
based on the use of the discrete approximation

x_{k+1} = x_k + γ_k ẋ(x_k),  k = 0, 1, ...   (2.1.15)

where ẋ = dx/dt denotes the right-hand side of the equation, x₀ is an arbitrary point
from X, and γ₀, γ₁, ... is a certain sequence of nonnegative numbers.
Applying (2.1.15) to solve (2.1.13) we obtain

x_{k+1} = x_k + sγ_k H⁻¹(x_k) g(x_k),  k = 0, 1, ...

that is, the Newton method with a variable step length γ_k. The system (2.1.14) may be
solved analogously.
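The Newton iteration obtained from (2.1.15) and (2.1.13) can be sketched as follows; a constant step γ and a quadratic test function are assumed, and with s = −1 the iteration settles at the (unique) stationary point:

```python
import numpy as np

def branin_newton(grad, hess, x0, s=-1.0, gamma=0.5, steps=100):
    """Discretization (2.1.15) of (2.1.13): x_{k+1} = x_k + s*gamma*H^{-1}(x_k) g(x_k).
    With s = -1 this is the damped Newton method converging to a stationary point;
    in the Branin method the sign s is flipped after a stationary point is attained."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x + s * gamma * np.linalg.solve(hess(x), grad(x))
    return x

# Quadratic test function f(x) = 0.5 x^T A x with A positive definite.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
grad = lambda x: A @ x
hess = lambda x: A
x_stat = branin_newton(grad, hess, np.array([1.0, -2.0]))
```

For this quadratic each step halves the distance to the stationary point at the origin.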

A comprehensive description of the Branin method and its modifications, as well as
numerical results, are contained in Hardy (1975) and Gomulka (1978). Treccani (1978)

studied a function f whose contours are topologically equivalent to spheres and for which
the Branin method may fail to converge. Anderson and Walsh (1986) proposed an easily
implemented version of the Branin method for minimizing two-dimensional functions of a
special kind.
Branin's idea of changing the system of simultaneous differential equations after
reaching a stationary point is used in Yamashita (1976) to solve a minimization problem
on a set

X = {x ∈ Rⁿ : h(x) = 0},  where h = (h₁, ..., h_m): Rⁿ → R^m,  m < n,

the functions f and h_j are three times continuously differentiable, and the matrix B(x) with
elements B_ij(x) = ∂h_j/∂x_i is of full rank. His work shows that the local optimizers
(maximizers as well as minimizers) of f under the restrictions h(x) = 0 are stable states of
trajectories corresponding to systems of differential equations

ẋ = B(x)λ(x) − sg(x),   dh(x)/dt = −h(x),

where g(x) = ∇f(x), s ∈ {−1, 1}, and λ: Rⁿ → R^m is an unknown vector function determined
by the system (λ is an analog of the Lagrange multipliers). Eliminating λ from the system,
we are led to

ẋ = −sP(x)g(x) − Q(x)h(x),

where

P = I_n − B(BᵀB)⁻¹Bᵀ,

The discretization (2.1.15) may be used to solve this system. Sequentially alternating the
sign of the constant s after reaching a stable state allows one to find several local minimizers
of f under the restrictions h(x) = 0. Certainly, the above approach does not guarantee that a
global minimizer is reached.
All algorithms based on solving differential equations have the following
disadvantages. First, there are no general results on their convergence to a global
optimizer. Hence, one may guarantee that a global minimizer is found only if it is assured
that all local minimizers are found. Second, the above algorithms are relatively
complicated to implement and to investigate. In particular, it is necessary to evaluate the
derivatives of the objective function (for some algorithms also the Hessian), and the
possibility of using finite difference approximations instead of the derivatives is not
obvious. It is difficult to draw a general conclusion on the efficiency of the above
algorithms. It should be noted, however, that there are numerical examples which
demonstrate that for finding a global optimizer the Branin method requires fewer
evaluations of the objective function than the tunneling algorithms (note, however, that the
Branin method also requires evaluations of the first and second derivatives).

2.1.6 Algorithms based on smoothing the objective function

Let a function f satisfy the Lipschitz condition and be represented as f = f₁ + f₂, where f₁ is
a uniextremal function and the supremum norm of f₂ is much less than the norm of f₁,
i.e. sup|f₂| ≪ sup|f₁|. According to Katkovnik (1976), the smoothed function

f̂(x, β) = ∫ f(z) p(x − βz) dz   (2.1.16)

is uniextremal for sufficiently large β > 0, while its global minimizer is close to x* for
sufficiently small β > 0. Here p(z) is the kernel of the smoothing operator, being a
continuously differentiable, unimodal, symmetric density of a probability distribution on
Rⁿ. The case of the nondifferentiable kernel

p(z) = 1_A(z),   (2.1.17)

where A = [−a, a]ⁿ, is analogous and is considered in Mikhalevitch et al. (1987).


In view of the above qualitative results, the following global minimization concept is
sensible: find the minimizer of the uniextremal function (2.1.16) for some β > 0 and,
starting from this point, try to descend into a minimizer of f. From the point of view of
local optimization theory, the following extended algorithm is sensible, too: the values of β
are sequentially decreased (slowly enough) while the number of steps increases (i.e. a
minimizer is approached more closely). In the case (2.1.17), the latter algorithm was
investigated. But even for this rather simple kernel, the problem of clearly determining
the functional classes for which the algorithm converges has not been clarified.
The impossibility of the exact evaluation of the smoothed function (2.1.16) is the main
computational difficulty encountered when realizing the above conceptual algorithms.
Instead of the exact values, Monte Carlo estimators of the form

f̂(x, β) ≈ (1/N) Σ_{j=1}^{N} [f(ξ_j) p(x − βξ_j)] / φ(ξ_j)   (2.1.18)

are usually applied, where ξ₁, ..., ξ_N are independent realizations of a random vector with
a probability density φ(x) which is positive on X. The use of (2.1.18) implies that the
evaluations of (2.1.16) are subject to random errors. Thus, to minimize (2.1.16) it is
necessary to use a stochastic approximation type algorithm. The solution of the problem
would be facilitated if, analogously to (2.1.18), one could use the Monte Carlo estimators

∇_x f̂(x, β) ≈ (1/N) Σ_{j=1}^{N} [f(ξ_j) ∇p(x − βξ_j)] / φ(ξ_j)

of the gradient values of (2.1.16); here ∇p(z) is the gradient of the density p(z).
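A minimal sketch of the estimator (2.1.18) in one dimension, assuming a Gaussian kernel p and a uniform sampling density φ on [−a, a] (both choices, and the sample size, are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def smoothed_value(f, x, beta, n_samples=200_000, a=4.0):
    """Monte Carlo estimator (2.1.18) of the smoothed function (2.1.16):
    average of f(xi_j) * p(x - beta*xi_j) / phi(xi_j) over xi_j ~ phi."""
    xi = rng.uniform(-a, a, n_samples)            # realizations with density phi
    phi = 1.0 / (2.0 * a)                         # uniform density on [-a, a]
    p = np.exp(-0.5 * (x - beta * xi) ** 2) / np.sqrt(2.0 * np.pi)  # Gaussian kernel
    return np.mean(f(xi) * p / phi)

# Sanity check: for f = 1 the integral (2.1.16) equals 1/beta (here 1), up to
# the negligible truncation of the kernel outside [-a, a].
est = smoothed_value(lambda z: np.ones_like(z), x=0.0, beta=1.0)
```

The random error visible in this estimate is exactly why a stochastic approximation scheme is needed to minimize (2.1.16).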
Poor numerical results and insufficiently well-founded theoretical results concerning the
above algorithms do not permit drawing a clear conclusion on their efficiency.

2.2 Set covering methods


In practice, the variation rate of an objective function is usually bounded. The set
F = Lip(X, L, ρ) of Lipschitz functions represents a well-known example of a class of such
functions. This set contains those functions which satisfy the inequality

|f(x) − f(z)| ≤ L ρ(x, z)

for all x, z ∈ X (again, ρ is a fixed metric defined on X).
Let the variation rate of the objective function f be known. Then, knowing its value
f(x_i) at a point x_i ∈ X, a set

(2.2.1)

can be determined, where δ is a fixed positive number. If the points x₁, ..., x_N are chosen in
X in such a way that the subsets X₁, ..., X_N form a covering of the set X

(that is, X ⊂ ∪_{i=1}^{N} X_i),

then the global minimization problem (1.1.1) is solved with accuracy δ with respect to the
function value. In this case, the inequality

(2.2.2)

holds, where

f* = min f.   (2.2.3)

Methods of selecting points x₁, ..., x_N having the above property are called covering
methods; they are the main subject of this section.
A covering method may be described in terms of either the subsets X₁, ..., X_N or the
points x₁, ..., x_N. In the former case one should try to construct sets X_i of maximal volume
and to achieve a simple structure of the sets

X \ ∪_{i=1}^{k} X_i

for all k = 1, 2, .... In the latter case more formal methods to analyse the quality of point
collections Ξ_N = {x₁, ..., x_N} or sequences {x₁, x₂, ...} are generally used.
If the way of choosing the points x_i ∈ Ξ_N or the sets X_i (i = 1, ..., N) does not depend
on the values which the objective function takes at the points x_j ∈ Ξ_N (j ≠ i), then the
point set Ξ_N = {x₁, ..., x_N} is called a grid and the corresponding minimization algorithm
is called a grid (or passive covering) algorithm; these will be considered in Section 2.2.1.

Section 2.2.2 describes sequential (active) covering algorithms, in which all the
previously chosen points x₁, ..., x_{k−1} and function values f(x₁), ..., f(x_{k−1}) may be used
when choosing the next point x_k (k = 2, 3, ...).
Section 2.2.3 deals with the problem of the optimality of global optimization algorithms
and with the practical usefulness of optimal algorithms.

2.2.1 Grid algorithms (Passive coverings)

A grid is a point set Ξ_N = {x₁, ..., x_N} constructed independently of the values f(x_i),
i = 1, ..., N, of the function f. A (passive) grid algorithm is a global minimization method
consisting of constructing a grid Ξ_N, computing the values f(x_i) for all x_i ∈ Ξ_N and
choosing the point

x_N* = arg min_{x_i ∈ Ξ_N} f(x_i)   (2.2.4)

as an approximation of a global minimizer x* = arg min f (and the value

f_N* = f(x_N*)

as an approximation of the minimal value f* = min f). It is also possible to add a local
descent routine with the initial point x_N*, or to construct an approximation of f and use
the global minimizer of the approximation instead of (2.2.4).
Grid algorithms are well known in global optimization theory: a number of
theoretical studies have been devoted to them. This is a consequence of the following
attributes of the grid algorithms:

a) simplicity of construction,
b) optimality in some well-defined sense,
c) simplicity of investigation,
d) simplicity of realization on a multiprocessor computer.

The attribute a) is relative: this section shows that only a few grids for simply
structured sets X (usually for cubes or hyperrectangles only) are constructed simply. But
even if a grid is not simply constructed, the corresponding grid algorithm still remains
simple, and only one of its stages is difficult to realize. Let us note that once this stage
is done, it can be used repeatedly, if one solves a multicriteria optimization problem or
several global optimization problems on the same set X.
The attribute b) is studied in Section 2.2.3 in detail. It turns out that for some types of
functional classes F there exist grid algorithms that are optimal, typically in the minimax
(worst case) sense. From the practical point of view, however, this optimality property
gives almost nothing: the reason is that the objective functions appearing in practice are
usually not like the functions at which optimal algorithms perform best. Nonoptimal
algorithms may hence be much better than optimal ones for most other functions from a
given functional class F.

The attributes c) and d) follow from the fact that the location of the grid points is
independent of the values of f. The attribute d) is evident, and c) is discussed in this
section and in Section 2.2.3.
Grid algorithms may be characterized in many different ways. One of the most
important is the formation of a cover of X by the balls

B(x_i, ε, ρ) = {x ∈ X : ρ(x, x_i) ≤ ε},   (2.2.5)

where x_i ∈ Ξ_N, ρ is a metric on X, and ε is the radius of the balls. If the sets (2.2.5) form a
cover, i.e.

X ⊂ ∪_{i=1}^{N} B(x_i, ε, ρ),   (2.2.6)

then any point from X (and thus the global minimizer x*, too) belongs to at least one ball
from the collection (2.2.5). If f ∈ Lip(X, L, ρ), then, under (2.2.6), the minimization
problem is solved with accuracy δ = Lε with respect to the values of f, since

f_N* − f* ≤ f(x_j) − f(x*) ≤ L ρ(x_j, x*) ≤ Lε,

where x_j is the centre of a ball from the collection (2.2.5) to which the global minimizer x*
belongs. Similar results hold also for some other functional classes F (see Evtushenko
(1985), p. 473).
Thus, the grid algorithms have some theoretically attractive properties; on the other
hand, they have a basic disadvantage, too: namely, they completely neglect the
information on the objective function that is obtained during the search process. This
disadvantage makes grid algorithms utterly inefficient, especially for high-dimensional
and complicated optimization problems. In practice, grid algorithms are used efficiently
only for searching for the global extremum of several objective functions given on the same
set X (in particular, in multicriteria optimization, see Sobol and Statnikov (1981)), or as
parts of algorithms using random grids or deterministic ones constructed similarly to the
global random search methods of Chapter 5, see Galperin (1988), Niederreiter and Peart
(1986), Galperin and Zheng (1987). Nevertheless, the theoretical significance of grid
algorithms is non-negligible: this is caused by their minimax optimality, simplicity, and use
as a pattern in comparative studies of global optimization algorithms.
The degree of uniform distribution in a certain metric is a quality criterion of a grid. Let
us introduce the appropriate notions.
Let μ be a finite measure on the feasible region X (X is considered as a measurable
space). A point sequence {x_k}_{k=1}^∞ is called uniformly distributed in measure μ in the
set X if x_k ∈ X for all k = 1, 2, ... and the asymptotic relationship

lim_{N→∞} S_N(A)/N = μ(A)/μ(X)   (2.2.7)

is valid for any measurable subset A ⊂ X, where S_N(A) is the number of points x_k
(1 ≤ k ≤ N) belonging to A. If μ = λ_n is the Lebesgue measure and (2.2.7) is valid, then the
sequence {x_k} is called uniformly distributed in X. A grid Ξ_N is called uniform (in
measure μ) if it contains the first N points of a sequence uniformly distributed in
measure μ.
Uniformity is an asymptotic property: degrees of grid uniformity are usually expressed
in terms of dispersion and discrepancy, defined as follows.
Let ρ be a metric on X. The value

d_ρ(Ξ_N) = sup_{x ∈ X} min_{1 ≤ i ≤ N} ρ(x, x_i)   (2.2.8)

is called the ρ-dispersion of the grid Ξ_N = {x₁, ..., x_N}. If ρ is the Euclidean metric, then
(2.2.8) is called the dispersion and is denoted by d(Ξ_N).
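The ρ-dispersion (2.2.8) can be approximated by replacing the supremum over X with a maximum over a finite set of probe points; a sketch for the Euclidean metric (the probe set and the example grid are illustrative):

```python
import numpy as np

def dispersion(grid, probe):
    """Approximate the Euclidean dispersion (2.2.8): the supremum over x of the
    distance from x to the nearest grid point, with the supremum replaced by a
    maximum over a finite set of probe points."""
    grid = np.asarray(grid)
    best = 0.0
    for x in probe:
        nearest = np.min(np.linalg.norm(grid - x, axis=1))  # distance to nearest grid point
        best = max(best, nearest)
    return best

# 1-D example on [0,1]: the midpoint grid {0.25, 0.75} has dispersion 0.25,
# attained at the endpoints 0 and 1 and at the centre 0.5.
grid = np.array([[0.25], [0.75]])
probe = np.linspace(0.0, 1.0, 1001).reshape(-1, 1)
d = dispersion(grid, probe)
```

For the covering interpretation (2.2.5)-(2.2.6), the dispersion is the smallest radius ε for which the balls centred at the grid points cover X.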
The dispersion is one of the most generally used characteristics of grid uniformity. If the
unit ball in the metric ρ,

{x ∈ Rⁿ : ρ(0, x) ≤ 1},   (2.2.9)

is symmetric with respect to a change of the order of the variables, then the ρ-dispersion is
a characteristic of the uniformity of grids.
The above mentioned symmetry property of the unit ball is valid, e.g., for the
Euclidean metric

ρ(x, z) = [Σ_{i=1}^{n} (x^(i) − z^(i))²]^{1/2}   (2.2.10)

and, more generally, for all ℓ_p-metrics

ρ(x, z) = Σ_{i=1}^{n} |x^(i) − z^(i)|   (p = 1),   (2.2.11)

ρ(x, z) = [Σ_{i=1}^{n} |x^(i) − z^(i)|^p]^{1/p}   (1 < p < ∞),   (2.2.12)

ρ(x, z) = max_{1 ≤ i ≤ n} |x^(i) − z^(i)|   (p = ∞).   (2.2.13)

(The last metric is also called cubic.) However, the above property is not valid, e.g., for the
metric

ρ(x, z) = Σ_{i=1}^{n} a_i |x^(i) − z^(i)|,   (2.2.14)

where the numbers a₁, ..., a_n are nonnegative and not all equal to a fixed number. In
formulae (2.2.10) through (2.2.14) the values x^(i) and z^(i) for i = 1, ..., n are the
coordinates of the points x, z ∈ Rⁿ.
The importance of the ρ-dispersion as a characteristic of a grid global optimization
algorithm is explained by the inequality

f_N* − f* ≤ L d_ρ(Ξ_N),   (2.2.15)

which is valid for any function f ∈ Lip(X, L, ρ). More generally, the inaccuracy of a grid
algorithm may be estimated with the help of the ρ-dispersion and the modulus of continuity

ω_ρ(t) = sup_{x,y ∈ X, ρ(x,y) ≤ t} |f(x) − f(y)|   (2.2.16)

of the function f in the metric ρ. Indeed, simple calculations of Niederreiter (1986) lead to
the inequality

f_N* − f* ≤ ω_ρ(d_ρ(Ξ_N)).   (2.2.17)

Inequality (2.2.15) is a special case of (2.2.17), since ω_ρ(t) ≤ Lt for any f ∈ Lip(X, L, ρ).
In this connection it should also be noted that if prior information about the location of the
optimal points is not available, then one should prefer uniform grids to non-uniform ones.
If X is the unit cube, that is,

X = [0, 1]ⁿ,   (2.2.18)

then the value

D_N(Ξ_N) = sup_B |S_N(B)/N − λ_n(B)|,   (2.2.19)

where the supremum is taken over the set of all hyperrectangles

B = [0, b₁] × ... × [0, b_n],  0 < b_j ≤ 1 (j = 1, ..., n),

is also a uniformity characteristic of the grid Ξ_N; this value is called the discrepancy.



In the case (2.2.18), dispersion and discrepancy are closely related: in general, small
discrepancy values correspond to small dispersion. Niederreiter (1983) proved that for
any grid Ξ_N the inequalities

(2.2.20)

are valid. In this way, if the evaluation of the dispersion is practically impossible, then the
inequalities (2.2.20) can be used for estimating the order of the dispersion.
The next property of grids is important from the practical point of view. A grid is said
to be composite if it keeps its features when the number of points N is changed to N + 1
(for each N). Many known grids do not possess this property.
Let us define the most popular uniform grids (the majority of them can be defined for the
case (2.2.18) only). If the feasible set X is not a cube, then a uniform grid can be defined as
follows: find a cube Y containing the set X, construct a grid Ξ in Y, and form the grid Ξ ∩ X
in X.
A cubic grid Ξ_N1 in X = [0, 1]ⁿ contains N = pⁿ points

((i₁ + 1/2)/p, (i₂ + 1/2)/p, ..., (i_n + 1/2)/p),  i₁, i₂, ..., i_n ∈ {0, 1, ..., p − 1},

where p is a fixed natural number. To construct such a grid, one may divide the cube X
into pⁿ equal subcubes by dividing every side of X into p equal parts, and choose the grid
points as the centres of the subcubes.
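The construction of the cubic grid Ξ_N1 can be sketched directly from the formula (the function name is illustrative):

```python
import numpy as np
from itertools import product

def cubic_grid(n, p):
    """Cubic grid Xi_N1 on [0,1]^n: the p^n centres of the subcubes obtained by
    splitting each side into p equal parts."""
    return np.array([[(i + 0.5) / p for i in idx]
                     for idx in product(range(p), repeat=n)])

g = cubic_grid(2, 3)   # 9 points with coordinates in {1/6, 1/2, 5/6}
```

The number of points grows as pⁿ, which already hints at why cubic grids become impractical in high dimensions.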
A rectangular grid Ξ_N2 for the case of the hyperrectangle

X = [a₁, b₁] × ... × [a_n, b_n]

is constructed by dividing every side [a_i, b_i] of X into p_i parts of lengths ℓ_i = (b_i − a_i)/p_i
and choosing the points

(a₁ + (i₁ + 1/2)ℓ₁, ..., a_n + (i_n + 1/2)ℓ_n),

where i_k ∈ {0, 1, ..., p_k − 1} for each k = 1, ..., n, and N = p₁ ⋯ p_n. Of course, all cubic
grids are also rectangular.
A rectangular grid Ξ_N2 is uniform only in the case p_k = const·(b_k − a_k), k = 1, ..., n. In
this context it should be mentioned that the uniformity property is not invariant under scale
transformations. For example, if a cube is transformed into a (non-cube) hyperrectangle,
then a uniform grid in the cube will induce a nonuniform grid in the hyperrectangle.
Cubic and rectangular grids are the simplest ones. They have some optimality properties
(as will be shown below), but they are not composite.
We call a grid random if it consists of N independent realizations of a random vector
uniformly distributed on X. Random grids Ξ_N3 are simple to construct (not only for the
case (2.2.18) but also for many other types of sets), being also uniform and composite; on

the other hand their uniformity characteristics (p-dispersion and discrepancy) are far from
optimal.
Random grids are widespread in global optimization theory and practice; there are
two reasons for this. First, if X is neither a cube nor a hyperrectangle, then tremendous
difficulties may be faced in constructing grids with good uniformity characteristics.
Second, if the values of f at random points are known, then one may use the procedures of
mathematical statistics to obtain information on the function f and the location of its
extremal points.
The grids Ξ_Ni (i = 4, 5, 6) described below are defined on the unit cube X = [0, 1]ⁿ and are
called quasirandom grids; their elements are called quasirandom points. This
nomenclature refers to the application of these points in numerous algorithms similar to
Monte Carlo methods that use random points. The uniformity of quasirandom grids is
better than that of random ones, but slightly worse than optimal. From the practical point
of view, quasirandom grids are preferable to the optimal ones (including cubic grids); the
reasons for this preference will be given in Section 2.2.3.
The Hammersley–Halton grids Ξ_N4 form an important class of quasirandom grids:
they consist of the first N terms of the Halton sequence, which is defined as follows. For
integers η ≥ 2 and k ≥ 1 let

k = Σ_{i=0}^{∞} a_i η^i

be the η-adic expansion of k, let the function φ_η be defined by

φ_η(k) = Σ_{i=0}^{∞} a_i η^{−i−1},

and let p₁, ..., p_n be n distinct prime numbers. Then the i-th term x_i of the Halton
sequence is

x_i = (φ_{p₁}(i), ..., φ_{p_n}(i)),  i = 1, 2, ...
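The radical-inverse function φ_η (mirroring the η-adic digits of k about the radix point) and the Halton points can be sketched as follows (a standard construction; the function names are illustrative):

```python
def radical_inverse(k, eta):
    """phi_eta(k): reflect the eta-adic digits of k about the radix point,
    so the digit a_i of eta^i contributes a_i * eta^(-i-1)."""
    result, scale = 0.0, 1.0 / eta
    while k > 0:
        k, digit = divmod(k, eta)   # peel off the lowest eta-adic digit
        result += digit * scale
        scale /= eta
    return result

def halton_point(i, primes):
    """i-th term of the Halton sequence with one distinct prime base per coordinate."""
    return tuple(radical_inverse(i, p) for p in primes)

pts = [halton_point(i, (2, 3)) for i in range(1, 9)]   # 8 points in [0,1]^2
```

Successive points fill the cube far more evenly than independent uniform draws, which is the source of the improved discrepancy of Ξ_N4.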

Lattice grids Ξ_N5 consist of the points

x_k = ({a₁k/N}, ..., {a_n k/N}),  k = 1, ..., N,

where {·} denotes the fractional part of a number, and a₁, ..., a_n are to be suitably chosen
from tables (see Korobov (1963), Hua and Wang (1981)). Lattice grids are rather popular
in applied mathematics; note that they are not composite.
The computation of binary Π_τ-grids and η-adic Π₀-grids Ξ_N6 is more complicated,
see Sobol (1969, 1985), Faure (1982). The η-adic Π_τ-grids Ξ_N6 are composite and have
an additional uniformity property. If the number N of grid points is fixed, then the
grids Ξ_N4 and Ξ_N6 are sometimes modified in that i/N is substituted for the n-th
coordinate of the grid points x_i. This gives grids that are not composite but possess
slightly better uniformity characteristics, see Niederreiter (1978), Faure (1982).

An intermediate place between the random and Π_τ-grids is occupied by stratified
sample grids Ξ_N7. These grids contain the elements of a stratified sample in X. To obtain
a grid Ξ_N7 for N = mℓ (where m and ℓ are natural numbers), any measurable set X, and
any finite measure μ on X, one must divide X into m parts X_i (i = 1, ..., m) of equal
measure,

μ(X_i) = μ(X)/m,  ∪_{i=1}^{m} X_i = X,  μ(X_i ∩ X_j) = 0,  i ≠ j,

and sample ℓ times each probability distribution P_i (i = 1, ..., m) determined by

P_i(A) = m μ(A ∩ X_i)/μ(X)

for measurable subsets A ⊂ X.


Stratified sample grids are not composite, but they may be constructed for a wide class of
sets X and surpass the random grids in some characteristics that are important for global
optimization, see Section 4.4.
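A stratified sample grid on X = [0, 1] (the simplest case, with μ the Lebesgue measure) can be sketched as follows; the strata are m intervals of equal length and ℓ uniform points are drawn in each:

```python
import numpy as np

rng = np.random.default_rng(1)

def stratified_grid(m, ell):
    """Stratified sample grid Xi_N7 on [0,1]: split the interval into m parts of
    equal measure and draw ell independent uniform points in each (N = m * ell)."""
    points = []
    for i in range(m):
        points.extend(rng.uniform(i / m, (i + 1) / m, ell))
    return np.array(points)

g = stratified_grid(m=4, ell=3)   # 12 points, exactly 3 in each quarter of [0,1]
```

Unlike a plain random grid, no stratum can be left empty, which is what improves the uniformity characteristics relevant for global optimization.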
An important place in global optimization theory is occupied by the grids Ξ_N8
consisting of the centres of balls B(x_i, ε, ρ) that form a minimal cover of the set X.
Here ε > 0 is a fixed number, ρ is a metric, and N equals the minimal number of balls
forming a cover of X. The grids Ξ_N8 are not composite, and it is usually very hard to
construct them. The exception is the case N = pⁿ under the relations (2.2.13) and (2.2.18),
for which the grids Ξ_N1 and Ξ_N8 coincide. The theoretical importance of the grids Ξ_N8
is connected with the optimality (in the case F = Lip(X, L, ρ)) of the grid algorithms
using them, see Section 2.2.3.
Some other optimal and nearly optimal (for other functional classes F) grid
optimization algorithms are known, see Ganshin (1979), Ivanov et al. (1985), Nefedov
(1987), Zaliznjak and Ligun (1978). These grids are usually rather difficult to construct
and are thus not very suitable for practical purposes.
Let us cite here statements on the dispersion and discrepancy of the cubic, random
and quasirandom grids defined on the unit cube X = [0, 1]ⁿ. Proofs of these statements can
be found in Niederreiter (1978) and Sobol (1969).
For cubic grids Ξ_N1 one has

(2.2.21)

For random grids Ξ_N3, with any fixed probability (less than one),

d(Ξ_N3) = O(N^{−1/(2n)}),  N → ∞.   (2.2.22)



For quasirandom grids Ξ_Ni, i = 4, 5, 6,

(2.2.23)

Besides, for η-adic Π_τ-grids Ξ_N6 with N_j = η^j

The general term in the asymptotic expression (2.2.23) has the form B(n)N⁻¹ logⁿ N,
where

log B(n) = O(n log n),  n → ∞,

for the grids Ξ_N4;

log B(n) = O(n log log n),  n → ∞,

for binary Π_τ-grids and optimal values of τ = τ(n); finally,

B(n) → 0,  n → ∞,

for η-adic Π₀-grids with minimal values of η = η(n).


The above mentioned expressions lead to the following conclusions. Under the criterion D_N the best grid among those mentioned is the η-adic Π_0-grid, while the other quasirandom grids are also good. The cubic grid Ξ_N^1 is optimal for n=1, but is poor for large dimensions; it is worse than the random grid for n ≥ 3. At the same time, for any dimension n the order of convergence of {d(Ξ_N)} to zero for N → ∞ is optimal for cubic grids and equals N^(-1/n). For quasirandom grids this order is N^(-1/n) log N, N → ∞, i.e. it is almost optimal.
The dispersion criterion d is judged as more significant than the discrepancy in global optimization theory. Numerical and theoretical studies have shown, however, that the goodness of cubic grids for global optimization algorithms is questionable. For example, Section 2.2.3 demonstrates that the dispersion criterion must not be the unique optimality criterion applied to grids.
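These criteria are easy to probe numerically. The sketch below (our illustration; all names and parameter values are chosen for the example only) estimates the dispersion d(Ξ_N) = sup_x min_i ρ(x, x_i) by Monte Carlo sampling, for a cubic and for a random grid in the Euclidean metric:

```python
import itertools, math, random

def dispersion(points, dim, trials=20000, seed=0):
    # Monte Carlo lower bound on d(grid) = sup_x min_i |x - x_i| over [0,1]^dim
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        x = [rng.random() for _ in range(dim)]
        worst = max(worst, min(math.dist(x, p) for p in points))
    return worst

n, p = 2, 5                                  # dimension n, cubic grid of N = p^n points
centres = [(2 * j + 1) / (2 * p) for j in range(p)]
cubic = list(itertools.product(centres, repeat=n))
rng = random.Random(1)
rand_grid = [tuple(rng.random() for _ in range(n)) for _ in range(p ** n)]

print("cubic :", dispersion(cubic, n))       # approaches (sqrt(n)/2) N^(-1/n) = 0.1414... from below
print("random:", dispersion(rand_grid, n))
```

For N = 25 points in dimension 2, the estimate for the cubic grid stays below its exact dispersion (√n/2)N^(-1/n) ≈ 0.141, while the random grid is typically noticeably worse.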

2.2.2 Sequential covering methods


The grid algorithms described above are the simplest covering methods. Their covering
strategy is independent of the computed values of f, i.e. these algorithms are passive. On
the contrary, sequential (active) covering methods use the information obtained during the
search: these algorithms are of great theoretical and practical interest.

Let us consider first the main idea of covering methods, supposing that they are designed for optimizing an objective function to a fixed accuracy δ>0 with respect to its values. Let the function f be evaluated at points x_1,...,x_k. We shall call the value

f_k* = min{f(x_1),...,f(x_k)}

a record, and a point x_k* with the function value f(x_k*) = f_k* a record point. Let us define the set

Z_k = {x ∈ X: f(x) ≥ f_k* - δ}.     (2.2.24)

Obviously, x_i ∈ Z_k for i = 1,...,k.
Points of Z_k are not of interest for further search, since the record f_k* can be improved on the set Z_k by not more than δ. Consequently, the search may be continued on the set X\Z_k only. In particular, if

Z_k = X,     (2.2.25)

then the initial problem is approximately solved, since f_k* - f* ≤ δ, i.e. the record point x_k* can be taken as an approximation for x*.
Thus, the construction of a covering method is reduced to the construction of a point sequence {x_k} and a corresponding set sequence {Z_k} until the condition (2.2.25) is fulfilled for some k.
Let us note that the record sequence {f_k*} is decreasing and the set sequence {Z_k} is increasing, i.e.

f*_{k+1} ≤ f*_k and Z_k ⊆ Z_{k+1} for all k = 1,2,...     (2.2.26)

Let us note also that the efficiency of covering methods depends significantly on the closeness of the records f_k* to the optimal value f*, since the size of the sets Z_k crucially depends on these values. Consequently, to improve the efficiency of a covering method one should apply a local minimization technique just after obtaining a new record point. This gives an opportunity to decrease the current record and, hence, to increase the size of the sets Z_k.

Let us consider the covering method for the case F=Lip(X,L,ρ); methods for functions which have a gradient satisfying the Lipschitz condition are constructed analogously, see Evtushenko (1985), p. 472.
In the case F=Lip(X,L,ρ), for every x, y ∈ X one has

f(y) - Lρ(x,y) ≤ f(x).     (2.2.27)

Hence, the inequality

f(x) ≥ f_k* - δ     (2.2.28)

is valid for any x ∈ X satisfying

f_k* - δ ≤ f(y) - Lρ(x,y)

for some y ∈ X. It follows that the inequality (2.2.28) holds for any x ∈ B(x_j, η_jk, ρ), where 1 ≤ j ≤ k, η_jk = (f(x_j) - f_k* + δ)/L. Thus, we may take

Z_k = ∪_{j=1}^k B(x_j, η_jk, ρ) = {x ∈ X: max_{1≤i≤k} [f(x_i) - Lρ(x,x_i)] ≥ f_k* - δ}.     (2.2.29)

For all k ≥ 1, a new point x_{k+1} can be obtained in different ways. For instance, one may choose the new points among the points of a grid. Devroye (1978) proposed random grids for the above aims, see Algorithm 3.1.4 later.
In the class of covering methods based on (2.2.29), the method proposed by Piyavskii (1967, 1972) and, independently, by Shubert (1972) is the most popular. This method selects the point

x_{k+1} = arg min_x max_{1≤i≤k} [f(x_i) - Lρ(x,x_i)],     (2.2.30)

that is, a minimizer of the Lipschitzian minorant

f^(k)(x) = max_{1≤i≤k} [f(x_i) - Lρ(x,x_i)]     (2.2.31)

of the objective function f, at each k-th iteration. Minimization in (2.2.30) is done either on X or on X\Z_k.
Following the Russian literature, we shall call the method (2.2.30) a polygonal line method. This nomenclature originates from the fact that (2.2.31) is a polygonal line in the case when n=1 and ρ is the Euclidean metric. The deficiency of the polygonal line method lies in the complexity of the auxiliary extremal problems (2.2.30). Wood (1985) described a constructive (but cumbersome) multidimensional variant of the method for the Euclidean metric case. Lbov (1972) proposed to solve the auxiliary extremal problems by the simplest global random search algorithm. Lbov's variant of the polygonal line method is itself a global search algorithm, similar to the Devroye (1978) algorithm mentioned above. A scheme generalizing the polygonal line method was also developed by Meewella and Mayne (1988); related but different approaches will be treated in Section 2.3.
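In the one-dimensional Euclidean case the auxiliary problem (2.2.30) is solvable in closed form on each subinterval, so the polygonal line method is easy to sketch in code. The implementation below is our illustration; the test function and the bound L = 5 are our own choices (L exceeds the true Lipschitz constant 1 + 10/3 of the function):

```python
import math

def piyavskii(f, a, b, L, n_evals=60):
    # Polygonal line (Piyavskii-Shubert) method, one-dimensional Euclidean case
    pts = sorted([(a, f(a)), (b, f(b))])
    for _ in range(n_evals - 2):
        best = None
        for (x0, f0), (x1, f1) in zip(pts, pts[1:]):
            # minimum of the minorant (2.2.31) on [x0, x1]
            t = 0.5 * (x0 + x1) - (f1 - f0) / (2 * L)
            low = 0.5 * (f0 + f1) - 0.5 * L * (x1 - x0)
            if best is None or low < best[0]:
                best = (low, t)
        pts.append((best[1], f(best[1])))
        pts.sort()
    return min(pts, key=lambda q: q[1])      # record point and record value

# test function sin(x) + sin(10x/3) on [2.7, 7.5]; |f'| <= 1 + 10/3, so L = 5 is valid
x_star, f_star = piyavskii(lambda x: math.sin(x) + math.sin(10 * x / 3),
                           2.7, 7.5, L=5.0)
print(x_star, f_star)                        # global minimum near x = 5.15, f = -1.90
```

Each iteration evaluates f at the global minimizer of the current polygonal minorant, so with a valid L the record converges to the global minimum.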
To simplify the basic covering method, one may use some subsets X_k of Z_k instead of Z_k. An algorithm using hyperrectangles as subsets was proposed and studied by Evtushenko (1971, 1985). The one-dimensional variant of the Evtushenko algorithm has the form

x_1 = a,  x_{k+1} = x_k + (f(x_k) - f_k* + 2δ)/L,  k = 1,2,...     (2.2.32)

If x_k > b, then the iteration (2.2.32) is terminated. Here X=[a,b] and the subsets X\Z_k are intervals. The most unfavourable case for algorithm (2.2.32) is that of a decreasing function f, in which the points (2.2.32) are at equal distances.
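A minimal sketch of this one-dimensional scan, assuming the step (f(x_k) - f_k* + 2δ)/L (the exact constants in Evtushenko's step vary between presentations; the test function below is illustrative):

```python
def evtushenko_1d(f, a, b, L, delta):
    # Scan [a, b] left to right; after evaluating f at x, the Lipschitz
    # condition rules out an interval around x, which the step jumps over
    x = a
    record_x, record_f = a, f(a)
    evals = 1
    x += 2 * delta / L
    while x <= b:
        fx = f(x)
        evals += 1
        if fx < record_f:
            record_x, record_f = x, fx
        x += (fx - record_f + 2 * delta) / L
    return record_x, record_f, evals

# worst case: a decreasing function, for which all steps equal 2*delta/L
rx, rf, ne = evtushenko_1d(lambda x: -x, 0.0, 1.0, L=1.0, delta=0.05)
print(rx, rf, ne)
```

For the decreasing test function the current value always equals the record, so the points are equally spaced at distance 2δ/L, in agreement with the remark above.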
Brent (1973) proposed a similar algorithm for the one-dimensional functional class F={f: |f''(x)| ≤ M}, where M is assumed known. There are other similar one-dimensional algorithms in Beresovsky and Ludvichenko (1984), Vasil'ev (1988).
Evtushenko (1985) constructed a multidimensional variant of his algorithm (2.2.32). The sets X\Z_k in it are hyperrectangles and the sets X_k are unions of cubes with centres at points obtained by a one-dimensional global optimization algorithm. Evtushenko's algorithm is cumbersome, but it does not require great computer memory or complicated auxiliary computations. The numerical study of the algorithm indicates that for dimensions n>3 it requires an excessive number of evaluations of f, and thus is inefficient.
There are strong objections against applying covering methods directly in the multidimensional case. First, the methods are cumbersome and complicated to implement. Second, their efficiency depends considerably on the prior information about f, i.e. on the choice of the functional class F. In the most important case F=Lip(X,L,ρ), the functional class F is determined by a Lipschitz constant L and the metric ρ. The inclusion f ∈ Lip(X,L,ρ) for some constant L and some metric ρ is usually a plausible conjecture, but the exact (minimal) value of the constant L is generally unknown, depending also on the choice of X, the metric ρ and the variable scales. Section 2.2.3 will demonstrate that an unfortunate choice of variable scales leads to inefficiency of optimal algorithms (even for known Lipschitz constant L). The same criticism is valid for any covering method.
Third, the number of objective function evaluations is excessive in the multidimensional case. Let us analyse this number for the case F=Lip(X,L). Let X be a ball with radius η. The volume of X is

vol(X) = η^n π^(n/2) / Γ(n/2 + 1),

where Γ denotes the gamma-function. Let x_1,...,x_N be the points of function evaluations and M = max f. One may only guarantee that f(x) > f* for x ∈ X if f has been evaluated at a point x_i from the ball B(x, (f(x_i) - f*)/L). Hence the balls

B(x_i, (f(x_i) - f*)/L),  i = 1,...,N,

must cover X to assure that the global minimizer has not been overlooked. The joint volume of these N balls is smaller than

N ((M - f*)/L)^n π^(n/2) / Γ(n/2 + 1).

Thus, for these N balls to cover X we require the fulfilment of

N > (ηL/(M - f*))^n.

If the derivatives of f in the direction of the global minimizer do not equal -L everywhere, then L > (M - f*)/η: this way the computational effort required increases exponentially with n.
Covering methods are theoretically built for the case F=Lip(X,L,ρ) with a known value of the Lipschitz constant L. In practical optimization problems, however, this constant is usually unknown. This deficiency is not crucial, since the Lipschitz constant may be estimated during the search. While estimating the Lipschitz constant, one has to bear in mind that if it is underestimated then the method will not be reliable, and if it is overestimated then the amount of computational work needed to achieve a fixed accuracy will increase exponentially.

Let f ∈ Lip(X,L,ρ) be evaluated at k points x_1,...,x_k. Then

L_k = η_k max_{1≤i<j≤k} |f(x_i) - f(x_j)| / ρ(x_i, x_j)     (2.2.33)

is a usual estimator of the Lipschitz constant L. Here {η_k} is a nonincreasing sequence of numbers which exceed 1 and are chosen heuristically. In the case η_k = 1, the values (2.2.33) generally underestimate the constant L.
In the one-dimensional case the Lipschitz constant estimator is built more easily. To obtain it, one has to renumber the points x_1,...,x_k in increasing order and then compute

L_k = η_k max_{1≤i≤k-1} |f(x_{i+1}) - f(x_i)| / (x_{i+1} - x_i).     (2.2.34)
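The one-dimensional estimator (2.2.34) takes only a few lines; the safety factor plays the role of η_k, and both it and the sample data are illustrative choices of ours:

```python
def lipschitz_estimate(xs, fs, safety=1.1):
    # Max absolute slope between consecutive (sorted) points, inflated by a
    # heuristic factor > 1: underestimating L makes a covering method
    # unreliable, overestimating it only increases the work.
    pairs = sorted(zip(xs, fs))
    return safety * max(abs(f1 - f0) / (x1 - x0)
                        for (x0, f0), (x1, f1) in zip(pairs, pairs[1:]))

xs = [0.0, 0.3, 0.55, 0.8, 1.0]
fs = [2.0, 1.4, 1.9, 1.1, 1.3]
print(lipschitz_estimate(xs, fs))   # 1.1 * 3.2, about 3.52
```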

2.2.3 Optimality of global minimization algorithms

If the dimension n of the set X or the required solution accuracy increases, then the number of objective function evaluations rapidly increases for every grid algorithm. For example, for the cubic grid algorithm we have

N ≥ (√n/(2ε))^n,

where ε is the accuracy in the argument values, in the Euclidean metric. Numerical results for the covering methods described above show that for n>3 all of them have poor efficiency. In this connection, the question of best possible covering methods is of (mainly theoretical) interest.
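The growth of this bound is easy to tabulate (the accuracy ε = 0.1 below is only an illustration):

```python
import math

def cubic_grid_size(n, eps):
    # N >= (sqrt(n) / (2 * eps))^n evaluations for the cubic grid algorithm
    return math.ceil((math.sqrt(n) / (2 * eps)) ** n)

for n in (1, 2, 3, 5, 10):
    print(n, cubic_grid_size(n, eps=0.1))
```

Already for n = 10 and ε = 0.1 the bound approaches 10^12 evaluations, which makes the point about the curse of dimensionality concrete.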
Let the number of steps (i.e. evaluations of the objective function f) be a priori bounded by a number N. Every deterministic global minimization algorithm d^N = (d_1,...,d_{N+1}) is determined by mappings

x_k = d_k(L_k),  k = 1,...,N+1.

That is, first a point x_1 is chosen; then the points x_k (k=2,...,N) can depend on all preceding arguments and the corresponding function values: x_k = d_k(L_k), where

L_k = (x_1, f(x_1), ..., x_{k-1}, f(x_{k-1})).

The estimator of an optimizer x* is the point x_{N+1} = d_{N+1}(L_{N+1}). It is usually determined by x_{N+1} = x_N*, and its inaccuracy, defined by

ε(f, d^N) = f(x_{N+1}) - f*,

is taken as a measure of inaccuracy of the method d^N. The inaccuracy of a method d^N on a certain class F is usually defined by

ε(d^N) = sup_{f∈F} ε(f, d^N),     (2.2.35)

corresponding to the minimax approach for measuring the efficiency of algorithms. (Let us remark right here that other approaches exist, too: e.g. the Bayesian efficiency measure may lead to a class of stochastic algorithms, as will be described in Section 2.4.)
Let D(N) be the set of all global minimization algorithms in which the number of steps does not exceed N, and P(N) be the set of all N-point grid algorithms in which x_{N+1} = x_N*. It is clear that P(N) ⊂ D(N). An algorithm d*^N is said to be optimal (in the minimax sense) if

d*^N = arg min_{d^N ∈ D(N)} ε(d^N).     (2.2.36)

A grid algorithm d_(*)^N is said to be the optimal grid algorithm if

d_(*)^N = arg min_{d^N ∈ P(N)} ε(d^N).     (2.2.37)

An algorithm d^N ∈ D(N) is said to be asymptotically optimal if

ε(d^N) ~ ε(d*^N) for N → ∞, i.e. lim_{N→∞} ε(d^N)/ε(d*^N) = 1.

An algorithm d^N is said to be optimal in order if the ratio ε(d^N)/ε(d*^N) does not exceed a constant for all N.

There exists another way of defining the optimality of global optimization algorithms. Its essence is in fixing a bound ε_0 for the inaccuracy (2.2.35) and then minimizing the number of steps N over the set of algorithms d^N satisfying ε(d^N) ≤ ε_0. This approach makes it possible to introduce definitions similar to those considered above.
Let us assume now that F=Lip(X,L,ρ). Then Sukharev (1971, 1975) proved that an optimal passive algorithm (see (2.2.37)) is the grid algorithm built on the grid Ξ_N^8 consisting of the centres of the balls B(x_i, ε, ρ) that form a minimal cover of X. For this algorithm, the guaranteed accuracy

ε(d_(*)^N) = Lε

is equal to the radius ε of the balls (forming the optimal cover of X) multiplied by L.
In the mentioned works it is also proved that the optimal passive algorithms (2.2.37) are also optimal, in the minimax sense, among the sequential ones (2.2.36), that is,

ε(d*^N) = ε(d_(*)^N).     (2.2.38)

(The maximum over F is attained at a saw-tooth function whose values at all points x_1,...,x_N are equal; details can be found in the papers cited.)
In terms of game theory, the above result may be explained as follows. A researcher chooses a minimization algorithm which is known to his enemy (say, to nature or to an oracle). The latter then selects the function in F most unsuitable for the algorithm. This is the saw-tooth function mentioned, having a maximal rate of variation and equal values at the points generated by the algorithm. This way the enemy eliminates the possibility of collecting valuable information concerning the objective function. So, passive algorithms are not worse (in the worst case sense) than sequential ones.
The supposition that the objective function is chosen by an enemy who knows the optimization algorithm is, of course, doubtful in practical optimization problems. The conclusion obtained (concerning passive algorithm optimality) raises additional doubts concerning the adequacy of the minimax approach. Note in this context that this approach is of great interest for uniextremal or convex functions, see Kiefer (1953), Chernousko (1960); for multiextremal problems, however, this approach calls for additional problem specifications.
Sukharev (1975) also studied a different but related concept of best algorithms. To construct these algorithms, he supposed that nature may deviate from its optimal strategy: this concept leads to a certain sequential algorithm. It is constructed as follows: let the number N of objective function evaluations be fixed and let k evaluations have already been made at points x_1,...,x_k. Then the point x_{k+1} is chosen as the centre of a ball entering the optimal cover of the set X\X_k, where X_k is the subset of X already excluded by the evaluations at x_1,...,x_k, and η is the fixed lower bound of the radii of the balls which form the optimal N-ball cover.
For the worst function, the points x_1,...,x_N generated by this type of algorithm coincide with the optimal grid algorithm points. For other functions, the accuracy of the sequentially best algorithm may be better. But the main problem of Sukharev's algorithm lies in its very complicated construction, since optimal covers have to be built at every step of the algorithm.
If N = p^n, where p is a natural number, and the metric ρ = ρ_0 is cubic (i.e. (2.2.13) is fulfilled), then the cubic grid algorithm is the minimax optimal global minimization algorithm for the Lipschitz functional class F = F_0 = Lip(X,L,ρ_0). The same algorithm is also optimal for the more general functional class F_s, s ≥ 0, consisting of functions whose derivatives up to order s belong to F_0, see Ivanov (1971, 1972). In the latter case, the algorithm has a guaranteed accuracy of order O(L/N^((s+1)/n)) for N → ∞. Ivanov et al. (1985) described another approach to the construction of an algorithm optimal in order in F_s. It consists of minimizing a spline built using the values of f at uniformly chosen grid points.
As stated above, optimal (in the above sense) deterministic algorithms can be built under the supposition that the enemy knows the minimization algorithm. The researcher may make an attempt to increase his gain (that is, the guaranteed accuracy) by using a randomized strategy. In this case it is supposed that nature does not know the researcher's strategy, but knows its statistical characteristics. Sukharev (1971) considered passive randomized algorithms α^N for the case F = F_0. They are determined by the joint probability distribution of N random vectors in X; we denote these distributions by α^N(dΞ_N). The mean accuracy of a randomized algorithm α^N for a function f is the value

ε(f, α^N) = ∫_{X^N} ε(f, Ξ_N) α^N(dΞ_N).

We call a randomized algorithm α*^N minimax-optimal if it satisfies the relation

α*^N = arg min_{α^N} max_{f∈F} ε(f, α^N).

Sukharev (1971) proved that in the one-dimensional case, for X=[0,1], the optimal randomized algorithm α*^N is determined by a random choice between the grids

Ξ_N,1 = {0, 2/(2N-1), 4/(2N-1), ..., (2N-2)/(2N-1)} and Ξ_N,2 = {1/(2N-1), 3/(2N-1), ..., 1}

with equal probabilities 0.5. This algorithm has the guaranteed mean accuracy

max_{f∈F} ε(f, α*^N) = L/[2(2N-1)].

Note that this value is almost twice smaller than the guaranteed accuracy L/(2N) of the optimal deterministic algorithm: the latter is the cubic grid algorithm for the grid

Ξ_N^1 = {1/(2N), 3/(2N), ..., (2N-1)/(2N)}.
Sukharev (1971) also showed that for all n ≥ 1 an analogous inequality holds, from which it follows that as the dimension n of X increases, the improvement of the guaranteed accuracy achievable by randomization decreases.
The construction of optimal sequential randomized algorithms is interesting, but rather complicated. The one-step optimal randomized algorithm is known only in the one-dimensional case, see Sukharev (1981). The construction problem for this algorithm is reduced to that of finding the optimal strategy in a matrix game on the unit square.
More information concerning minimax-optimal algorithms can be found e.g. in Fedorov, ed. (1985), Ivanov (1972), Strongin (1978), Sukharev (1971, 1975, 1981) or Schoen (1982).

Let us note here that, in spite of Shubert's (1972) assertion, the polygonal line algorithm (2.2.30) is not one-step optimal (as is easily seen already for N=2); following Sukharev (1981), a slight modification of the algorithm is one-step optimal. Sequential algorithms will be studied later.
The aim of the further study in this section is to show that minimax-optimal global minimization algorithms may have very poor efficiency in realistic optimization problems. The results below again question the practical significance of the above and similar (minimax-type) optimality criteria and of the optimal algorithms derived from them.
Let X=[0,1]^n, F=Lip(X,L), and let the objective function f depend on s variables (s<n) with coordinate indices i_1,...,i_s (1 ≤ i_1 < ... < i_s ≤ n), but not on all n variables. Let us denote by K = K(i_1,...,i_s) the corresponding s-dimensional cube, that is, the s-face of the cube X in which the variables with indices i_1,...,i_s vary from zero to one and all the others are equal to zero.
A grid Ξ_N on X induces the grid Ξ_N(s) on K consisting of the projections of the points of Ξ_N onto K.
For the dispersion of a grid Ξ_N(s) there is an analogue (2.2.39) of (2.2.20), and an inequality (2.2.40) replacing (2.2.15).
In fact, the case of a function defined on an n-dimensional set but depending only on s variables is hardly probable in practice. However, it is usual that an objective function depends on all n variables but the degree of dependence is different: there is a group of essential variables that influences the function behaviour more intensively than the others. In other words, in complicated cases one may expect that the objective function f has the form f = h + g, where h >> g and h depends only on s variables (s<n). But in this case the accuracy of a grid-type minimization algorithm depends on the values d(Ξ_N(s)) for different s<n, i_1,...,i_s, and not on the value d(Ξ_N) only. Thus, the collection {d(Ξ_N(s))} for different s ≤ n, i_1,...,i_s is a natural vector criterion of a grid algorithm. Let us use this indicator to compare the cubic, random, and quasirandom grids.
If N=p^n and Ξ_N = Ξ_N^1 is a cubic grid, the grid Ξ_N^1(s) contains only p^s distinct points. This gives

d(Ξ_N^1(s)) = (√s/2) N^(-1/n),     (2.2.41)

see Sobol (1982). The rate of decrease of the values d(Ξ_N(s)) for N → ∞ is not influenced by s. For small values of s, this rate is much worse than the optimal rate N^(-1/s).

At the same time, for every N the projections Ξ_N(s) of the random Ξ_N^3 and quasirandom Ξ_N^i (i=4,5,6) grids contain N distinct points. So for random grids, with any probability less than one,

D_N(Ξ_N^3(s)) = O(N^(-1/2)),  N → ∞,

and for quasirandom grids

D_N(Ξ_N^i(s)) = O(N^(-1) log^s N),  N → ∞,  i = 4,5,6.

These relations, together with the right side of the inequality (2.2.39), imply that for random grids (with any probability less than one) we have

d(Ξ_N^3(s)) = O(N^(-1/(2s))),  N → ∞,     (2.2.42)

and for quasirandom grids Ξ_N^i, i=4,5,6,

d(Ξ_N^i(s)) = O(N^(-1/s) log N),  N → ∞.     (2.2.43)

By (2.2.41) - (2.2.43) we arrive at the conclusion that the quasirandom grids qualitatively surpass the random and cubic ones by the criterion d(Ξ_N(s)) for all s<n, and the random grids surpass the cubic ones for s<n/2. The rate of decrease of d(Ξ_N(s)) for N → ∞ is nearly optimal for quasirandom grids for all s ≤ n.
Thus, for the functional subset F ⊂ Lip(X,L) under consideration, the cubic grids are worse (in the above sense) than quasirandom ones and may even be worse than random grids. Recall that the cubic grid minimization algorithm is optimal for the case F=Lip(X,L,ρ_0), where ρ_0 is the cubic metric, and is optimal in order for F=Lip(X,L).
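The collapse of cubic grid projections is immediate to check numerically: N = p^n cubic grid points project onto only p^s distinct points, while N random points stay distinct with probability one. A small illustration of ours:

```python
import itertools, random

n, p, s = 3, 4, 1     # N = p^n grid points in [0,1]^3, project onto s coordinates
centres = [(2 * j + 1) / (2 * p) for j in range(p)]
cubic = list(itertools.product(centres, repeat=n))
rng = random.Random(0)
rand_grid = [tuple(rng.random() for _ in range(n)) for _ in range(p ** n)]

project = lambda grid: {pt[:s] for pt in grid}       # projection onto the first s coordinates
print(len(project(cubic)), len(project(rand_grid)))  # p^s = 4 distinct points versus N = 64
```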
We shall now demonstrate that a similar situation takes place when it is supposed that f ∈ Lip(X,L,ρ), but in fact f ∈ Lip(X,L,ρ_1), where ρ and ρ_1 are metrics for which Lip(X,L,ρ_1) ⊂ Lip(X,L,ρ).
Let X=[0,1]^n and f ∈ F_1 = Lip(X,L,ρ_1), where the metric ρ = ρ_1 is defined by (2.2.14); furthermore, let a_1,...,a_n denote the weights entering ρ_1.
The relation f ∈ F_1 follows from f ∈ F_2 = Lip(X,L,ρ_2), where the metric ρ = ρ_2 is defined by (2.2.11). The condition f ∈ F_2 is a typical condition on f in theory as well as in practice (because precise information about f is always absent, and when formulating the Lipschitz condition on f one usually chooses a metric from the collection (2.2.10) - (2.2.13), which contains metrics having the unit ρ-ball symmetry property discussed earlier).
The relation f ∈ F_2 means that

|f(x) - f(z)| ≤ L Σ_{i=1}^n |x(i) - z(i)|

for all x, z ∈ X; the relation f ∈ F_1 is equivalent to

|f(x) - f(z)| ≤ Σ_{i=1}^n L_i |x(i) - z(i)|,     (2.2.44)

where L_i = a_i L ≤ L for all i=1,...,n. For any function f ∈ F_2 the true (minimal) constants L_i exist. They are usually unknown, but it is precisely they (not L) that determine the true accuracy of a global optimization algorithm.
We shall suppose that a_i > 0 for each i=1,...,n (the opposite case was considered earlier) and introduce the arithmetic and geometric mean values

a_0 = (1/n) Σ_{i=1}^n a_i,  a_G = (a_1 ··· a_n)^(1/n).

According to (2.2.15), the value d_ρ1(Ξ_N) is a natural quality characteristic of a grid Ξ_N for the case F=Lip(X,L,ρ_1).
If a_i > 0 for each i=1,...,n and the natural numbers n, N are arbitrary, then (see Sobol (1987))

d_ρ1(Ξ_N) ≥ (1/2) (n!)^(1/n) a_G N^(-1/n)     (2.2.45)

for any grid Ξ_N,

d_ρ1(Ξ_N^1) = (1/2) n a_0 N^(-1/n)     (2.2.46)

for a cubic grid Ξ_N^1, and

d_ρ1(Ξ_N^6) ≤ c n a_G N^(-1/n)     (2.2.47)

for an η-adic Π_τ-grid Ξ_N^6, where the constant c does not depend on n, N.
The formulas (2.2.45), (2.2.46) lead to the following conclusions. If all the values L_i in (2.2.44) are positive, then cubic grids are optimal in order, but the ratio of the right sides of (2.2.45) and (2.2.46) can be arbitrarily small (for sufficiently large n). If some values L_i equal zero, then the cubic grids are not optimal in order (this corresponds to the case considered above). If L_1 = ... = L_n, then the ratio of the right sides of (2.2.45) and (2.2.46) exceeds e^(-1) for all n and N: this is again evidence of the high (theoretical) efficiency of cubic grids in this case.
Comparing now the right sides of (2.2.45) and (2.2.47), we find that quasirandom Π_τ-grids are optimal in order, uniformly with respect to the values L_1,...,L_n. Since these values are usually unknown and may be rather different from each other, grids that are uniformly optimal (in order) are to be preferred to grids that are optimal only for the case L_1 = ... = L_n. In particular, according to the above criterion, Π_τ-grids are better than cubic ones (though in the classical minimax sense the opposite judgement holds).

2.3 One-dimensional optimization, reduction and partition techniques


As is clear by now, the multidimensional multiextremal optimization problem is a fairly complicated issue of computational mathematics. It is natural that some authors have made attempts to reduce the problem to simpler ones. In Section 2.1 we considered methods based on using local optimization techniques. Mathematical statistics procedures are widely used in global random search methods: these will be investigated in Chapter 4. This section considers methods of global multidimensional search based on partition techniques and on the use of one-dimensional methods, as well as algorithms reducing the optimization problem to other problems of computational mathematics (approximation, interpolation, evaluation of integrals, etc.). First we shall review some well-known methods of one-dimensional global optimization.

2.3.1 One-dimensional global optimization

We shall suppose below that X is an interval, X=[a,b], -∞ < a < b < ∞.

Much attention in the literature has been paid to the one-dimensional global optimization problem: this is associated with the relative simplicity of the problem and also with the existence of a number of multidimensional algorithms based on multiple use of one-dimensional optimization methods.
Many different one-dimensional global optimization algorithms exist: they are known to be practically efficient under various a priori conditions concerning the objective function.
For the most frequently investigated case F=Lip(X,L,ρ), where the Lipschitz constant L is known, a number of methods have been described in Section 2.2.
When F=Lip(X,L,ρ) and the Lipschitz constant is not known, it may be estimated e.g. by formula (2.2.34): this way, proper modifications of the above methods are applicable. For some classes of functions smoother than Lipschitzian, algorithms similar to the above-mentioned one-dimensional methods have been built by Brent (1973), Ganshin (1977), Berezovsky and Ludvichenko (1984).
In principle, one may construct more or less efficient one-dimensional algorithms using any general global optimization technique. Some methods obtained in this way (e.g. global random search methods) are relatively inefficient in the one-dimensional case. Others seem to be most efficient precisely in the one-dimensional case (including the majority of covering, interval and Bayesian methods).
There exist many global optimization algorithms which are applicable only in the one-dimensional case. For example, Batishev (1975) and Pevnyi (1982) considered algorithms based on a spline approximation of the objective function; Jacobsen and Torabi (1978) reduced the problem of one-dimensional global optimization of a function represented as a sum of convex and concave functions to a sequence of convex function minimization problems.
The following general scheme comprises a considerable number of (one-dimensional) adaptive partition algorithms in global optimization.
Algorithm 2.3.1. (Pinter (1983))

1. Set x_1 = a, x_2 = b, k = 2 and evaluate f(x_1), f(x_2).

2. Renumber the k points x_1,...,x_k in increasing order, with objective function values f(x_i), i=1,...,k. Then we have a = x_1 < x_2 < ... < x_k = b.

3. For all subintervals Δ_i = [x_i, x_{i+1}], i=1,...,k-1, of the interval X=[a,b] calculate the value R(i) of the interval characteristic, which depends on the vertices of the subinterval Δ_i, the best function value f_k* found so far, and the function values at the vertices. Choose the index (or one of the indices)

j = arg max_{1≤i≤k-1} R(i).

4. If the length of the subinterval Δ_j is less than or equal to a fixed small positive number ε (that is, x_{j+1} - x_j ≤ ε), then the algorithm stops (other numerical stopping criteria may also be used). Otherwise take x_{k+1} = S(j), where S is a function independent of the objective function values not belonging to the subinterval Δ_j, except the value f_k* (the function R has the same property).

5. Evaluate f(x_{k+1}), substitute k+1 for k and return to Step 2.

The convergence of the above general partition scheme was investigated in Pinter (1983) and generalized to the multidimensional case in Pinter (1986a,b).
Let us consider the forms of the functions R, S generated by some well-known algorithms. To obtain the Strongin minimization algorithm (see Strongin (1978)) under the supposition f ∈ Lip(X,L), one should choose

R(i) = L(x_{i+1} - x_i) + (f(x_{i+1}) - f(x_i))²/(L(x_{i+1} - x_i)) - 2(f(x_{i+1}) + f(x_i)),     (2.3.1)

S(j) = (x_j + x_{j+1})/2 - (f(x_{j+1}) - f(x_j))/(2L).     (2.3.2)

In the case F=Lip(X,L,ρ) the functions R, S for the Strongin algorithm are given by an analogue (2.3.3) of (2.3.1) and by

S(j) = (x_j + x_{j+1})/2 - (1/2) sign(f(x_{j+1}) - f(x_j)) η^(-1)(|f(x_{j+1}) - f(x_j)|/L),     (2.3.4)

where η(z) = ρ(z,0) and η^(-1) is the inverse function of η. The polygonal line method of Piyavskii-Shubert (2.2.30), in the case F=Lip(X,L), is obtained by defining

R(i) = L(x_{i+1} - x_i)/2 - (f(x_i) + f(x_{i+1}))/2     (2.3.5)

and determining S by (2.3.2). (Note that, as is easily seen, if one wants to maximize an objective function f, then it is necessary to change the minus sign to plus in the last terms in (2.3.1) - (2.3.5).) Note also that the one-dimensional algorithms presented later in Section 2.4 can also be represented within the frames of Algorithm 2.3.1.
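The scheme admits a compact sketch in which the characteristics R and S are parameters, so that e.g. the Piyavskii pair (2.3.5), (2.3.2) or Strongin's pair can be plugged in. The test function and constants below are illustrative choices of ours:

```python
def partition_search(f, a, b, R, S, eps=1e-4, max_evals=100):
    # Adaptive partition scheme of Algorithm 2.3.1: repeatedly split the
    # subinterval maximizing the characteristic R at the point given by S.
    pts = sorted([(a, f(a)), (b, f(b))])
    evals = 2
    while evals < max_evals:
        fstar = min(fv for _, fv in pts)
        j = max(range(len(pts) - 1), key=lambda i: R(pts[i], pts[i + 1], fstar))
        if pts[j + 1][0] - pts[j][0] <= eps:
            break                                  # stopping criterion of Step 4
        x_new = S(pts[j], pts[j + 1], fstar)
        pts.append((x_new, f(x_new)))
        pts.sort()
        evals += 1
    return min(pts, key=lambda q: q[1])

L = 4.0  # valid Lipschitz constant for the test function below on [0, 2]
R = lambda p0, p1, fs: L * (p1[0] - p0[0]) / 2 - (p0[1] + p1[1]) / 2    # (2.3.5)
S = lambda p0, p1, fs: (p0[0] + p1[0]) / 2 - (p1[1] - p0[1]) / (2 * L)  # (2.3.2)
x_best, f_best = partition_search(lambda x: (x - 0.3) ** 2 + 0.1, 0.0, 2.0, R, S)
print(x_best, f_best)   # the test function attains its minimum 0.1 at x = 0.3
```

With the pair (2.3.5), (2.3.2) this reproduces the polygonal line method; replacing R and S changes the search strategy without touching the loop.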

2.3.2 Dimension reduction in multiextremal problems

There are ways of reducing multidimensional multiextremal optimization problems to one or several optimization problems of smaller dimension (in particular, of dimension 1). These approaches will be outlined below.
The simplest way of using one-dimensional global optimization techniques is to utilize the scheme of coordinate-wise optimization, similarly to some approaches in local optimization theory, see Mockus (1967). Certainly, a coordinate-wise global optimization algorithm cannot guarantee, in general, that its limit point is a global minimizer. As a consequence, such algorithms are not popular.
A number of multidimensional global optimization algorithms apply one-dimensional
search in randomly chosen directions. Consider e.g. the following.

Algorithm 2.3.2. (Bremermann (1970))

1. Take an arbitrary point x_1 ∈ X, evaluate f(x_1), set k=1.

2. Choose a random line passing through the point x_k (or choose a random direction emanating from x_k) by generating an isotropic probability distribution with centre x_k (for instance, the uniform distribution on the surface of the unit sphere S = {x ∈ R^n: ||x - x_k|| = 1} can be used).

3. Choose five equidistant points on the above line with x_k as their middle point. Evaluate the objective function at these points.

4. Construct a fourth-degree polynomial interpolation on the line.

5. Obtain a third-degree polynomial by differentiating the above fourth-degree one. Calculate the zeros of the third-degree polynomial by the Cardano formula. Evaluate the objective function at those calculated zeros which belong to X.

6. Choose x_{k+1} as the point with the smallest objective function value in the point set containing the five points from Step 3 and the (not more than three) points from Step 5.

7. Substitute k+1 for k and return to Step 2.


The convergence of Algorithm 2.3.2 is guaranteed only for the case when the objective function is a polynomial of degree not higher than four. In particular, it may be applied for evaluating the roots of any polynomial P(z) by reducing the equation P(z)=0 to a system Q_i(x)=0, i=1,...,m, of second-degree simultaneous equations in new variables x and finding the minimizers of the function

Σ_{i=1}^m Q_i²(x).

Note that there are many algorithms similar to Algorithm 2.3.2. The essential part of the method consists of a fourth-degree polynomial approximation of the function

f_k(α) = f(x_k + α ξ_k)

on the interval [-1,1] for α, where ξ_k is a realization of a random vector having an isotropic distribution in R^n with centre at zero. Gaviano (1975) studied a theoretical version of the above algorithm whose iterations involve the minimization of the one-dimensional function f_k(α) over the interval [0,1]. Computational experiments of some authors showed that better results can be obtained if the range of α is extended. The simplest global random search method (Algorithm 3.1.1) may be regarded as a modification of Algorithm 2.3.2 in which minimization of the function f_k(α) is carried out on the two-point set {0,1}.
The most popular theoretical scheme of dimension reduction is the so-called multistep
dimension reduction based on the representation

min_{x∈X} f(x) = min_{0≤x(1)≤1} ... min_{0≤x(n)≤1} f(x(1),...,x(n))    (2.3.6)

where f is a continuous function on the unit cube X = [0,1]^n and the point x ∈ X has the
coordinates x(i), i = 1,...,n. In particular, for n = 2 the representation (2.3.6) gives

min_{0≤x,z≤1} f(x,z) = min_{0≤x≤1} φ(x)

where

φ(x) = min_{0≤z≤1} f(x,z).

One may use the formula (2.3.6) for reducing the original optimization problem on a cube
to a number of one-dimensional global optimization problems (but usually this number is
very large). If n ≥ 3 and the functional class ℱ is broad enough (such as Lip(X,L)), then
the algorithms based on (2.3.6) are cumbersome and inefficient. But for some relatively
narrow classes ℱ the multistep reduction scheme may serve as the base of efficient
algorithms. For example, if the objective function f is separable, i.e.

f(x) = Σ_{i=1}^n f_i(x(i))

where f_1,...,f_n are one-dimensional functions, then according to (2.3.6) one has

min_{x∈X} f(x) = Σ_{i=1}^n min_{0≤x(i)≤1} f_i(x(i)).

In this case, solving the n-dimensional optimization problem amounts to solving n one-dimensional
ones. In a practically more important case, f is represented as follows:

f(x) = Σ_{i=1}^{n−1} f_i(x(i), x(i+1))    (2.3.7)

where f_1,...,f_{n−1} are two-dimensional functions. Here (2.3.6) reduces to

min_{x∈X} f(x) = min_{0≤x(n)≤1} φ_n(x(n))

where

φ_2(x(2)) = min_{x(1)} f_1(x(1), x(2)),
φ_i(x(i)) = min_{x(i−1)} [φ_{i−1}(x(i−1)) + f_{i−1}(x(i−1), x(i))]

for i ≥ 3. Hence, to solve a global optimization problem for the objective function (2.3.7),
one may tabulate the one-dimensional functions φ_i (i = 2,...,n) whose values are solutions of
one-dimensional global optimization problems. A more detailed description of the
multistep reduction scheme is contained in Strongin (1978).
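For objective functions of the form (2.3.7), the tabulation of the φ_i amounts to a dynamic-programming sweep; a minimal sketch (the common grid per coordinate is an assumption, and each inner minimization is done by brute force over the grid rather than by a genuine one-dimensional global method):

```python
import numpy as np

def multistep_minimize(pair_funcs, grid):
    """Multistep dimension reduction for f(x) = sum_i f_i(x(i), x(i+1)):
    tabulate phi_2, ..., phi_n of the recursion behind (2.3.6)-(2.3.7)
    on a common grid and return the minimum of phi_n over x(n)."""
    # phi_2(v) = min over x(1) of f_1(x(1), v)
    phi = np.array([min(pair_funcs[0](u, v) for u in grid) for v in grid])
    # phi_i(v) = min over u of [phi_{i-1}(u) + f_{i-1}(u, v)],  i >= 3
    for fi in pair_funcs[1:]:
        phi = np.array([min(phi[j] + fi(grid[j], v) for j in range(len(grid)))
                        for v in grid])
    return float(phi.min())
```

Each sweep costs O(|grid|^2) evaluations, so n coupled variables are handled at a cost linear in n instead of the |grid|^n cost of exhaustive search.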
Another way of reducing multidimensional global optimization problems to one-dimensional
ones is dimension reduction by means of Peano curve type mappings.
These are continuous maps φ of the interval [0,1] onto the cube X = [0,1]^n. Using them one
has
min_{x∈X} f(x) = min_{t∈[0,1]} f(φ(t)).

The possibility of using one-dimensional algorithms for optimizing the objective function
g(t) = f(φ(t)) follows from the next proposition due to Strongin (1978): if f ∈ Lip(X,L),
then g ∈ Lip([0,1], M_n(L), ρ_n), where M_n(L) is a constant depending on L, n, φ and ρ_n is
the metric defined by the formula

ρ_n(t,t') = |t − t'|^{1/n}.

If φ is the Peano curve, then

M_n(L) = 4L√n.

Strongin (1978) thoroughly studied the numerical construction of approximate Peano
curves and their properties. The author has no convincing data in favour of the efficiency of
the corresponding global optimization algorithms in the case n > 2. The main difficulty here is
that the one-dimensional function g(t) may be so complicated that its minimization may
cause greater difficulties than the minimization of f. (Already for n = 2 and a linear function
f, the corresponding one-dimensional function g is essentially multiextremal.)
Nevertheless, the approach may well be useful for a kind of visual analysis of
the objective function behavior.
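For illustration, a discrete Hilbert curve (a close relative of the Peano curve, chosen here only because its index-to-point conversion is short to code) can serve as a computable stand-in for φ; minimizing g(t) = f(φ(t)) then amounts to a one-dimensional search over the index t:

```python
def hilbert_point(order, t):
    """Map an index t in {0, ..., 4**order - 1} along a discrete Hilbert
    curve to a cell of the 2**order x 2**order grid, scaled into [0,1]^2.
    Consecutive indices land in adjacent cells, mimicking the continuity
    of a Peano-type map phi: [0,1] -> [0,1]^2."""
    n = 2 ** order
    x = y = 0
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                       # rotate the current quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x / (n - 1), y / (n - 1)

def minimize_via_curve(f2d, order):
    """Exhaustive one-dimensional search along the curve (a sketch; a real
    method would apply a univariate global algorithm to t instead)."""
    return min(f2d(*hilbert_point(order, t)) for t in range(4 ** order))
```

The adjacency of consecutive cells is exactly the feature that transfers a Lipschitz property of f to a Hölder property of g, as in the proposition above.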
Saltenis (1989) analysed a statistical approach to studying whether an objective
function can be represented as a sum of functions depending on smaller numbers of
variables. This approach is based on the so-called random balance method, which is
classical in the theory of screening experiments, see Ermakov and Zhigljavsky (1987).
It uses the dispersions of different projections of an objective function f,
or integrals of f with respect to different subsets of variables, as structure characteristics.
Although the approach is heuristic and hardly permits serious theoretical
investigation, there are several examples of successful applications to practical problems.

2.3.3 Reducing global optimization to other problems in computational mathematics

The multidimensional global optimization problem is closely connected with a great


number of other problems in computational and applied mathematics. This connection
becomes apparent in constructing some algorithms or in immediate reduction of the global
optimization problem to other problems. Let us outline principal approaches.
Various global search methods based on the multiple use of local algorithms have been
described in Section 2.1. Some of them may be generalized for the case when the
objective function is evaluated in the presence of random noise. In this case the role of local
algorithms is played by stochastic approximation type procedures, see Section 8.1.
Section 2.3.2 described multidimensional methods of global optimization based on one-
dimensional search methods.
Any global optimization problem, in general, may be reduced to an approximation
problem. While doing this, one should obtain a global minimizer of an approximating
function and then utilize a local descent routine from the above minimizer as its initial
point. If splines are used for approximation in the above methods, then they may possess
some optimality properties, see Pevnyi (1982, 1988), Ganshin (1977). In spite of this, the
mentioned methods have no significant practical importance, because the goodness criteria
for optimization and approximation algorithms are different in essence (viz. approximation
accuracy has to be assured on the whole set X, while in optimization only in the vicinity of a
global optimizer). Numerical investigations confirm this. For example, Chujan
(1986) found the global optimizer of some one-dimensional functions several hundred
times more economically than the mentioned algorithm of Pevnyi (1982), which is based on
utilizing splines.
Ideas of approximation are fruitfully applied in global optimization methodology in the
following way: a rough approximation of the objective function is constructed (not
necessarily in an explicit form); nonpromising subsets of X are determined by this
approximation and are excluded from further search; in the remaining subset of X, a
more accurate approximation is built, and analogous operations are carried out until a
given accuracy is attained. A considerable part of the global random search algorithms treated
in this book are constructed according to this principle.
Another way of applying approximation in global search is to construct algorithms
using multiple approximation of the objective function projections (they may be one-, two-
or more dimensional): a typical example is Algorithm 2.3.2.
Global optimization algorithms consisting in the solution of ordinary differential
equations or systems of such equations are important as well: these algorithms were
described in Section 2.1.3. The connection between the theories of differential equations and global
optimization is based on the fact that the trajectories corresponding to solutions of
certain classes of differential equations contain (or converge to) one or more local
optimizers of a given function.
Another type of connection is between stochastic differential equations and global
optimization theories. Section 3.3.3 shows for some functional classes ℱ that if ε(t)
approaches zero slowly enough as t tends to infinity, then the trajectory corresponding to
the solution of the stochastic differential equation

dx(t) = −∇f(x(t)) dt + ε(t) dw(t),    x(0) = x_0,

converges in probability to a global minimizer; here x_0 is an arbitrary point in X and w is
the n-dimensional Wiener process. This approach is of considerable theoretical interest.
Global optimization and certain integral equations are also related. Section 5.3 deals
with probability measures that are solutions of integral equations of some types and are
concentrated in the vicinity of a global maximizer of the function f: hence, algorithms
generating random vectors distributed approximately according to the above probabilistic
solutions may be regarded as global optimization algorithms.
Global search based on using integral representations is a popular approach, especially
in theoretical works. Some well-known representations (formulated for the problem of
global maximization) are as follows.
Let 'X. be a compact subset of R. n, the function f be continuous on 'X. and attains the
global maximum M=max f at the unique point x* with coordinates x*G), j=I,... ,n. Then,
according to Pinkus (1968), we have
x*(j) = lim_{λ→∞} ∫ x(j) exp{λf(x)} dx / ∫ exp{λf(x)} dx.    (2.3.8)

If, in addition, f is nonnegative, then

x*(j) = lim_{λ→∞} ∫ x(j) f^λ(x) dx / ∫ f^λ(x) dx    (2.3.9)

and

M = lim_{λ→∞} ∫ f^{λ+1}(x) dx / ∫ f^λ(x) dx.    (2.3.10)

General conditions, under which relations (2.3.9) and (2.3.10) hold, will be given in
Section 5.2.2.
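As a numerical illustration of (2.3.9), the two integrals can be estimated on a grid and their ratio taken for a moderately large λ (a sketch; normalizing f by its maximum is a numerical safeguard against overflow and does not change the ratio, and in practice Monte-Carlo estimators of the two integrals are used instead of a grid):

```python
import numpy as np

def argmax_via_powers(f, grid, lam):
    """Estimate a coordinate of the global maximizer of a nonnegative f via
    (2.3.9): x*(j) is approximated by the ratio of grid sums standing in for
    the integrals of x f^lambda and f^lambda."""
    w = (f(grid) / f(grid).max()) ** lam   # normalized to avoid overflow
    return float((grid * w).sum() / w.sum())
```

With f(x) = 1 − (x − 0.3)² on [0,1], taking λ = 500 concentrates the weight near the maximizer x* = 0.3.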
If f is a continuous nonnegative function, then for any λ > 0 the evident inequality

∫ f^{λ+1}(x) dx ≤ M ∫ f^λ(x) dx

is valid, an equivalent form of it being

∫ [f(x)/f(x*)]^{λ+1} dx ≤ ∫ [f(x)/f(x*)]^λ dx.    (2.3.11)

Hence, if for some λ > 0 and x* ∈ X the inequality

∫ [f(x)/f(x*)]^{λ+1} dx > ∫ [f(x)/f(x*)]^λ dx    (2.3.12)

(opposite to (2.3.11)) is fulfilled, then the point x* cannot be a global maximizer of f.
The condition (2.3.12) is sufficient for a point not being a global maximizer. It is non-constructive
and thus seems to be of small practical significance. Namely, if while evaluating the
integrals in (2.3.12) one finds a point x(0) ∈ X such that f(x(0)) > f(x*), then x* is surely not
a global maximizer; and if such a point x(0) is not found, then the inequality (2.3.12) for
estimators of the integrals will not be valid.
Analogously to (2.3.11) and (2.3.12), a necessary and sufficient condition

lim sup_{λ→∞} ∫ [f(x)/f(x*)]^λ dx < ∞

for a point x* to be a global maximizer can be obtained (this condition is also non-constructive).
The representations (2.3.8) - (2.3.10) are more constructive: they are basic
for some global optimization algorithms (e.g. see Ivanov et al. (1985» involving
simultaneous estimation of several integrals. (Note that the problem of optimal
simultaneous Monte-Carlo estimation of several integrals will be studied in Section 8.2.)
Still, it is difficult to understand why a point whose coordinates are approximations to the
limits (2.3.9) should be better than the record point obtained during the estimation of the
integrals.
It should be noted that the asymptotic representations (2.3.9) and (2.3.10) are useful
for theoretical investigations of some global random search algorithms, see Section 5.2.
Let us point out also that besides (2.3.8), an equivalent asymptotic representation

x*(j) = lim_{λ→∞} ∫ x(j) exp{−λf(x)} dx / ∫ exp{−λf(x)} dx

for the coordinates of the unique global minimizer x* of f is widely known.


Section 2.1.6 describes a method of minimization of a smoothed function that is close
in spirit to the methods based on the above integral representations.
Ideas of discrete optimization have proved to be useful for some global optimization
problems. Let us outline some principal ways of their use. The first and most evident
way is the discretization of the set X (replacing X by a discrete subset) and the use of a
discrete algorithm for optimizing the corresponding discrete function. In particular, this
approach was used by Nefedov (1987) for constructing an optimization algorithm for the
case when the set X is determined by inequality constraints. The second way (see Kleibom
(1967), Bulatov (1987)) is based on constructing the convex hull X_0 of X and the
convex envelope f_0 of f and minimizing f_0 on X_0 with the help of discrete optimization
methods. The third and most useful way is the use of the branch and bound principle,
which is a powerful general instrument in discrete optimization. This principle is considered
in the next section.

2.3.4 Branch and bound methods

The main idea of branch and bound strategies is the sequential rejection of those subsets of
X that cannot contain a global minimizer, searching further only in the remaining
subsets (regarded as prospective).
At the k-th iteration of a branch and bound method it is necessary to construct a
partition (i.e. branching) of the optimization region (at the first iteration this region is X)
into a finite number I_k of subsets X_i, on which lower bounds t_i of

m_i = inf_{x∈X_i} f(x)    (2.3.13)

can be given by evaluations of f at some points from X_i, i ∈ I_k.

At each iteration the record f_k* = f(x_k*) is also used: this is the smallest objective
function value obtained so far, thus being an upper bound for f* = min f.
Since subsets X_i for which t_i ≥ f_k* can never contain a global minimizer, they can be
excluded from the further search; all subsets X_j for which t_j < f_k* are left. The partition
is then further refined, most naturally by branching the subset X_j with

t_j = min_{i∈I_k} t_i

into smaller subsets, and the iterations are continued.
The convergence problem and implementation aspects for the above class of (global
optimization) methods have been investigated under different conditions, see Horst (1986), Horst
and Tuy (1987), Pinter (1986a, 1988). Convergence is ensured by the fact that the lower
bounding procedure is asymptotically accurate, i.e. t_i converges to m_i when the volume
of X_i approaches zero.
Of course, for too broad functional classes, finding a lower bound for the global
minimum is of similar difficulty as finding the global minimum itself. Thus it is possible
to construct efficient branch and bound methods only for sufficiently narrow functional
classes ℱ; examples of such classes ℱ are considered below.
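For a concrete narrow class, take one-dimensional Lipschitz functions with known constant L: on a subinterval [a_i, b_i] the endpoint values give the lower bound t_i = (f(a_i)+f(b_i))/2 − L(b_i − a_i)/2. A minimal branch-and-bound sketch built on this bound (the priority queue and the tolerance are implementation choices, not part of the principle):

```python
import heapq

def lipschitz_branch_and_bound(f, a, b, L, tol=1e-4):
    """Branch and bound on [a, b] for f with Lipschitz constant L.
    Subsets with lower bound >= record - tol are rejected; the most
    promising subset (smallest lower bound) is branched first."""
    fa, fb = f(a), f(b)
    record = min(fa, fb)                            # f_k*, the best value so far
    heap = [((fa + fb) / 2 - L * (b - a) / 2, a, b, fa, fb)]
    while heap and heap[0][0] < record - tol:
        _, ai, bi, fai, fbi = heapq.heappop(heap)   # branch this subset
        m = (ai + bi) / 2
        fm = f(m)
        record = min(record, fm)
        for u, v, fu, fv in ((ai, m, fai, fm), (m, bi, fm, fbi)):
            bound = (fu + fv) / 2 - L * (v - u) / 2
            if bound < record - tol:                # otherwise the subset is rejected
                heapq.heappush(heap, (bound, u, v, fu, fv))
    return record                                   # within tol of the global minimum
```

On exit, every rejected subset had a lower bound of at least record − tol, so the returned record is within tol of f*; this is the asymptotic accuracy of the bounds at work.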
There are many variants of the technique under consideration. As noted, for example, the
majority of covering methods may be regarded as branch and bound procedures, see
Horst and Tuy (1987). The same is true for the one-dimensional partition algorithms
described in Section 2.4.1. Here we shall deal with other algorithms.
First let us follow McCormick (1983) and assume that X is a hyperrectangle

X = {x ∈ R^n: a ≤ x ≤ b}

and f is a factorable function, i.e. f is the last one in a collection of m functions
f^1, f^2, ..., f^m, which is called a factorization sequence and is built up as follows:

f^j(x_1,...,x_n) = x_j for each j = 1,...,n,    (2.3.14)

and for j > n one of the following holds:

f^j(x) = f^p(x) + f^q(x) for some p, q < j,    (2.3.15)

f^j(x) = f^p(x) f^q(x) for some p, q < j,    (2.3.16)

or

f^j(x) = φ(f^p(x)) for some p < j,    (2.3.17)

where φ belongs to a given class Φ of sufficiently simple functions φ: R → R (e.g.
φ(t) = t^a, φ(t) = e^t, φ(t) = sin t, etc.).
It is easy to verify that the above factorization is a natural way for representing
functions which are given in an explicit algebraic form.
Let the subsets X_i be hyperrectangles

X_i = {x ∈ X: a_i ≤ x ≤ b_i}.

A lower bound t_i for m_i will be computed by constructing convex lower bounding
functions t_i^j(x) and concave upper bounding functions u_i^j(x) with the property

t_i^j(x) ≤ f^j(x) ≤ u_i^j(x) for all x ∈ X_i    (2.3.18)

and computing

t_i = inf_{x∈X_i} t_i^m(x).
One may use different ways for constructing convex functions t_i^j(x) and concave
functions u_i^j(x) with the property (2.3.18).
The simplest ones, the interval arithmetic methods described later, use constant functions
t_i^j(x) = t_i^j and u_i^j(x) = u_i^j. The opposite approach is to find the best possible bounding
functions, i.e. taking t_i^j(x) as the convex lower envelope of f^j(x) on X_i and u_i^j(x)
as its concave upper envelope. For the functions (2.3.14) we have

t_i^j(x) = u_i^j(x) = f^j(x), x ∈ X_i.

If (2.3.15) holds, then one may take

t_i^j(x) = t_i^p(x) + t_i^q(x),    u_i^j(x) = u_i^p(x) + u_i^q(x).

Let (2.3.16) hold, and let L_i^p, L_i^q and U_i^p, U_i^q be lower and upper bounds for f^p(x) and f^q(x)
over X_i, respectively. If L_i^p ≤ 0, L_i^q ≤ 0 and U_i^p ≥ 0, U_i^q ≥ 0, then we may take

t_i^j(x) = max{ U_i^q t_i^p(x) + U_i^p t_i^q(x) − U_i^p U_i^q,  L_i^q u_i^p(x) + L_i^p u_i^q(x) − L_i^p L_i^q },

u_i^j(x) = min{ L_i^q t_i^p(x) + U_i^p u_i^q(x) − U_i^p L_i^q,  U_i^q u_i^p(x) + L_i^p t_i^q(x) − L_i^p U_i^q }.

(Analogous results are valid for the other cases.)
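At a fixed point, the product bounds above are the classical McCormick inequalities for a bilinear term; a small check of their validity, written for the point values u = f^p(x), v = f^q(x) rather than for the bounding functions themselves:

```python
def mccormick_product_bounds(u, v, Lp, Up, Lq, Uq):
    """McCormick bounds for the product u*v given box bounds
    Lp <= u <= Up and Lq <= v <= Uq (a textbook sketch of the
    inequalities behind the case (2.3.16))."""
    lower = max(Uq * u + Up * v - Up * Uq,   # from (u - Up)(v - Uq) >= 0
                Lq * u + Lp * v - Lp * Lq)   # from (u - Lp)(v - Lq) >= 0
    upper = min(Lq * u + Up * v - Up * Lq,   # from (u - Up)(v - Lq) <= 0
                Uq * u + Lp * v - Lp * Uq)   # from (u - Lp)(v - Uq) <= 0
    return lower, upper
```

Each bound is obtained by expanding one of the four sign-definite products of deviations from the box bounds, which is why they hold for every point of the box.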


In the case of (2.3.17), convex lower and concave upper bounding functions (L(t) and
U(t), respectively) for φ over the interval [L_i^p, U_i^p] are to be given. Let φ(t) attain its
minimum and maximum on the interval [L_i^p, U_i^p] at t_0 and t_1, respectively, and let mid
be the operator which selects the middle value. Then in the case of (2.3.17) we may take

t_i^j(x) = L(mid{ t_i^p(x), u_i^p(x), t_0 }),

u_i^j(x) = U(mid{ t_i^p(x), u_i^p(x), t_1 }).



It has been mentioned that constant functions t_i^j(x) and u_i^j(x) satisfying (2.3.18) generate
the so-called interval methods. Let us consider these methods, which are of considerable
theoretical (and of increasing practical) importance.
Interval methods are aimed at finding the global extremum of a twice differentiable
rational objective function f defined on a hyperrectangle X and having a gradient ∇f and
a Hessian ∇²f with only a finite number of (isolated) zeros. Their essence is in the evaluation
of the images f(Z), ∇f(Z), and ∇²f(Z) for hyperrectangles Z ⊂ X with the purpose of
excluding those which cannot contain extremal points. Their main drawback seems to be
the relatively restricted class of optimization problems which can be solved by these
methods, and their substantial computational demand.
Let us introduce some notions which are necessary for describing interval methods.
Let Z_i ⊂ X, i = 1, 2, be intervals Z_i = [a_i, b_i]. We shall call them interval variables and define
the interval arithmetic operations by

Z_1 + Z_2 = [a_1 + a_2, b_1 + b_2],    Z_1 − Z_2 = [a_1 − b_2, b_1 − a_2],

Z_1 Z_2 = [min(a_1 a_2, a_1 b_2, b_1 a_2, b_1 b_2), max(a_1 a_2, a_1 b_2, b_1 a_2, b_1 b_2)],

Z_1 / Z_2 = Z_1 [1/b_2, 1/a_2] if 0 ∉ Z_2.

Using these formulas, one may evaluate the interval extension of a rational function f, i.e.
the image

f(Z) = {y: y = f(x), x ∈ Z},

which is an interval for an interval argument Z.
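These operations are directly implementable; a minimal interval class (outward rounding, essential in rigorous implementations, is omitted here for brevity):

```python
class Interval:
    """Closed interval [lo, hi] with the arithmetic operations defined above."""
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
    def __add__(self, z):
        return Interval(self.lo + z.lo, self.hi + z.hi)
    def __sub__(self, z):
        return Interval(self.lo - z.hi, self.hi - z.lo)
    def __mul__(self, z):
        p = (self.lo * z.lo, self.lo * z.hi, self.hi * z.lo, self.hi * z.hi)
        return Interval(min(p), max(p))
    def __truediv__(self, z):
        assert not (z.lo <= 0.0 <= z.hi), "division requires 0 not in Z2"
        return self * Interval(1.0 / z.hi, 1.0 / z.lo)
```

Note that the natural interval extension may overestimate the true range: for Z = [0,2] and f(x) = x·x − x it yields [0,4] − [0,2] = [−2,4], while the exact range of x² − x on [0,2] is [−1/4, 2]; this overestimation is what makes interval bounds cheap but sometimes loose.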


If f is a multidimensional function, then the interval values f(Z) may be analogously
defined for a multidimensional interval (i.e. hyperrectangle) Z. Algorithms for evaluating
interval values are studied e.g. in Moore (1966) or Ratschek and Rokne (1984).
Returning to global optimization via interval methods, for simplicity consider the one-dimensional
case. The first step of an interval method is the evaluation of the objective
function at one of the interval endpoints. Let k ≥ 1 steps be already done; let f_k* be a k-step
upper bound for f* (e.g. f_k* is the minimal value of f obtained so far), and let X_k be a subset
of X that surely contains an optimal point, consisting of a union of intervals. The next
step of the method is the following: choose a subinterval Z of X_k, divide it into
two parts Z_1, Z_2, and calculate the interval values f(Z_1), f(Z_2). If f_k* < min f(Z_i), then the
subinterval Z_i is excluded from the set X_k, since the above relation guarantees that the
objective function cannot attain its global minimum on the subinterval Z_i. The
subintervals Z_i (i = 1, 2) may also be excluded from X_k if 0 ∉ f'(Z_i) or f''(Z_i) ⊂ (−∞, 0),
as these imply, respectively, that Z_i does not contain a stationary point of f or that f is
concave on Z_i. If the subintervals Z_1 and Z_2 are not excluded from X_k, they should be
included into X_{k+1}, replacing Z. Note that to find a zero of f' on an interval Z_1 (or to verify the inclusion

0 ∈ f'(Z_1)) one may use the interval version of the Newton method. If 0 ∉ f''(Z_1), then
the interval Newton step has the form

N(Z_j) = z_j − f'(z_j)/f''(Z_j),    Z_{j+1} = Z_j ∩ N(Z_j),    j = 1, 2, ...

where z_j is the midpoint of Z_j. According to Hansen (1979), if 0 ∉ f'(Z_1), then the set Z_j
is empty for some j, and if 0 ∈ f'(Z_1), then the point sequence {z_j} converges to a
stationary point of f quadratically.
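A sketch of this interval Newton iteration for a function g playing the role of f' (the derivative enclosure dg is supplied by the caller; a full implementation would compute it by the interval arithmetic above):

```python
def interval_newton_zero(g, dg, lo, hi, tol=1e-12):
    """Contract [lo, hi] towards a zero of g by the interval Newton step
    N(Z) = z - g(z)/g'(Z), followed by Z <- Z intersect N(Z).
    dg(lo, hi) must return an enclosure (dlo, dhi) of g' over [lo, hi]
    with 0 outside it.  Returns None if the intersection becomes empty,
    which proves that Z contains no zero of g."""
    while hi - lo > tol:
        dlo, dhi = dg(lo, hi)
        assert dlo > 0 or dhi < 0          # 0 must not lie in g'(Z)
        z = (lo + hi) / 2
        a, b = z - g(z) / dlo, z - g(z) / dhi
        lo, hi = max(lo, min(a, b)), min(hi, max(a, b))
        if lo > hi:
            return None                    # empty intersection: no zero in Z
    return (lo + hi) / 2
```

Because N(Z) always contains every zero of g lying in Z, the intersection either shrinks onto the zero (quadratically once the interval is small) or becomes empty, exactly as stated in Hansen's result above.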
In the multidimensional case, the interval methods have the same form, but their
practical realization is complicated because of the necessity to store, choose and divide a
great number of subrectangles of X.
A detailed description of interval global optimization methods may be found in Hansen
(1979, 1980, 1984), Ratschek (1985). Mancini and McCormick (1976) described interval
methods for minimizing a convex function. Shen and Zhu (1987) suggest an interval
version of the one-dimensional Piyavskii and Shubert algorithm (2.2.30). On the whole,
interval methods represent a set of promising global optimization approaches but the class
of extremal problems which may be efficiently solved by them is naturally restricted by
their analytical requirements.
Another class of problems in which branch and bound methods are used
advantageously consists of concave minimization problems under convex constraints (for
details and references see Pardalos and Rosen (1986, 1987)). Following Rosen (1983),
consider the special case where

f(x) = η'x − (1/2) x'Qx    (2.3.19)

(here η is a vector and Q is a positive semidefinite matrix) and X is a polytope defined by
linear inequality constraints,

X = {x ∈ R^n: Ax ≤ b}.    (2.3.20)

Lower and upper bounds of f* are needed. To compute an upper bound, at the beginning
f is maximized over X. This gives a point x_0 ∈ X. Then the n eigenvectors e_1,...,e_n of
the Hessian at x_0 are determined. To move as far away as possible from x_0, one solves 2n
linear programming problems and finds the vectors

v_i = arg max_{x∈X} w_i' x

where w_i = e_i, w_{n+i} = −e_i for i = 1,...,n. An upper bound for f* is U = min{f(v_1),...,
f(v_{2n})}.

The vectors v_i define the halfspaces

{x ∈ R^n: w_i'(x − v_i) ≤ 0}, i = 1,...,2n,

whose intersection is a hyperrectangle X_0 containing X. A lower bound L for f* is the
minimum of f over X_0, which is attained at one of the 2^n vertices of X_0.
We also construct a hyperrectangle inscribed into the ellipsoid {x ∈ R^n: f(x) ≥ U}, of the form

∩_{i=1}^{2n} {x ∈ R^n: w_i'(x − x_0) ≤ d_i},

where the constants d_i can easily be computed. Now, it can be seen that x* cannot be
contained in the interior of this hyperrectangle, and the intersection of its exterior with X
defines an appropriate family of subsets in which x* is looked for.
Note that the branch and bound technique was also used for some more general
problems than (2.3.19) - (2.3.20), including the case in which f is the difference of two
convex functions and X is a convex set.
Let us also note that Beale and Forrest (1978) applied the branch and bound technique
to minimizing one-dimensional functions of the type

f(x) = Σ_{i=1}^m f_i(x)

where the f_i (i = 1,...,m) are twice continuously differentiable, the values f_i(x), f_i'(x), f_i''(x)
can be calculated for any point x ∈ X, and the set X can be a priori divided into subsets
where all second derivatives are monotone.
Finally it should be pointed out that Chapter 4 presents a generalized branch and
bound principle for the case when estimators for the lower bounds (2.3.13) are valid only
with a large probability: this generalization will be called the branch and probability bound
principle.
2.4 An approach based on stochastic and axiomatic models of the
objective function

In the previous sections, many deterministic models of the objective function were
considered. Other classes of models are also used for the description of multiextremal
functions and the construction of global optimization algorithms: the best known of these is
the class of stochastic models, which uses a set of realizations of a random function as ℱ.

2.4.1 Stochastic models

Let φ(x,ω) be a random function, where x ∈ X and ω is an element of a probability space
Ω; the prior information about f consists in that f ∈ ℱ = {φ(·,ω): ω ∈ Ω}. In other words,
f is supposed to be a realization of a random function φ for some random element ω:
f(x) = φ(x,ω) for all x ∈ X and some ω ∈ Ω. Such models are not evident in advance, but
sometimes they are convenient from the mathematical point of view and can be justified
with the help of the axiomatic approach reviewed below.
Frequently, classes of Gaussian random functions are considered, i.e. random
functions φ(x) such that for each k ≥ 1 and a collection Ξ_k = {x_1,...,x_k} of points in X the
random vector φ(Ξ_k) = (φ(x_1),...,φ(x_k))' has a joint Gaussian distribution with density

p(u, μ, V) = ((det V)^{1/2} / (2π)^{k/2}) exp{ −(1/2)(u − μ)' V (u − μ) }

where

u ∈ R^k,  μ = (μ(x_1),...,μ(x_k))',  μ(x) = Eφ(x),

R = ||η(x_i, x_j)||_{i,j=1}^k,  η(x,z) = E(φ(x) − μ(x))(φ(z) − μ(z)),  V = R^{−1}.

The class of realizations of the classical Wiener process determines the most popular
stochastic model of one-dimensional functions f(x), x ∈ X = [0,1]. It is characterized by the
functions

μ(x) = μ = const,  η(x,z) = σ² min(x,z)    (2.4.1)

where σ² is a constant. In this case φ(0) = μ and

φ(x) − φ(z) ~ N(0, σ²|x − z|).



In the case of Gaussian random functions, the marginal distributions conditioned on any
number of calculated values of f are still Gaussian and can be computed in the following
way. Let Y_1 = (y_1,...,y_k)' be the vector of values y_i = f(x_i) (i = 1,...,k) and
Y_2 = (φ(z_1),...,φ(z_m))' be the Gaussian random vector of unknown values of φ at points
z_1,...,z_m in X, conditioned on the evaluations Y_1. Set

E(Y_1', Y_2')' = (μ_1', μ_2')',  cov((Y_1', Y_2')') = R = ( R_11  R_12 ; R_21  R_22 ),

where μ_1 and R_11 are of order k and k×k, and μ_2 and R_22 of order m and m×m. Then

E(Y_2|Y_1) = μ_2 + R_21 R_11^{−1} (Y_1 − μ_1),    (2.4.2)

cov(Y_2|Y_1) = R_22 − R_21 R_11^{−1} R_12.    (2.4.3)

Formulas (2.4.2) and (2.4.3) are usually applied in the case m = 1. For some particular
cases of the covariance function η(x,z), they are not very complicated. For example, if m = 1,
μ(x) = 0, and

η(x,z) = 1/(1 + ||x − z||²) for x, z ∈ X,    (2.4.4)

then (2.4.2) can be simplified to

Σ_{i=1}^k α_i /(1 + ||z_1 − x_i||²)    (2.4.5)

where α_1,...,α_k are appropriate constants.
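Formulas (2.4.2) and (2.4.3) are a few lines of linear algebra; the sketch below conditions a Wiener-process model (covariance σ² min(x,z), μ = 0, σ² = 1) on one observation, a case for which the answers are known in closed form:

```python
import numpy as np

def gaussian_conditional(mu1, mu2, R11, R12, R21, R22, y1):
    """Conditional mean and covariance of Y2 given Y1 = y1, per
    (2.4.2)-(2.4.3): E(Y2|Y1) = mu2 + R21 R11^{-1} (y1 - mu1),
    cov(Y2|Y1) = R22 - R21 R11^{-1} R12."""
    K = R21 @ np.linalg.inv(R11)
    return mu2 + K @ (y1 - mu1), R22 - K @ R12
```

For the Wiener process observed at x = 1 with φ(1) = y, the conditional distribution of φ(1/2) is Gaussian with mean y/2 and variance 1/4, which the code reproduces.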

2.4.2 Global optimization methods based on stochastic models

Let us use the notations of the beginning of Section 2.2.3, but apply the Bayesian
(statistical) approach for defining the accuracy of a method, replacing the minimax
approach based on (2.2.35).
The accuracy of an N-point method d_N = (d_1,...,d_{N+1}) can be defined in various
statistically meaningful ways. For instance, the algorithm defined by

arg min_{d_N ∈ D(N)} E(φ(x_{N+1}))    (2.4.6)

is called optimal with respect to the expected value (E-optimal), and the algorithm

arg max_{d_N ∈ D(N)} Pr{ φ(x_{N+1}) − min φ ≤ ε }    (2.4.7)

is called ε-optimal in probability (or P-optimal). The computational difficulties of finding
these algorithms are tremendous, see Archetti and Betro (1980), Mockus (1988), and so they
are practically intractable. Therefore the concept of one-step optimality is often used instead
of optimality with respect to all points of a method. In a one-step optimal algorithm each
point is selected as if it were the last one; below we shall present two such methods, which
are respective modifications of the above E- and P-optimal algorithms.
Let k evaluations of the objective function f at the points x_1,...,x_k be performed, y_i = f(x_i),
i = 1,...,k, and let

f_k* = min{y_1,...,y_k}

be the optimum estimate. Then the one-step E-optimal algorithm is defined by

x_{k+1} = arg min_{x∈X} E( min{φ(x), f_k*} | I_k )    (2.4.8)

where I_k denotes the conditions φ(x_i) = y_i for i = 1,...,k. Furthermore, the one-step
analogue of the P-optimal algorithm (2.4.7) is

x_{k+1} = arg max_{x∈X} Pr{ φ(x) < f_k* − ε_k | I_k }    (2.4.9)

where {ε_k} is a suitably chosen sequence of positive numbers.


In the case when φ is a Gaussian random function, the calculations in (2.4.8) and
(2.4.9) can be performed applying (2.4.2) and (2.4.3). They are computationally tractable
only for some special classes ℱ: the most well-known is the case of the one-dimensional
Wiener process model considered in Section 2.4.3. If X is multidimensional, then it is
really hard to find a reasonable stochastic model that would not lead to a tremendous
amount of computation. The situation is still worse if one takes into account the fact that a
reasonable model usually contains unknown parameters. The multidimensional Wiener
process is already not suitable from the numerical point of view.
An example of another kind is (2.4.4), in which, due to (2.4.5), the points (2.4.8) can be
found by calculating all roots of a set of polynomial equations (of course, the adequacy of
the model (2.4.4) for a given objective function is usually rather questionable).

2.4.3 The Wiener process case

Let X = [a,b] and let φ(x,ω) be the Wiener process with mean and covariance function defined
by (2.4.1), where μ and σ² may be unknown. It is an acceptable model for the global
behavior of a complicated one-dimensional function and leads to not very cumbersome
calculations.

If σ² is unknown, then every algorithm has to start with its estimation. To this end, it
is usually recommended to evaluate f at m equidistant points x_i = a + (b−a)(i−1)/(m−1),
1 ≤ i ≤ m, and to estimate σ² by the maximum likelihood estimator

σ̂² = (1/(m−1)) Σ_{i=1}^{m−1} (y_{i+1} − y_i)² / (x_{i+1} − x_i),  y_i = f(x_i).

Of course, it can be profitable to reestimate σ² during the search.


Let now σ² be known and k ≥ m evaluations y_i = f(x_i) of f be performed at the ordered
points a = x_1 < ... < x_k = b. Then the conditional mean μ(x|L_k) and variance σ²(x|L_k) are
computed through

μ(x|L_k) = y_i + (y_{i+1} − y_i)(x − x_i)/(x_{i+1} − x_i),    (2.4.10)

σ²(x|L_k) = σ²(x − x_i)(x_{i+1} − x)/(x_{i+1} − x_i)    (2.4.11)
for each x ∈ Δ_i = [x_i, x_{i+1}], i = 1,...,k−1. Moreover, the expectation of the minimum value

φ_i = min_{x∈Δ_i} φ(x)

in the interval Δ_i conditioned on L_k is

E(φ_i | L_k) = min(y_i, y_{i+1}) − (σ/2)(2π(x_{i+1} − x_i))^{1/2} ×
exp{ (y_{i+1} − y_i)² / (2σ²(x_{i+1} − x_i)) } ∫_{−∞}^{−|y_{i+1} − y_i|} p(t, 0, σ²(x_{i+1} − x_i)) dt    (2.4.12)

where

p(t, a, σ²) = (2πσ²)^{−1/2} exp{ −(t − a)²/(2σ²) }

is the Gaussian density. By (2.4.12) one can compute the posterior mean of φ_i for each
interval Δ_i and select the interval

Δ_j = arg min_{Δ_i} E(φ_i | L_k).

The next point x_{k+1} can be chosen in the interval Δ_j in different ways. The simplest is
x_{k+1} = (x_j + x_{j+1})/2, i.e. x_{k+1} is the centre of Δ_j. If we want to confine ourselves to
Bayesian techniques, then it is natural to select x_{k+1} as the expected location of the
minimum φ_j, but the corresponding formulas are rather complicated. (Note that this
approach was followed by Boender (1984), where favourable numerical test results are
also given.)
The one-step P-optimal algorithms (2.4.9) are determined in an easier way:

j = arg max_{1≤i≤k−1} R(i),    (2.4.13)

where

R(i) = [ (f_k* − ε_k)(x_{i+1} − x_i) − y_{i+1}(a_i − x_i) − y_i(x_{i+1} − a_i) ] /
[ σ((a_i − x_i)(x_{i+1} − a_i)(x_{i+1} − x_i))^{1/2} ],

a_i = (x_i + x_{i+1})/2 + (x_{i+1} − x_i)(y_{i+1} − y_i) / ( 2[ y_{i+1} + y_i − 2(f_k* − ε_k) ] ).

The efficiency of algorithm (2.4.13) depends to a considerable degree on the choice of £k.
Zilinskas (1981) proposed to choose

As a stopping rule for the above algorithms, one may choose the following: reject the
subintervals Δ_i = [x_i, x_{i+1}] for which the probability of finding a function value in Δ_i better than
the current optimum estimate f_k*, i.e.

Pr{ min_{x∈Δ_i} φ(x) < f_k* | L_k } = exp{ −2(f_k* − y_i)(f_k* − y_{i+1}) / (σ²(x_{i+1} − x_i)) },    (2.4.14)

is sufficiently small (not greater than a given number ε_0 > 0), and terminate the algorithm if
all subintervals except the one corresponding to f_k* are rejected. Note that if this stopping
rule is applied, then the algorithm can also be regarded as belonging to the class of branch and
probability bound methods described in Section 4.3. Besides, the algorithms of this
subsection can be incorporated into the general scheme of one-dimensional global
optimization algorithms represented in the form of Algorithm 2.3.1.
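The quantities in (2.4.14) also yield a simple interval-selection rule in the spirit of the one-step P-optimal algorithm: pick the subinterval maximizing the Wiener-model probability of beating the level f_k* − ε_k. A sketch (σ² = 1 by default, and the Brownian-bridge crossing probability is applied per subinterval):

```python
import math

def select_interval(xs, ys, f_rec, eps, sigma2=1.0):
    """Among the subintervals [x_i, x_{i+1}], return the index maximizing the
    Wiener-model probability (cf. (2.4.14)) that the conditional minimum
    falls below the level f_rec - eps, together with that probability."""
    level = f_rec - eps
    best, best_p = None, -1.0
    for i in range(len(xs) - 1):
        if level >= min(ys[i], ys[i + 1]):
            p = 1.0                       # the level is already reached at an endpoint
        else:
            p = math.exp(-2.0 * (ys[i] - level) * (ys[i + 1] - level)
                         / (sigma2 * (xs[i + 1] - xs[i])))
        if p > best_p:
            best, best_p = i, p
    return best, best_p
```

The rule naturally favours wide subintervals with low endpoint values, the same trade-off that the criterion R(i) of (2.4.13) encodes.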

Further information about the construction and investigation of methods based on the use of
stochastic models may be found in Kushner (1964), Archetti and Betro (1979, 1980),
Mockus (1988), Mockus et al. (1978), Zilinskas (1978, 1981, 1982, 1984, 1986),
Boender (1984).

2.4.4 Axiomatic approach

The stochastic function models described above can be viewed as special cases of a more
general axiomatic approach. According to this, the uncertainty about the values f(x) for
x ∈ X\Ξ_k is assumed to be representable by a binary relation ≥_x, where (t,t') ≥_x (τ,τ')
symbolizes that the event {f(x) ∈ (t,t')} is at least as likely as the event {f(x) ∈ (τ,τ')}.
Under some reasonable assumptions on this binary relation (e.g., transitivity and
completeness), there exists a unique density function p_x that satisfies the following
condition: for every pair (A, A') of countable unions of intervals one has A ≥_x A' if and
only if

∫_A p_x(t) dt ≥ ∫_{A'} p_x(t) dt.

For the special case when all densities are Gaussian and hence are characterized by their
means f.l(x) and covariances cr 2 (x) one can suppose that the preference relation .2.x is
defined on the set of estimators of f.l(x) and cr2(x).
Subject again to some reasonable assumptions about this preference, the result is that
the unique rational choice for the next point of evaluation of f is the one for which the
probability of finding a function value smaller than f_k* − ε_k is maximal. (This result
justifies the one-step P-optimal algorithms (2.4.9).) In the case of the one-dimensional
Wiener process, (2.4.9) together with (2.4.10) and (2.4.11) leads to (2.4.13). In the case
of higher dimension, the analogues of (2.4.10) and (2.4.11) are not valid, but some
approximations for μ(x|L_k) and σ²(x|L_k) can be axiomatically justified, e.g.

    μ_k(x|L_k) = Σ_{i=1}^k y_i w_i(x, L_k),

    σ_k²(x|L_k) = c_k Σ_{i=1}^k ||x − x_i|| w_i(x, L_k),

where c_k is a normalizing constant and the weights w_i(x,L_k) have some natural
properties, see Zilinskas (1982, 1986).
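A one-dimensional sketch of these approximations; the normalized inverse-distance weights used here are a hypothetical choice satisfying the natural properties (nonnegativity, summing to one, w_i → 1 as x → x_i), not the specific weights of Zilinskas:

```python
def weights(x, xs, p=2.0):
    """Hypothetical weight choice: normalized inverse-distance weights.
    At an observation point x_i all weight concentrates on index i."""
    d = [abs(x - xi) for xi in xs]
    if any(di == 0.0 for di in d):
        return [1.0 if di == 0.0 else 0.0 for di in d]
    w = [di ** (-p) for di in d]
    s = sum(w)
    return [wi / s for wi in w]

def mu_sigma2(x, xs, ys, ck=1.0):
    """Weighted approximations mu_k(x|L_k) and sigma_k^2(x|L_k) as above."""
    w = weights(x, xs)
    mu = sum(yi * wi for yi, wi in zip(ys, w))
    sigma2 = ck * sum(abs(x - xi) * wi for xi, wi in zip(xs, w))
    return mu, sigma2
```

At the evaluated points the model interpolates the data and its variance vanishes, mirroring the behaviour of the exact Wiener-process posterior.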

2.4.5 Information-statistical approach

The information-statistical approach is similar to the Bayesian one described above and
was mainly developed by Strongin (1978). Its essence is the following: the feasible region
X is discretized, i.e. a finite point collection Ξ_N = {x_1, ..., x_N} is substituted for X, and the
N-vector F = (f(x_1), ..., f(x_N)) approximates the objective function f. Thus ℝ^N is
substituted for the functional class 𝓕. Setting prior information about f consists in setting up
a prior probability density φ(F) on ℝ^N, which is successively transformed into a
posterior density after each evaluation of f. The points at which to evaluate f can be determined,
for instance, as the maximum likelihood estimators of x*. This idea leads to extremely
cumbersome algorithms which are practically unmanageable in the multidimensional
case. In the one-dimensional case, however, a slight modification of this idea led Strongin
(1978) to the construction of the algorithm (2.3.1) - (2.3.4).
PART 2. GLOBAL RANDOM SEARCH

CHAPTER 3. MAIN CONCEPTS AND APPROACHES OF GLOBAL RANDOM SEARCH

The present chapter contains three sections. Section 3.1 describes and studies the simplest
global random search algorithms, outlines ways of constructing more efficient
algorithms, presents a general scheme of global random search algorithms, and discusses
the connection between local optimization and global random search. Section 3.2 proves
some general results on convergence. Section 3.3 is devoted to Markovian algorithms,
which have been investigated theoretically in detail in the literature.

3.1 Construction of global random search algorithms


Basic approaches
This section can be regarded as an introduction to the methodology of global random
search. It describes a few simple global random search algorithms and ways of increasing
their efficiency. The simplest algorithm is considered first.

3.1.1 Uniform random sampling


According to the general concept of global optimization, any global optimization algorithm
has to search the whole feasible region X in some way or another. The simplest of these
ways is uniform sampling in X, which can be accomplished in both a deterministic
(described in Section 2.2.1) and a stochastic fashion. The simplest stochastic (global
random search) method consists in choosing the points at which f is evaluated randomly,
independently and uniformly in X.

Algorithm 3.1.1. (Uniform random search in X).

1. Set k = 1, f_0* = ∞.

2. Obtain a point x_k by sampling from the uniform distribution on X.
3. Evaluate f(x_k) and set

    f_k* = min{ f_{k−1}*, f(x_k) }.        (3.1.1)

4. If k = N, then terminate the algorithm: choose the point x_k* with f(x_k*) = f_k* as an
approximation for x* = arg min f. If k < N, then return to Step 2 (substituting k+1 for k).
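A minimal implementation of Algorithm 3.1.1 for a box-shaped X; the box shape and the test function below are illustrative assumptions, since the algorithm itself only requires the ability to sample uniformly on X:

```python
import random

def uniform_random_search(f, lo, hi, N, seed=0):
    """Algorithm 3.1.1 (pure random search) on a box X = [lo, hi]^n.
    Returns the record value f_N* and the record point."""
    rng = random.Random(seed)
    n = len(lo)
    f_best, x_best = float("inf"), None
    for _ in range(N):
        x = [rng.uniform(lo[i], hi[i]) for i in range(n)]  # uniform on X
        fx = f(x)
        if fx < f_best:                                    # update rule (3.1.1)
            f_best, x_best = fx, x
    return f_best, x_best

# example: minimize a simple quadratic on [0,1]^2 (global minimizer (0.3, 0.7))
f = lambda x: (x[0] - 0.3) ** 2 + (x[1] - 0.7) ** 2
fbest, xbest = uniform_random_search(f, [0.0, 0.0], [1.0, 1.0], N=2000)
```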
Algorithm 3.1.1 goes under several different names, viz. crude search, pure random search,
random bombardment, the Monte Carlo method, etc. It utilizes the simplest stopping rule,
which terminates the algorithm after a given number N of evaluations of f. Chapter 4
describes and studies a mathematical statistical apparatus that can be used as a basis of
various stopping rules, for Algorithm 3.1.1 and many others, according to which an
algorithm terminates on attaining a given accuracy.
When using Algorithm 3.1.1 in practice, it is usually profitable to descend locally
from one or several points with the lowest function values obtained (this is true for almost
all global random search algorithms, as will be discussed later in Section 3.1.5).
The simplicity of Algorithm 3.1.1 makes possible the direct investigation of its
theoretical properties, considered below.
Let x* = arg min f be a global minimizer of f. For the set B(ε) = B(x*, ε) we have

    μ_n(B(ε)) ≤ π^{n/2} ε^n / Γ(n/2 + 1),        (3.1.2)

where μ_n is the Lebesgue measure; (3.1.2) becomes an equality if {x ∈ ℝ^n: ||x − x*|| ≤ ε} ⊂ X,
i.e. if the distance from x* to the boundary of X is not less than ε. Using
(3.1.2) we obtain for all ε > 0, k = 1, 2, ...

    Pr{ min_{1≤i≤k} ||x_i − x*|| ≤ ε } = 1 − (1 − μ_n(B(ε))/μ_n(X))^k        (3.1.3)

        ≤ 1 − [1 − π^{n/2} ε^n / (μ_n(X) Γ(n/2 + 1))]^k → 1,  k → ∞.        (3.1.4)

The relation obtained shows that the sequence

    min_{1≤i≤k} ||x_i − x*||

converges in probability to zero for k → ∞. Moreover, (3.1.4) provides an estimate of the
convergence rate. An estimate of the expected number of steps before hitting
the set B(ε) is easily obtained from (3.1.3):

    E τ_{B(ε)} = μ_n(X) / μ_n(B(ε)),        (3.1.5)

where τ_A is the moment of the first hit of the search sequence x_1, x_2, ... into a set A ⊂ X.
These formulas estimate the rate of convergence of Algorithm 3.1.1 with respect to values
of the argument. The rate of convergence with respect to function values is estimated
analogously:

    Pr{ x_k* ∈ A(δ) } = 1 − (1 − μ_n(A(δ))/μ_n(X))^k → 1,  k → ∞.        (3.1.7)

Note that to simplify calculations in (3.1.4) and (3.1.6) one can use the approximation
(1−p)^k ≈ e^{−kp} (valid for p ≈ 0).
Although Algorithm 3.1.1 converges in various senses, this convergence is slow and
depends greatly on the dimension n of the set X. Let us calculate how many evaluations N
of f one has to perform in order to reach a probability not less than 1−γ (where γ > 0 is a
small number) of hitting B(ε). Supposing that equality holds in (3.1.2) (which certainly
holds for x* ∈ int X and sufficiently small ε), compare the right-hand side of (3.1.4) with
1−γ and solve the obtained equation with respect to k = N. We get

    N = ⌈ log γ / log(1 − π^{n/2} ε^n / (μ_n(X) Γ(n/2 + 1))) ⌉.        (3.1.8)

Let us take μ_n(X) = 1, γ = 0.1, ε = 0.1 and consider in Table 2 the dependence N = N(n).

Table 2. The dependence N = N(n)

n    1    2    3     4      5       6        8        10      20       100

N    11   73   549   4666   43744   4.5×10^5  5.7×10^7  9×10^9  9×10^21  10^140
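The first entries of Table 2 can be reproduced numerically; a sketch, assuming equality in (3.1.2), i.e. a per-evaluation hitting probability π^{n/2} ε^n / Γ(n/2 + 1) (the entries for larger n agree with the table up to the book's rounding):

```python
import math

def evaluations_needed(n, eps=0.1, gamma=0.1, vol_X=1.0):
    """Number N of uniform-random evaluations so that B(eps) is hit with
    probability at least 1-gamma, assuming the ball around x* lies inside X."""
    p = math.pi ** (n / 2) * eps ** n / (vol_X * math.gamma(n / 2 + 1))
    return math.ceil(math.log(gamma) / math.log(1.0 - p))

print([evaluations_needed(n) for n in (1, 2, 3)])  # → [11, 73, 549]
```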

Some related recent results of Deheuvels (1983) and Janson (1986, 1987) concerning
multivariate maximal spacings are of interest in the context of global random search
theory. We present the principal results below.
Let μ_n(X) = 1, let A be a cube or a ball in ℝ^n of volume μ_n(A) = 1, and let Ξ_N = {x_1, ..., x_N} be an
independent sample from the uniform distribution on X. Set Δ_N = sup{δ: there exists
x ∈ ℝ^n such that x + δA ⊂ X\Ξ_N} and define the maximal spacing as V_N = (Δ_N)^n, i.e. as
the volume of the largest ball (or cube of a fixed orientation) that is contained in X and
avoids all N points of Ξ_N. The result of Janson (1987) states that

    lim_{N→∞} (N V_N − log N) / log log N = n − 1        (3.1.9)

almost surely, and for N → ∞ the sequence

    N V_N − log N − (n − 1) log log N + β

converges in distribution to a random variable with c.d.f. exp{−e^{−t}}, where β = 0 if A is a
cube and

    β = log Γ(n + 1) − (n − 1) log[ √π Γ(n/2 + 1) / Γ((n + 1)/2) ]

when A is a ball. Since the latter value β ≥ 0, spherical spacings are somewhat
smaller (for n ≥ 3) than the cubical ones.
A related result on multivariate maximal spacings in the particular case
X = A = [0,1]×[0,1] was presented by Isaac (1988). With respect to the asymptotic study of
the c.d.f. F_{N,n}(t) = Pr{V_N < t} of maximal cubic spacings, it states that

    lim_{N→∞} (1 − F_{N,n}(t)) / (N^n (1 − t)^N) = c_n(t)        (3.1.10)

holds for n = 2 and each t ∈ (0,1), where c_n(t) is a constant depending on n and t. It is
widely known that for the univariate case (i.e. for X = A = [0,1]) the relation (3.1.10) holds
with c_n(t) = 1.
Some further investigation of the properties of the uniform random search algorithm
can be found in Anderson and Bloomfield (1975) and Yakowitz and Fisher (1975).
The uniform random search algorithm finds major application in global random
search theory as a benchmark in theoretical and numerical comparison of algorithms and also
as a component of many global random search algorithms. It is also used for investigating
diverse procedures of mathematical statistics.
The slow convergence rate of Algorithm 3.1.1 has motivated a great number of
generalizations and modifications, discussed below.

3.1.2 General (nonuniform) random sampling

Random independent sampling of points in X with some given nonuniform distribution
is the simplest generalization of Algorithm 3.1.1. It is used in cases when (i) information
concerning the objective function is available (perhaps obtained through some previous
sampling) making it preferable to favour some subsets of X over others, or when (ii) the
sampling problem for the uniform distribution on X is hard or practically unsolvable. The
algorithm is as follows.

Algorithm 3.1.2. (General (nonuniform) random sampling of points in X).

1. Choose a probability distribution P on X, set k = 1, f_0* = ∞.

2. Obtain a point x_k by sampling from P.
3. Evaluate f(x_k) and determine f_k* (see (3.1.1)).
4. If k = N, then terminate the algorithm: choose the point x_k* with f(x_k*) = f_k* as an
approximation for x*. If k < N, then return to Step 2 (substituting k+1 for k).
In the case of the uniform distribution, P(dx) = dx/μ_n(X); hence Algorithm 3.1.1 is a
special case of Algorithm 3.1.2.
For the convergence of Algorithm 3.1.2, the continuity of f in a vicinity of a global
minimizer x* and the validity of the inequality P(B(ε)) > 0 for each ε > 0 are sufficient. In
this case the convergence statements concerning Algorithm 3.1.1 remain valid, with
corresponding modifications, for Algorithm 3.1.2. The analogues of (3.1.4) - (3.1.8) are

    Pr{ min_{1≤i≤k} ||x_i − x*|| ≤ ε } = 1 − (1 − P(B(ε)))^k → 1,  k → ∞,

    E τ_{B(ε)} = 1/P(B(ε)),

    Pr{ x_k* ∈ A(δ) } = 1 − (1 − P(A(δ)))^k → 1,  k → ∞,

    N = ⌈ log γ / log(1 − P(B(ε))) ⌉ ≈ −(log γ)/P(B(ε)).

The practical significance of Algorithm 3.1.2 is connected mainly with the fact that it may
be used as a component in more complicated algorithms, see Section 3.1.5 below. Besides,
if the independence condition for {x_k} and the identity of their distributions are weakened,
then more practical algorithms can be constructed: such methods will be described in
Sections 3.3.2 and 3.3.5.
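The identity E τ_{B(ε)} = 1/P(B(ε)) is easy to check by simulation; a sketch with an illustrative one-dimensional X = [0,1] and a target set of probability 0.1:

```python
import random

def hitting_time(sampler, in_target, rng, cap=10**6):
    """Number of independent draws from `sampler` until the target is first hit."""
    for k in range(1, cap + 1):
        if in_target(sampler(rng)):
            return k
    return cap

# P uniform on [0,1], target B = [0.45, 0.55], so P(B) = 0.1
rng = random.Random(1)
sampler = lambda r: r.random()
in_B = lambda x: 0.45 <= x <= 0.55
times = [hitting_time(sampler, in_B, rng) for _ in range(4000)]
mean_tau = sum(times) / len(times)   # should be close to 1/P(B) = 10
```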

3.1.3 Ways of improving the efficiency of random sampling algorithms

The low efficiency of Algorithms 3.1.1 and 3.1.2 is caused largely by their passive
character: viz., they do not use the previously obtained information when selecting new
points. The ways of improving their efficiency are connected with various modes of
evaluation and use of the information obtained during the search and/or given a priori.
The corresponding global random search algorithms are sometimes called sequential or
adaptive, in order to contrast them with the passive ones.

A simple idea, close to that discussed in Section 2.2, is to cover X by balls with centres
at points generated at random. It can be properly realized if, for example, Lipschitzian
information about the objective function is available. Section 3.1.4 describes the
corresponding algorithms.
A simple manner of including adaptive elements into the global random search
technique consists of determining the distribution of x_{k+1} as depending on the previous
point x_k and the objective function value f(x_k). The corresponding algorithms are called
Markovian (since the points x_1, x_2, ... generated by them form a Markov chain) and will
be studied in Section 3.3. Their theoretical properties are intensively studied nowadays,
but the prospects of their practical usefulness are still not quite clear.
An important way of improving efficiency in global random search algorithms is
connected with the inclusion of local descent techniques, see Section 3.1.6 below. A simple
algorithm of this kind is the well-known random multistart, consisting of multiple local
descents from uniformly chosen random points. It will be studied theoretically in Section
4.5. Its theoretical efficiency is rather low, but some of its heuristic modifications
described in Sections 2.1.3 and 3.1.6 can be regarded as fairly efficient for complicated
optimization problems.
Another important means of constructing efficient global random search algorithms
consists of using mathematical statistics procedures for deriving information about the
objective function. Such information can serve, in particular, to check the obtained
accuracy for many algorithms and to determine corresponding stopping rules. Chapters 4
and 7 are devoted to various problems connected with the construction of statistical
inference procedures and their application in global random search.
A further direction for improving global random search efficiency is to reduce the share
of randomness. For instance, Section 2.2.1 shows that the method consisting of
evaluating f at quasirandom points is more efficient than Algorithm 3.1.1. However,
nonrandom points are in some sense worse than random ones, for the following
reasons: (i) generally, statistical inferences for deriving information about f cannot be
drawn if the points are not random; (ii) if the structure of X is not simple enough, then
the problem of constructing nonrandom grids is usually much more complicated than that
for random grids.
A stratified sample may be regarded as intermediate between random and
quasirandom ones. To construct such a sample of size N = mℓ, the set X is divided into m
subsets of equal volume and the uniform distribution is sampled ℓ times in each subset.
Already Brooks (1958) pointed out some advantages of stratified sampling as a substitute
for pure random search, but only recently were the gains caused by this substitution
correctly investigated. (Section 4.4 contains these results as well as suitable statistical
inferences.)
Many global random search algorithms are based on the idea of more frequent
selection of new points in the vicinity of good points, i.e. earlier obtained points at
which the values of f are relatively small. The corresponding methods (which go under the
name of methods of generations) will be considered in Chapter 5. Note that many methods
of generations can also be used in the case when random noise is present in the
evaluations of f.
The approaches mentioned do not completely cover the variety of global random
search methods. Some of them have been described above (in particular, Algorithm 2.3.1
and the method of Section 2.1.6 based on smoothing the objective function). Many others

can easily be constructed by the reader: to do this, it suffices to introduce a random element
into any deterministic algorithm. (This method was used e.g. by Lbov (1972), who
proposed to seek the minimum of the minorant (2.2.30) of the objective function by a
random sampling algorithm, and thus transformed the polygonal line method of Section
2.2.2 into a global random search algorithm.)

3.1.4 Random coverings


The idea of covering, studied in Section 2.2, is used here to improve the efficiency of the
simplest random search algorithms.
First, let us introduce the notation P_Z for the uniform distribution on a set Z and
consider the algorithm of Brooks (1958, 1959), in which X is covered by balls of equal
radius ε with centres at uniformly chosen random points.

Algorithm 3.1.3. (Uniform random covering).

1. Set k = 1, f_0* = ∞, Z_1 = X.

2. Obtain a point x_k by sampling from P_{Z_k}.
3. Evaluate f(x_k) and determine f_k*.
4. If k = N or the volume of Z_k is sufficiently small, then terminate the algorithm: choose
the point x_k* with f(x_k*) = f_k* as an approximation for x*. If k < N, then set
Z_{k+1} = Z_k \ B(x_k, ε) and return to Step 2 (substituting k+1 for k).
Provided that the distributions P_Z are sampled in Algorithm 3.1.3 using the rejection
technique (i.e. the distribution P_X is sampled until a realization occurs in Z), it differs
from Algorithm 3.1.1 in the following detail only: if at the k-th iteration of Algorithm
3.1.3 a random point, uniformly distributed in X, falls into a ball B(x_i, ε) whose centre is
a previously accepted point x_i (1 ≤ i < k), then this point is rejected, f is not evaluated at it,
and a new random point is generated.
The radius ε of the balls in Algorithm 3.1.3 determines the accuracy of the approximation.
Thus, if f ∈ Lip(X,L) and ε = δ/L, then the objective function values at the rejected points
(each of them belonging to a ball B(x_i, ε), 1 ≤ i ≤ N) cannot exceed f_N* + δ.
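A sketch of Algorithm 3.1.3 with rejection sampling of P_{Z_k}; the box-shaped X is an illustrative assumption:

```python
import math, random

def uniform_random_covering(f, lo, hi, eps, N, seed=0, max_tries=10**5):
    """Algorithm 3.1.3 on a box: uniform candidates in X, rejecting any
    candidate within distance eps of a previously accepted point."""
    rng = random.Random(seed)
    accepted, f_best, x_best = [], float("inf"), None
    for _ in range(N):
        for _ in range(max_tries):   # rejection sampling of P_{Z_k}
            x = [rng.uniform(a, b) for a, b in zip(lo, hi)]
            if all(math.dist(x, c) > eps for c in accepted):
                break
        else:
            break                    # Z_k practically exhausted: stop early
        accepted.append(x)
        fx = f(x)
        if fx < f_best:
            f_best, x_best = fx, x
    return f_best, x_best, accepted

fb, xb, pts = uniform_random_covering(lambda x: x[0] + x[1],
                                      [0.0, 0.0], [1.0, 1.0],
                                      eps=0.05, N=50, seed=1)
```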
Certainly, instead of the balls B(xi,e) one can use other sets S(xi) containing Xi as the
forbidden sets for new points in Algorithm 3.1.3. For X=[O,I]n it is easy to investigate
the algorithm in which S(xi) are the 11-adic cubes (these cubes are obtained by dividing
each side of the cube X into 11 equal parts, see Section 2.2.1). Indeed, for this algorithm
we have
84 Chapter 3

Pr{xi E S(x*) for some i S;k} =


k
=1-~1(1-1_(t_1)C)=Ck

for each k:91 n instead of (3.1.4) where c= T\-n is the volume of each T\-adic cube S(xi)'
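The product in this formula telescopes to ck, which is easy to verify numerically; a sketch with the illustrative choice η = 3, n = 2 (so c = 1/9):

```python
def hit_probability(c, k):
    """1 - prod_{i=1}^k (1 - c/(1-(i-1)c)); the product telescopes,
    so the result equals c*k for k <= 1/c."""
    p_miss = 1.0
    for i in range(1, k + 1):
        p_miss *= 1.0 - c / (1.0 - (i - 1) * c)
    return 1.0 - p_miss

c = 3 ** -2   # eta = 3, n = 2: nine eta-adic cells, each of volume 1/9
vals = [hit_probability(c, k) for k in range(1, 10)]
```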
Algorithm 3.1.3 does not use the information contained in the values of f, and
so its efficiency cannot be high. The following method of Devroye (1978), constructed
under the supposition f ∈ Lip(X,L,ρ), is of higher efficiency.

Algorithm 3.1.4. (Nonuniform random covering).

1. Set k = 1, f_0* = ∞, Z_1 = X.

2. Obtain a point x_k by sampling from P_{Z_k}.
3. Evaluate f(x_k) and determine f_k*.
4. Compute

    η_i = ( f(x_i) − f_k* + δ ) / L        (3.1.11)

for each i = 1, ..., k, where δ is a given positive number.


5. Set

    Z_{k+1} = X \ ∪_{i=1}^k B(x_i, η_i, ρ).

6. Terminate the algorithm if Z_{k+1} is empty or k = N; otherwise return to Step 2
(substituting k+1 for k).
Algorithm 3.1.4 is a typical covering method: it differs from the methods of Section
2.2.2 in the way of obtaining the points x_k. According to the results of Section 2.2.2,
Algorithm 3.1.4 finds a global minimizer with accuracy δ with respect to values of f,
i.e. f_k* − f* ≤ δ, where k is the last iteration index, for which the set Z_{k+1} becomes empty.
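A one-dimensional sketch of this covering scheme for f ∈ Lip([a,b], L). It assumes the exclusion radii η_i = (f(x_i) − f_k* + δ)/L, under which the Lipschitz condition guarantees f ≥ f_k* − δ on every excluded interval, so emptiness of the uncovered set certifies accuracy δ:

```python
import random

def nonuniform_random_covering(f, L, delta, a=0.0, b=1.0, N=500, seed=0):
    """1-D sketch of Algorithm 3.1.4: exclude around each evaluated x_i the
    interval of radius (f(x_i) - f_k* + delta)/L and sample uniformly on the
    remaining set by rejection."""
    rng = random.Random(seed)
    pts, vals, f_best = [], [], float("inf")
    for _ in range(N):
        for _ in range(10**5):   # rejection sampling of the uncovered set Z_k
            x = rng.uniform(a, b)
            if all(abs(x - xi) > (yi - f_best + delta) / L
                   for xi, yi in zip(pts, vals)):
                break
        else:
            return f_best        # Z_{k+1} practically empty: f_best <= f* + delta
        y = f(x)
        pts.append(x)
        vals.append(y)
        f_best = min(f_best, y)
    return f_best

res = nonuniform_random_covering(lambda x: abs(x - 0.5), L=1.0, delta=0.05)
```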
It is not necessary to use exactly P_{Z_k}, the uniform distribution on Z_k: at each k-th iteration
of Algorithm 3.1.4, any distribution P_k on the set Z_k can be used instead of P_{Z_k}. In order
to ensure the convergence of the algorithm, Devroye (1978) supposed

    P_k = u_k P_{Z_k} + (1 − u_k) G_k,   where u_k ≥ 0,  Σ_{k≥1} u_k = ∞,

and G_k is an arbitrary distribution on Z_k (for instance, corresponding to performing some
local descent iterations). Algorithm 3.1.4 can be modified for the case when 𝓕 = C(X), i.e.
f is a continuous function. Devroye (1978) proved for this case that if the sequence {u_k}
is defined as above and the radii η_i of (3.1.11) are replaced by quantities β_k, where β_k > 0,
β_k → 0 for k → ∞, then the algorithm converges almost
surely with respect to the values of f. Closely related general convergence investigations
can be found e.g. in Solis and Wets (1981) or Pinter (1984).

3.1.5 Formal scheme of global random search

We shall describe here a formal scheme of global random search algorithms that may be
useful in investigating some of their general properties.

Algorithm 3.1.5.

1. Choose a probability distribution P_1 on X, set k = 1.

2. Obtain points

    x_1^{(k)}, ..., x_{N_k}^{(k)}

by sampling N_k times from the distribution P_k. Evaluate f (perhaps with random noise)
at these points.
3. According to a fixed (algorithm-dependent) rule, construct a probability distribution
P_{k+1} on X.
4. Check some appropriate stopping condition; if the algorithm is not terminated, then
return to Step 2 (substituting k+1 for k).
In other words, any global random search algorithm involves some iterations; at each
iteration, a suitably constructed distribution is sampled. For several classes of algorithms
(e.g., those described in Sections 3.3 and 5.4) N_k = 1 for each k = 1, 2, ..., and their
representation in the form of Algorithm 3.1.5 is completely formal and gives nothing. But
for some others (e.g., those described later in Sections 4.3, 5.1 and 5.3) this representation is
essential and helps in understanding the algorithm and the possibilities of its theoretical study.
The construction of the distributions {P_{k+1}} in Algorithm 3.1.5 determines the way
of deriving and using the information concerning the objective function, given a priori or
received during the search. As a rule, they can be written as

    P_{k+1}(dx) = ∫_X R_k(dz) Q_k(z, dx),        (3.1.12)

where R_k is a probability distribution on X and Q_k(z,·) is a (Markovian) transition
probability, i.e. a measurable nonnegative function with respect to the first argument and a
probability measure with respect to the second. The distribution (3.1.12) is sampled by
applying the superposition technique: viz., first R_k and then Q_k(z,·) is sampled, where z
is the realization obtained via sampling R_k.
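The superposition technique can be sketched directly; the concrete choices of R_k and Q_k(z,·) below (uniform on X = [0,1], small Gaussian perturbation kept inside X by resampling) are illustrative:

```python
import random

def sample_superposition(sample_Rk, sample_Qk, rng):
    """Sample P_{k+1}(dx) = ∫ R_k(dz) Q_k(z, dx): draw z from R_k,
    then x from Q_k(z, .)."""
    z = sample_Rk(rng)
    return sample_Qk(z, rng)

rng = random.Random(0)
sample_R = lambda r: r.random()          # uniform on X = [0,1]
def sample_Q(z, r):
    while True:                          # perturbation, resampled until in X
        x = z + r.gauss(0.0, 0.05)
        if 0.0 <= x <= 1.0:
            return x

x = sample_superposition(sample_R, sample_Q, rng)
```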
The roles of R_k and Q_k(z,·) in the representation (3.1.12) are different. Namely, the
construction of R_k takes into account information on the global features of f, while that of
Q_k(z,·) reflects the local features of f. In sampling R_k, a point z from the whole of X is chosen, while

in sampling Q_k(z,·) a point from the neighbourhood of z is selected (the word
neighbourhood is meant here in the probabilistic sense: near enough with large
probability).
The way of constructing the distributions R_k establishes, to a great degree, the general
structure and originality of the algorithm. The description of such ways and the study of
the corresponding algorithms is the main object of Chapters 3, 4 and 5. (Let us remark in
advance that in the Markovian algorithms of Section 3.3 and in Algorithm 5.1.1 the
distributions R_k are concentrated at one of the earlier obtained points; in most methods of
generations of Chapter 5, at the points of a preceding iteration; and in the branch and
probability bound methods of Section 4.3 the distributions are uniform on subsets of X
which have been recognized as promising for further search.)
Alternatively, the choice of the transition probabilities Q_k(z,·) determines the local
behaviour of an algorithm. Let us now study some ways of making this choice.

3.1.6 Local behaviour of global random search algorithms


Consider first the interdependence of the local and global behaviour of global random search
algorithms, as presented in Algorithm 3.1.5 with the distribution P_{k+1} of (3.1.12). The
algorithm design should take into account the desired solution accuracy: if this accuracy is
high, then the algorithm must have good local performance; but if the aim is to hit the set
A(δ) for some not too small δ > 0, then strong local properties can result in inefficiency:
the better the local performance, the larger the number of additional evaluations of f.
In any case, at the first iterations (for small k), where the main search task is to explore the
global properties of f, a simple choice of the transition probabilities is recommended,
requiring no additional evaluations of f and convenient for theoretical study. For
instance, such a choice is

    Q_k(z, dx) ∝ φ_k(x − z) dx,   x ∈ X,        (3.1.13)

where φ_k is a distribution density in ℝ^n. In order to obtain a random realization x_k in X
from the distribution (3.1.13), one needs to obtain a realization ξ_k in ℝ^n from the density
φ_k, to check the relation z + ξ_k ∈ X (if it does not hold, then a new realization ξ_k is
needed), and to take x_k = z + ξ_k. The choice (3.1.13) is natural only in the case when
random noise is present in the evaluations of f. Similarly, the transition probabilities are
often selected in the form

    Q_k(z, A) = T_k( z, A ∩ {x: f(x) ≤ f(z)} ) + 1_A(z) T_k( z, {x: f(x) > f(z)} ),        (3.1.14)

where T_k(z,dx) is a Markovian transition probability having the form of (3.1.13). Given a
fixed z, in order to obtain a realization x_k from the distribution (3.1.14), one needs first to
get a realization ξ_k from the distribution T_k(z,·) and put

    x_k = ξ_k   if f(ξ_k) ≤ f(z),
    x_k = z     otherwise.
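One sampling of Q_k(z,·) of the form (3.1.14) is thus a single local random search step with return on unprofitable moves; a sketch, with an illustrative Gaussian proposal T_k kept inside X = [0,1] by resampling:

```python
import random

def local_step(z, fz, f, sample_Tk, rng):
    """One sampling of Q_k(z, .) in the form (3.1.14): propose xi ~ T_k(z, .),
    keep it if it does not increase f, otherwise stay at z."""
    xi = sample_Tk(z, rng)
    fxi = f(xi)
    return (xi, fxi) if fxi <= fz else (z, fz)

rng = random.Random(2)
f = lambda x: (x - 0.25) ** 2            # illustrative objective on [0,1]
def T(z, r):
    while True:
        x = z + r.gauss(0.0, 0.1)
        if 0.0 <= x <= 1.0:
            return x

z, fz = 0.9, f(0.9)
for _ in range(200):                     # repeated steps descend monotonically
    z, fz = local_step(z, fz, f, T, rng)
```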

Attempts to determine the details of the objective function's behaviour all over the set X by
means of a small number of evaluations cannot be successful; only in the simplest cases can
rough approximations of f be used for determining the points where many evaluations of f
should be carried out in expectation of a deep local descent.
On the other hand, if there are good reasons to believe that some points x_j^{(k)}
(j = 1, ..., N_k) of the k-th iteration of Algorithm 3.1.5 are not very far from a global
minimizer, then local descent may be profitable. The depth of the descent is defined by the
transition probabilities Q_k(z,·): for a fixed realization z of R_k, the choice of the point
x_j^{(k+1)} consists in carrying out several (possibly, one) local descent iterations
defined by the method of sampling Q_k(z,·). If Q_k has the form of (3.1.13), then no local
descent is performed and x_j^{(k+1)} is chosen in a vicinity of z; if Q_k has the form of
(3.1.14), which is possible only if there is no evaluation noise, then a single local random
search iteration is done (with return for unprofitable steps). One can directly see how the
method of sampling Q_k should be defined in order to correspond to one or several
iterations of other local random search algorithms (they are thoroughly described
e.g. in Rastrigin (1968)), of any deterministic local descent (in the latter case the
distributions Q_k(z,·) are degenerate for each z), or even of stochastic approximation type
algorithms (in the case of evaluation noise).
It is also evident how to make the number of local iterations from z proportional
to the evaluated value of f(z). The extreme case, where sampling Q_k(z,·) corresponds to
the transition from the point z to a local minimizer (to whose domain of attraction z belongs),
is unlikely to be acceptable, except in very simple situations. Optimization problems, of
course, occur where evaluation of the derivatives of f is rather simple; in this case, local
descent iterations may prove useful already at the first iterations. The corresponding
algorithms can be regarded as modifications of the random multistart method, see Sections
2.1.3 and 4.5.
It is worth noting that there is no need to know the analytical form of the transition
probabilities Q_k(z,·), or of the distribution R_k: one needs only an algorithm for their
sampling, i.e. for passing from z to a point of the (k+1)-st iteration. The efficiency of a
global search algorithm, hence, may be improved if the complexity of the algorithm for
sampling Q_k(z,·) is increased with k (including also additional evaluations of f). In doing
so, it seems natural to take smaller N_k for greater values of k than for small ones. The
quantities N_k and the iteration indices at which the sampling algorithms for Q_k(z,·) become
more complicated can be defined in advance (using prior knowledge about the
objective function's behaviour and the accuracy of the resulting approximations) as well as in
the course of the search (using the obtained information concerning f).

3.2 General results on the convergence of global random search algorithms

A number of works are devoted to the derivation of sufficient conditions for the convergence
of general global random search algorithms: notable examples include Devroye (1976,
1978), Marti (1980), Solis and Wets (1981), and Pinter (1984). These works contain results
similar to Theorem 3.2.1 presented below, as well as consequences for some particular
methods (including stochastic programming methods, where the objective function is
subjected to random noise and the inclusion x ∈ X holds with a probability depending
on x). Instead of using the results of the above-mentioned works, we shall derive the
convergence results relying upon the classical zero-one law.
Consider a general global random search method represented in the form of Algorithm
3.1.5; without loss of generality assume that N_k = 1 for all k = 1, 2, ..., that is, a separate
distribution P_k is constructed for each point x_k = x_1^{(k)}.

Theorem 3.2.1. Let f be continuous in the vicinity of a global minimizer x* of f, and
assume that

    Σ_{k=1}^∞ q_k = ∞        (3.2.1)

for any ε > 0, where

    q_k = q_k(x*, ε) = vrai inf_{Ξ_{k−1}} P_k(B(ε)).

Then for any δ > 0 the sequence of random vectors x_1, x_2, ... generated by Algorithm 3.1.5
with N_k = 1 (k = 1, 2, ...) falls infinitely often into the set A(δ) with probability one.

Proof. Fix δ > 0 and find ε > 0 such that B(ε) ⊂ A(δ). Determine a sequence of
independent random variables {η_k} on the two-point set {0,1} so that, for the fixed
ε > 0, Pr{η_k = 1} = 1 − Pr{η_k = 0} = q_k(x*, ε). Obviously, the probability of x_k falling into B(ε)
is, for all k = 1, 2, ..., not less than the probability q_k of η_k being in the state 1, and,
therefore, the theorem's assertion will be proved if one demonstrates that the sequence η_k
takes the value 1 infinitely often. Since the latter follows from (3.2.1) and the first part of
Borel's zero-one law, the theorem is proved.
Theorem 3.2.1 is valid also in the general case when f is subject to random
noise. If noise is absent, then the conditions of the theorem ensure that the sequence of
record points {x_k*} converges to the set X* = {arg min f} of global minimizers with
probability 1.
By virtue of the Borel-Cantelli lemma, one can see that if

    Σ_{k=1}^∞ P_k(B(ε)) < ∞,

then the points x_1, x_2, ... fall into B(ε) at most a finite number of times. Moreover, as is
illustrated by a suitable function on X = [0,1] depending on a sufficiently small positive
parameter ε_0, condition (3.2.1) cannot be improved for a rather
wide class of functions 𝓕 (even for the class of analytic functions with two minima): i.e.
if (3.2.1) is not satisfied, there exist f ∈ 𝓕 and ε > 0 such that, with positive probability,
none of the points x_1, x_2, ..., x_N falls into B(ε), where N is an arbitrarily large
fixed number.
Since the location of x* is not known a priori, instead of (3.2.1) the stricter but
simpler requirement

    Σ_{k=1}^∞ vrai inf_{Ξ_{k−1}} P_k(B(x, ε)) = ∞        (3.2.2)

can be prescribed for all ε > 0, x ∈ X.


Let us consider some known ways of selecting the probability measures P_k for which the
fulfilment of (3.2.2) may easily be guaranteed.
In practice, one of the most popular ways of selecting the distributions P_k is

    P_k = α_k P_X + (1 − α_k) Q_k,        (3.2.3)

where 0 ≤ α_k ≤ 1, P_X is the uniform distribution on X, and Q_k is an arbitrary probability
measure on X (for instance, the sampling of Q_k may correspond to performing some
iterations of a local descent from the point x_{k−1}*). Sampling from the distribution
(3.2.3) means sampling from P_X with probability α_k and from Q_k with the complementary
probability 1 − α_k. In spite of the quite common belief that

    liminf_{k→∞} α_k > 0

should be satisfied for the corresponding algorithm to converge, the weaker requirement

    Σ_{k=1}^∞ α_k = ∞

is sufficient for (3.2.2), and hence for ensuring convergence.
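A sketch of sampling from the mixture (3.2.3) with the illustrative choice α_k = 1/k, for which Σ α_k diverges while the share of uniform (global) steps still tends to zero; the record point and the local measure Q_k below are hypothetical:

```python
import random

def sample_Pk(k, sample_uniform, sample_Qk, rng):
    """Sample P_k = a_k * P_X + (1 - a_k) * Q_k with a_k = 1/k,
    so that sum_k a_k = infinity (the weaker convergence requirement)."""
    a_k = 1.0 / k
    if rng.random() < a_k:
        return sample_uniform(rng)   # global exploration step
    return sample_Qk(rng)            # e.g. local step around the record point

rng = random.Random(3)
record = 0.7                         # hypothetical current record point in [0,1]
xs = [sample_Pk(k,
                lambda r: r.random(),
                lambda r: min(1.0, max(0.0, record + r.gauss(0.0, 0.02))),
                rng)
      for k in range(1, 501)]
```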


In practice, one often also chooses P_k in the form of (3.1.12), with the Markov
transition probabilities Q_k(z,·) in the form of (3.1.13) (in the case of evaluations subject
to noise) or (3.1.14), where the T_k(z,·) are chosen following (3.1.13). To satisfy (3.2.2) in
the above cases, it is obviously sufficient that for each k = 1, 2, ... the density φ_k of
(3.1.13) be representable as φ_k(x) = β_k^{−n} φ(x/β_k), where {β_k} is a nonincreasing sequence
of positive numbers and the density φ(x) is symmetrical, continuous, decreasing for x > 0,
and decomposes into a product of one-dimensional densities, and that the condition

    Σ_{k=1}^∞ ∫_{d_1}^{d_1+ε} ... ∫_{d_n}^{d_n+ε} β_k^{−n} φ(x/β_k) dx = ∞        (3.2.4)

be satisfied, where d_i is the diameter of X with respect to the i-th coordinate and ε > 0 is
arbitrary. It is, of course, difficult to check (3.2.4) in the general case, but it becomes
rather simple for some particular choices of φ. Consider e.g. the following choices of φ:

    φ(x) = Π_{i=1}^n (λ_i/2) exp{ −λ_i |x^{(i)}| },        (3.2.5)

    φ(x) = Π_{i=1}^n (λ_i/√(2π)) exp{ −λ_i² (x^{(i)})²/2 },        (3.2.6)

where x = (x^{(1)}, ..., x^{(n)}) and λ_i > 0 (i = 1, ..., n) are arbitrary constants. The coordinates of
random vectors distributed with densities (3.2.5) and (3.2.6) are independent and
distributed according to the Laplace and Gaussian distributions, respectively.

Lemma 3.2.1. Let φ be defined according to (3.2.5). Then (3.2.4) is satisfied if for any
b>0 the following relation holds:

∑_{k=1}^∞ b^{1/β_k} = ∞.                    (3.2.7)

In particular, β_k (k≥3) can be chosen as β_k = c_1/log(c_2 log k), where c_1>0, c_2≥1 are
arbitrary.

Proof. Set λ = max(λ_1,...,λ_n). The condition (3.2.4) will be satisfied for the density
(3.2.5), if it is satisfied for the density

φ*(x) = (λ/2)^n exp{−λ ∑_{i=1}^n |x(i)|}.

(3.2.4) is satisfied if for any ε>0

∑_{k=1}^∞ ∫_d^{d+ε} ⋯ ∫_d^{d+ε} β_k^{−n} φ*(x/β_k) dx = ∑_{k=1}^∞ 2^{−n} [(1 − e^{−λε/β_k}) e^{−λd/β_k}]^n = ∞

is satisfied, where d = max{d_1,...,d_n}. Since {β_k} is nonincreasing,
1 − e^{−λε/β_k} ≥ 1 − e^{−λε/β_1} > 0 for all k=1,2,..., so the latter condition does hold
if (3.2.7) is met with b = exp(−nλd).
The lemma is proved.
Similar reasoning applies to (3.2.6), where instead of (3.2.7) we have

∑_{k=1}^∞ b^{1/β_k²} = ∞,

and a possible choice of the parameters β_k is

β_k = c_1/(log(c_2 log k))^{1/2}     for c_1 > 0, c_2 ≥ 1, k ≥ 3.
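A quick numerical sanity check of the Laplace-case schedule in Lemma 3.2.1, with the hypothetical choices b = 0.5 and c_1 = c_2 = 1: the general term of (3.2.7) then equals (c_2 log k)^{log b/c_1}, which decays slower than any power of 1/k, so the series diverges by comparison with ∑ 1/k.

```python
import math

def beta_k(k, c1=1.0, c2=1.0):
    # schedule from Lemma 3.2.1: beta_k = c1 / log(c2 * log k), for k >= 3
    return c1 / math.log(c2 * math.log(k))

def term(k, b=0.5, c1=1.0, c2=1.0):
    # general term b**(1/beta_k) of the series (3.2.7)
    return b ** (1.0 / beta_k(k, c1, c2))
```

Even at k = 10^6 the term is still of order 0.1, i.e. far above 1/k, which is what makes the comparison work.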

To obtain sufficient conditions for convergence of global random search algorithms,
Moiseev and Nekrutkin (1987) generalized some results of stochastic approximation
theory and proved the next statement (an analogous proof can be found also in
Zhigljavsky (1985)).

Theorem 3.2.2. Let f be a bounded function on X attaining its global minimum at the
unique point x*. Assume that, in the vicinity of x*, f is continuous, and for any ε>0
there exists δ=δ(ε)>0 such that

∑_{k=1}^∞ vrai inf_{Ξ_k∈(A(ε))} P_{k+1}({x ∈ X: f(x) < f_k* − δ}) = ∞.                    (3.2.8)

Then x_k* → x* for k → ∞ with probability 1.

It seems that the condition (3.2.8) is somewhat less restrictive than (3.2.1). But it is
not yet known whether the consequences of Theorem 3.2.2 are more interesting than those
of Theorem 3.2.1.
One may qualitatively interpret the relationship (3.2.8) as follows. If x_k* = x and in the
vicinity of x the function f varies intensively (which implies that x is far from the local
minimizers), then the search has to be local (i.e. x_{k+1} must be located close to x) with a
sufficiently high probability. If in the vicinity of x the function f is slowly varying (and
thus it is likely that x is close to a local minimizer of f), but it is hardly possible that x is
close to x*, then the search must be global, i.e. x_{k+1} must be far from x with a high
probability. Finally, if in the vicinity of x the function f is slowly varying and it is possible
that x is in the region of attraction of x*, then it is worthwhile to alternate local and global
steps: the former ensure convergence to x* (if the region of attraction was really
attained), while the latter ensure divergence of the series (3.2.8) (if the region was not
attained).
Finally, let us indicate a general result on the speed of convergence of global random
search algorithms. Set

a(k,β,ε) = vrai inf_{Ξ_k∈(A(ε))} Pr{f*_{k+1} < β f(x_1) + (1 − β) f_k*},

γ_r(k,ε) = sup_{0<β<1} (1 − (1 − β)^r) a(k,β,ε),     C_r = E(f(x_1) − f(x_2))^r.

Then, according to Moiseev and Nekrutkin (1987), an inequality bounding the rate of
convergence in terms of γ_r(k,ε) and C_r holds for all r>0, N≥1.


It should be noted that the general results stated above on convergence and on
convergence speed are mainly of theoretical importance.

3.3 Markovian algorithms

Markovian algorithms of global optimization represent an advanced field of random search
theory. These are algorithms based on sampling Markov chains, giving a sequence of
random points x_1, x_2, ... such that, given x_{k+1} and the corresponding function value,
the point x_{k+2} is conditionally independent of the collection of points and objective
function values

x_1, y_1, ..., x_k, y_k                    (3.3.1)

for each k=1,2,...


The great theoretical importance of Markovian algorithms is not matched by their
practical significance. They are inefficient for complicated problems owing to their
neglect of the information contained in (3.3.1): this finding is confirmed by various
numerical results.
This section summarizes the recent results on Markovian algorithms: its subsections
present a general algorithm scheme, describe the well-known simulated annealing method
as well as some particular algorithms, and analyse methods connected with the solution of
stochastic differential equations.

3.3.1 General scheme of Markovian algorithms

Let y_k (k=1,2,...) be the result of evaluating the objective function f at a point x_k (this
evaluation may be subject to noise); further, let P_{k+1}(·) = P_{k+1}(·| x_1,y_1,...,x_k,y_k) be the
probability distribution of the point x_{k+1} generated by a global random search algorithm.
The Markovian property of the algorithm means that for all k=1,2,... the distributions
P_{k+1} depend only on x_k, y_k, that is

P_{k+1}(·| x_1,y_1,...,x_k,y_k) = P_{k+1}(·| x_k,y_k).                    (3.3.2)

If the evaluations of f are not subject to noise, then (3.3.2) takes the form
P_{k+1}(·| x_1,...,x_k) = P_{k+1}(·| x_k, f(x_k)).
In essence, Markovian global random search algorithms are modifications of local
ones, alternating local steps with global ones. If the evaluations of f are subject to noise,
these algorithms are sometimes named multiextremal stochastic approximation algorithms
due to the analogy with the local case (for instance, see Vaysbord and Yudin (1968)).
Different variants of Markovian global random search algorithms were proposed and
studied by many authors, starting from the late 1960's. In spite of the abundance of works,
many algorithms differ only in secondary details. Below we shall describe the principal
algorithms, starting with a general scheme for the case X ⊂ ℝⁿ.

Algorithm 3.3.1. (General scheme of Markovian algorithms).

1. Sampling a given distribution P_1, obtain a point x_1. Evaluate y_1, the value (perhaps
subject to noise) of the objective function f at the point x_1. Set k=1.
2. Obtain a point z_k from ℝⁿ by sampling a distribution Q_k(x_k,·) with density
q_k(x_k,x) depending on k and x_k.
3. If z_k ∉ X, return to Step 2. Otherwise evaluate η_k, the value of f at z_k (η_k may be
subject to noise), and set

x_{k+1} = { z_k   with probability p_k,
          { x_k   with probability 1 − p_k,                    (3.3.3)

where p_k = p_k(x_k, z_k, y_k, η_k) is the acceptance probability and may depend on k, x_k, z_k,
y_k, η_k.
4. Set

y_{k+1} = { η_k   if x_{k+1} = z_k,
          { y_k   if x_{k+1} = x_k.

5. Check some given stopping criterion. If the algorithm does not stop, then return to
Step 2 (substituting k+1 for k).
An ordinary way of realizing (3.3.3) consists of calculating p_k, obtaining a random
number u_k, checking the inequality u_k ≤ p_k, and setting

x_{k+1} = { z_k   if u_k ≤ p_k,
          { x_k   if u_k > p_k.

Particular choices of the initial probability P_1, transition probabilities Q_k(x,·), and acceptance
probabilities p_k(x,z,y,η) lead to concrete Markovian global random search algorithms.
To obtain convergence conditions, the results of Section 3.2 can be used; however, the
simplicity of Markovian algorithms allows one to get more specific results on their
convergence and convergence speed. These interesting theoretical results, together with
the simplicity and plain interpretation of the algorithms, explain their popularity.
One of the most well-known is simulated annealing, considered first.
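The general scheme above can be sketched as a small driver with pluggable proposal and acceptance rules; the objective, the clipped Gaussian proposal, and the 0/1 improvement-only acceptance used in the demo are hypothetical choices (the 0/1 rule anticipates Baba's algorithm of Section 3.3.5).

```python
import random

def markov_search(f, x1, propose, accept_prob, n_iter=1000, seed=1):
    """Algorithm 3.3.1 in outline: propose z_k from Q_k(x_k, .), accept it
    with probability p_k = accept_prob(...), otherwise keep x_k."""
    random.seed(seed)
    x, y = x1, f(x1)
    best_x, best_y = x, y
    for k in range(1, n_iter + 1):
        z = propose(x, k)
        eta = f(z)
        if random.random() <= accept_prob(x, z, y, eta, k):
            x, y = z, eta                 # step accepted: x_{k+1} = z_k
        if y < best_y:
            best_x, best_y = x, y         # record point of the trajectory
    return best_x, best_y

# demo: clipped Gaussian proposal on X = [0, 1], improvement-only acceptance
def _propose(x, k):
    return min(max(x + random.gauss(0.0, 0.1), 0.0), 1.0)

best_x, best_y = markov_search(lambda t: (t - 0.3) ** 2, 0.9, _propose,
                               lambda x, z, y, eta, k: 1.0 if eta <= y else 0.0)
```

Swapping in a different `accept_prob` turns the same driver into simulated annealing, Zielinski's method, and so on.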

3.3.2 Simulated annealing

The simulated annealing method was recently proposed by Kirkpatrick et al. (1983) and in
a short time became popular.
It was called simulated annealing because of its similarity to the physical procedure called
annealing, used to remove defects from metals and crystals by heating them locally near the
defect to dissolve the impurity and then slowly cooling them so that they can reach a
state with a lower energy configuration. The simulated annealing method can be
referred to as a variation of the classical Metropolis method (see Metropolis et al. (1953)
or Bhanot (1988)) which simulates the behaviour of an ensemble of atoms in equilibrium
at a given temperature: it is a powerful technique in numerical studies in several branches
of science, being useful in situations when it is necessary to generate random variables
distributed according to a given multivariate probability distribution.
Let the feasible region X be a discrete set or a compact subset of ℝⁿ, and let f: X → (0,∞)
be a positive function (not subject to noise). If X is discrete, then the uniform measure on
X replaces the Lebesgue measure, and for any function g defined on X the symbol
∫g(x)dx stands for

∑_{x_i∈X} g(x_i).

According to the physical interpretation, a point x ∈ X corresponds to a configuration of
atoms of the substance and f(x) determines the energy of the configuration. Because of the
very large number of atoms and possible arrangements there is a great number of local
energy minimum configurations (that is, local minimizers of f).
The standard variant of simulated annealing is as follows.
An initial point x_1 ∈ X is chosen arbitrarily. Let x_k be the point of the k-th step (k≥1),
f(x_k) be the corresponding objective function value, β_k be a positive parameter, and ξ_k be
a realization of a random vector having some probability distribution Φ_k (if X ⊂ ℝⁿ then it
is natural to choose Φ_k as an isotropic distribution on ℝⁿ). Then one should check the
inclusion z_k = x_k + ξ_k ∈ X (otherwise return to obtaining a new realization ξ_k), evaluate
f(z_k), and set Δ_k = f(z_k) − f(x_k),

p_k = min{1, exp(−β_k Δ_k)} = { 1                  if Δ_k ≤ 0,
                              { exp(−β_k Δ_k)     if Δ_k > 0,                    (3.3.4)

x_{k+1} = { z_k   with probability p_k,
          { x_k   with probability 1 − p_k.

This means that a promising new point z_k (for which f(z_k) ≤ f(x_k)) is accepted
unconditionally, but a non-promising one (for which f(z_k) > f(x_k)) may also be accepted,
with probability p_k = exp{−β_k Δ_k}. As the probability of accepting a point which is worse
than the preceding one is always greater than zero, the search trajectory may leave a local
and even a global minimizer. (Note that the probability of acceptance decreases as the
difference Δ_k = f(z_k) − f(x_k) increases.)
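A minimal sketch of the standard variant above; the objective, the Gaussian proposal standing in for Φ_k, and the schedule β_k = log(2+k) (i.e. T_k = 1/log(2+k) with c = 1) are assumptions made for the demo.

```python
import math, random

def sa_accept(delta_k, beta_k):
    """Acceptance probability (3.3.4): 1 for delta_k <= 0,
    exp(-beta_k * delta_k) for delta_k > 0."""
    return 1.0 if delta_k <= 0 else math.exp(-beta_k * delta_k)

def simulated_annealing(f, x1, n_iter=5000, sigma=0.1, seed=2):
    random.seed(seed)
    x, fx = x1, f(x1)
    best_x, best_fx = x, fx
    for k in range(1, n_iter + 1):
        beta = math.log(2.0 + k)       # T_k = 1/log(2+k) -> 0 slowly
        z = min(max(x + random.gauss(0.0, sigma), 0.0), 1.0)
        fz = f(z)
        if random.random() <= sa_accept(fz - fx, beta):
            x, fx = z, fz              # worse points remain acceptable
        if fx < best_fx:
            best_x, best_fx = x, fx
    return best_x, best_fx

# demo objective: multimodal on [0, 1] with global minimum f = 0 at x = 0.2
best_x, best_fx = simulated_annealing(
    lambda t: (t - 0.2) ** 2 + 0.05 * (1.0 - math.cos(10.0 * math.pi * t)), 0.9)
```

Because uphill moves stay possible at every finite temperature, the trajectory can cross the barriers between the local minima on its way to the global one.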
The expression (3.3.4) for the acceptance probability p_k is motivated by the annealing
process modelled in the simulated annealing method. In statistical mechanics, the
probability that the system will transit from a state with energy E_0 to a state with energy
E_1, where ΔE = E_1 − E_0 > 0, is exp(−ΔE/(KT)), where K = 1.38·10⁻¹⁶ erg/K is the Boltzmann
constant and T is the absolute temperature. Thus β = 1/(KT), and the lower the temperature
is, the smaller is the probability of transition to a higher energy state.
If β_k = β = 1/(KT), i.e.

p_k(x,z,y,η) = p(x,z) = min{1, exp(β(f(x) − f(z)))},                    (3.3.5)

and Q_k(x,·) = Q(x,·) do not depend on the iteration count k, then the point sequence {x_k}
constitutes a homogeneous Markov chain, converging (under rather general conditions on
Q and f) in distribution to the stationary Gibbs distribution having the density

π_T(x) = exp{−f(x)/(KT)} / ∫ exp{−f(z)/(KT)} dz.                    (3.3.6)

(The density is meant with respect to the Lebesgue measure in the continuous case, and with
respect to the uniform measure in the discrete case.) Consequently, as T → 0 (or β → ∞),
the Gibbs density π_T defined by (3.3.6) tends to concentrate on the set of global
minimizers of f (subject to some mild conditions, see Geman and Hwang (1986), Aluffi-
Pentini et al. (1985)). In particular, if the global minimizer x* of f is unique, then the
Gibbs distribution converges to the δ-measure concentrated at x* for T → 0.
Numerically, if T is small (i.e. β is large), then the points x_k obtained by a
homogeneous simulated annealing method have the tendency to concentrate near the
global minimizer(s) of f. Unfortunately, the time required to approach the stationary
Gibbs distribution increases exponentially with 1/T and may reach astronomical values for
small T (as confirmed also by numerical results). This can be explained by the fact that for
small T a homogeneous simulated annealing method tends to behave like the local random
search algorithm that rejects unprofitable steps, and so its global search features are poor.
The homogeneous simulated annealing method is a particular case of the above
mentioned Metropolis algorithm, which uses the acceptance probabilities

p(x,z) = min{1, w(z)/w(x)} = { 1            if w(z) ≥ w(x),
                             { w(z)/w(x)   if w(z) < w(x),                    (3.3.7)

instead of (3.3.5), where w is an arbitrary summable positive function on X. The
expression (3.3.5) comes from (3.3.7) after setting w(x) = exp(−βf(x)). The points x_k
generated by the Metropolis algorithm converge in distribution to a stationary distribution
with the density
φ_w(x) = w(x) / ∫ w(z) dz,                    (3.3.8)

which is proportional to the function w and generalizes (3.3.6). The conditions
guaranteeing the convergence and the convergence speed are thoroughly investigated; for
references see Bhanot (1988). The transition probabilities Q(x,·) are usually chosen in the
Metropolis algorithm in such a way that the transition density from x to z is equal to
that from z to x for any x, z ∈ X (the symmetry property).
Formulae (3.3.7) are not the unique way of choosing the acceptance probabilities of
the above described Markov chain that allows (3.3.8) to be the stationary density of the
chain. According to Bhanot (1988), one can use the more general functional form

p(x,z) = { g(w(x)/w(z))                   if w(z) ≥ w(x),
         { (w(z)/w(x)) g(w(z)/w(x))      if w(z) < w(x),

for this purpose, where g is an arbitrary function with 0 ≤ g ≤ 1, g ≢ 0. For instance, for
g(t) = 1/(1+t) we have

p(x,z) = w(z)/(w(x) + w(z))

for all x, z ∈ X.
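The two acceptance rules can be compared directly in code; what matters for stationarity with respect to (3.3.8) under a symmetric proposal is that the ratio p(x,z)/p(z,x) equals w(z)/w(x), which both the Metropolis rule (3.3.7) and the g(t) = 1/(1+t) rule satisfy. The helper names below are ours.

```python
def metropolis_accept(wx, wz):
    # (3.3.7): accept with probability min(1, w(z)/w(x))
    return min(1.0, wz / wx)

def gt_accept(wx, wz):
    # the g(t) = 1/(1+t) special case: p(x,z) = w(z)/(w(x)+w(z))
    return wz / (wx + wz)

def balance_ratio(accept, wx, wz):
    """p(x,z)/p(z,x); equality with w(z)/w(x) is what makes (3.3.8)
    the stationary density under a symmetric proposal density."""
    return accept(wx, wz) / accept(wz, wx)
```

The g(t) = 1/(1+t) rule accepts improvements with probability below one, trading some acceptance rate for a smoother chain.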
Another way of constructing stationary Markov chains having the stationary density
(3.3.8) is due to Turchin (1971): it consists in setting p(x,z) ≡ 1,

q(x,z) = ∫ p(x,y) σ(y,z) dy,                    (3.3.9)

where p(x,y) is an arbitrary transition density and

σ(x,z) = p(z,x) w(z) / ∫ p(u,x) w(u) du.                    (3.3.10)

Let us turn now to the case of inhomogeneous simulated annealing methods that use
β_k → ∞ (or T_k → 0) for k → ∞ and, possibly, transition probabilities Q_k(x,·) depending on
k. It can be seen that the convergence of the distribution of the points x_k generated by the
simulated annealing method to a distribution concentrated on the set of global minimizers
of f can be guaranteed if T_k tends to zero slowly enough. For the standard variant of the
method, the choice T_k = c/log(2+k) is suitable, where c is a sufficiently large number
depending on f and X, see Mitra et al. (1986). Anily and Federgruen (1987) proved
general results on the above convergence and the rate of convergence of the generalized
discrete simulated annealing method described by general transition probabilities Q_k(x,·)
and general acceptance probabilities p_k(x,z) tending to either zero, one, or a constant for
the cases f(x) < f(z), f(x) > f(z), or f(x) = f(z), respectively.

A particular case of the above generalized simulated annealing method was proposed
and numerically investigated by Bohachevsky et al. (1986): the acceptance probabilities
have the form

p(x,z) = { 1                                            if f(z) ≤ f(x),
         { exp{−β [f(x) − f_min]^g (f(z) − f(x))}       if f(z) > f(x),                    (3.3.11)

where g is an arbitrary negative number (for g=0 (3.3.11) coincides with (3.3.4)) and
f_min is an estimate of f* = min f. If, for some x, the value f(x) − f_min becomes negative,
then it is proposed to decrease f_min and continue the search.
The reader interested in numerical realizations of the simulated annealing concept is
referred also to Corana et al. (1987), Haines (1987), and van Laarhoven and Aarts (1987).

3.3.3 Methods based on solving stochastic differential equations

The simulated annealing method can be interpreted as a discrete approximation to a
continuous (vector) process X_t which solves the stochastic differential equation

dX_t = −∇f(X_t) dt + (2T(t))^{1/2} dW_t,                    (3.3.12)

where W_t is the standard multivariate Brownian motion. A detailed study of the solutions
of stochastic differential equations of the type (3.3.12) is contained in Geman and Hwang
(1986). In particular, it is shown that the trajectory X_t, corresponding to the solution of
(3.3.12), has a stationary distribution with the Gibbs density (3.3.6) under some
conditions on f and X and T(t) ≡ T = const, when t → ∞. Moreover, the following is also
shown for the case when the global minimizer x* of f is unique: if there exists an
extension of f to an open set containing X that is twice continuously differentiable and
has no local minimizers outside X, t → ∞, and T(t) = c/log(2+t), where c is a sufficiently
large real number, then the trajectory of the process X_t solving the equation (3.3.12) has
a limit distribution concentrated at x*.
A discretized version of the stochastic differential equation (3.3.12), obtained with the
help of standard methodology, is the Markov chain

x_{k+1} = x_k − a_k ∇f(x_k) + (2 a_k T_k)^{1/2} η_k,                    (3.3.13)

where {a_k} is a sequence of real numbers, 0 < a_k ≤ 1, and {η_k} is a sequence of
independent Gaussian random vectors with zero mean and unit covariance matrix. The
search for the global minimizer in the algorithm (3.3.13) is performed in the direction of
the antigradient corrupted by a Gaussian noise. The attributes of the algorithm are similar
to those of the simulated annealing method.
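A toy one-dimensional discretization in the spirit of (3.3.13); the double-well objective, the constant step a_k = 0.01, and the schedule T(t) = 0.5/log(2+t) are all assumptions made for the sketch.

```python
import math, random

def langevin_step(x, grad, a_k, t_k):
    """One step of (3.3.13): antigradient move plus Gaussian noise
    whose variance 2*a_k*t_k is tied to the current temperature."""
    return x - a_k * grad(x) + math.sqrt(2.0 * a_k * t_k) * random.gauss(0.0, 1.0)

def run(n_iter=4000, seed=3):
    random.seed(seed)
    f = lambda x: (x * x - 1.0) ** 2          # double well, minima at -1 and +1
    grad = lambda x: 4.0 * x * (x * x - 1.0)  # its derivative
    x, best = 2.0, float("inf")
    for k in range(1, n_iter + 1):
        x = langevin_step(x, grad, 0.01, 0.5 / math.log(2.0 + k))
        x = max(-3.0, min(3.0, x))            # keep the iterate bounded
        best = min(best, f(x))
    return x, best

x_fin, best_f = run()
```

As the injected noise decays, the iterate settles into one of the wells; with the slow logarithmic schedule the theory above says the globally optimal well is favoured in the limit.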

General results on the asymptotic behaviour of the solutions of stochastic
equations of the type (3.3.12) and of their discrete analogues were obtained by Kushner
(1987), who studied the equations

dX_t = −∇f(X_t) dt + (e(t))^{1/2} σ(X_t) dW_t

and the discrete processes

x_{k+1} = x_k + a_k [−v(x_k, ζ_k) + σ(x_k) η_k],                    (3.3.14)

where {ζ_k} is a sequence of bounded random variables (possibly correlated), σ is a
nonnegative Lipschitz function on X, the function v(·,ζ) satisfies the Lipschitz condition
uniformly in ζ, Ev(x,ζ_k) = ∇f(x), a_k = c_0/log(k+c_1), e(t) = c_0/log(t+c_1), c_0 > 1, c_1 > 1.
The random vectors ζ_k play the role of random noise arising in the evaluation/estimation
of the gradient of f. Such noise occurs e.g. when Monte Carlo estimates of the gradient
are used, or the objective function and/or its derivatives are subject to random error.
Algorithm (3.3.14) is a generalization of many stochastic approximation algorithms of
local optimization type (these are obtained by setting σ = 0), and so it can be termed
multiextremal stochastic approximation. This term was used for a special form of the
algorithm (3.3.14), see Vaysbord and Yudin (1968). The properties of (3.3.14) were
considered also in Khasminsky (1965), Bekey and Ung (1974), Katkovnik (1976),
Aluffi-Pentini et al. (1985) and in some other works. Moreover, the approach of Section
2.1.6 is closely connected with the one presented above, hence it can be investigated
with the help of the above mentioned results.
It should be finally noted that - to the author's knowledge - no versions of these
algorithms are known to possess acceptable (competitive) numerical efficiency.

3.3.4 Global stochastic approximation: Zielinski's method

Zielinski (1980) proposed and studied another Markovian algorithm for global optimization
of a function f subject to random noise. It was called global stochastic approximation,
and it chooses the acceptance probabilities p_k of Algorithm 3.3.1 in the form

p_k = { r(y_k − η_k)   if η_k < y_k,
      { 0              if η_k ≥ y_k,                    (3.3.15)

where r: ℝ → (0,1] is a nondecreasing function, and y_k and η_k are the results of evaluating f at
the points x_k and z_k (the latter point is distributed according to Q_k(x_k,·)). Correspondingly,
the essence of the algorithm can be expressed by the rule

x_{k+1} = { z_k   if η_k < y_k and u_k ≤ r(y_k − η_k),
          { x_k   otherwise,                    (3.3.16)

where u_k is a random number. Comparing (3.3.4) with (3.3.15) we may conclude
that, in a sense, the simulated annealing algorithm and (3.3.16) are opposites. The former
method accepts each profitable step and also some unprofitable ones, while the latter
rejects all unprofitable steps and even some profitable ones. Of course, the
reasonableness of the algorithm (3.3.16) is due to the presence of random noise in the
evaluations of f. Zielinski (1980) and Zielinski and Neumann (1983) proved the
following.

Proposition 3.3.1. Let the global minimizer x* of f be unique and f be continuous in a
vicinity of x*. If the random noise present in the evaluations of f is uniformly bounded
almost surely, then the sequence of random points x_k determined by (3.3.16) converges
to x* in distribution. If the random noise is not bounded almost surely, then the sequence
of distributions of the points (3.3.16) has a limit distribution with a density φ(x) which
reflects certain features of f (for instance, it has the property that if f(x) ≤ f(z) for x and z
from X then φ(x) ≥ φ(z)).

The above works of R. Zielinski paid attention mainly to the case where the
distributions Q_k(x,·) are uniform on X (in this case the algorithm (3.3.16) is a
generalization of Algorithm 3.1.1 to the case of random noise). Besides, the mentioned
works studied some features of φ and provided recommendations concerning the choice
of the function r and the transition probabilities Q_k(x,·) in the algorithm (3.3.16).
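The rule (3.3.16) in code; the objective, the additive uniform noise model, the uniform proposals, and the particular r(t) = min(1, 4t) are hypothetical choices for the sketch.

```python
import random

def zielinski_step(x, y, z, eta, r, u):
    """Rule (3.3.16): accept z only if the observed value improved
    (eta < y) and u <= r(y - eta); otherwise keep x."""
    if eta < y and u <= r(y - eta):
        return z, eta
    return x, y

def search(n_iter=4000, noise=0.05, seed=4):
    random.seed(seed)
    f = lambda t: (t - 0.6) ** 2                          # true objective
    obs = lambda t: f(t) + random.uniform(-noise, noise)  # noisy evaluation
    r = lambda t: min(1.0, 4.0 * t)                       # nondecreasing, in (0, 1]
    x = random.uniform(0.0, 1.0)
    y = obs(x)
    for _ in range(n_iter):
        z = random.uniform(0.0, 1.0)                      # Q_k uniform on X
        x, y = zielinski_step(x, y, z, obs(z), r, random.random())
    return x

x_zs = search()
```

The throttle r makes small, possibly noise-induced "improvements" unlikely to be accepted, which is exactly the point of rejecting some profitable-looking steps under noise.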

3.3.5 Convergence rate of Baba's algorithm

Let us now return to the case where f is not subject to random noise and consider the
variant of Algorithm 3.3.1 in which

x_{k+1} = { z_k   if f(z_k) ≤ f(x_k),
          { x_k   if f(z_k) > f(x_k),                    (3.3.17)

and the transition densities q_k have the form q_k(x_k,x) = φ(x − x_k), where the density φ is
supposed to be given on ℝⁿ, continuous in a neighbourhood of zero, with φ(0) > 0. Note
that (3.3.17) implies that the acceptance probabilities p_k of Algorithm 3.3.1 are

p_k = { 1   if f(z_k) ≤ f(x_k),
      { 0   if f(z_k) > f(x_k).

The convergence of the above algorithm was first studied by Baba (1981): this is the
reason for the algorithm being referred to as Baba's algorithm.
Dorea (1983) proved the following result on the rate of convergence of the algorithm
(note that a statement on convergence follows from this result).

Proposition 3.3.2. Set

Δ_0 = μ_n(A(δ)) inf_{z∈A(δ), x∈X∖A(δ)} φ(z − x),     Δ_1 = μ_n(A(δ)) sup_{z∈A(δ), x∈X∖A(δ)} φ(z − x),

where μ_n is the Lebesgue measure, and let Eν_B be the mean number of random vectors z_i,
obtained in Baba's algorithm, required for the sequence {x_k} to attain the set A(δ). Then
Δ_1^{−1} ≤ Eν_B ≤ Δ_0^{−1}, and the quantity Δ_1 attains its minimal value in the case when the
algorithm at hand coincides with Algorithm 3.1.1.
We shall now present a more refined result on the convergence rate of a particular case
of Baba's algorithm, revising some unpublished results of V.V. Nekrutkin; other
asymptotic properties of the algorithm are investigated in Dorea (1986).
Assume that X = [0,1]ⁿ is the unit cube, the rule (3.3.17) is applied, and the transition
density q_k = q has the form

q(z,x) = (1 − α) p_0(x) + α ψ_a(z,x),                    (3.3.18)

where α ∈ [0,1), a ∈ (0,2] are parameters of the algorithm, p_0(x) = 1, x ∈ X, is the uniform
density on X, and furthermore

ψ_a(z,x) = { b(a,z)   if x ∈ D_a(z),
           { 0        otherwise,                    (3.3.19)

where

D_a(z) = X ∩ ([z(1) − a/2, z(1) + a/2] × ... × [z(n) − a/2, z(n) + a/2]),

b(a,z) = 1/μ_n(D_a(z)) is the normalization constant, and z = (z(1),...,z(n)).

(Under fixed z ∈ X, the function (3.3.19) is the uniform density on the set D_a(z), which
is the intersection of the cube X and the cube with centre at the point z and side length
a.)
The choice of the transition density (3.3.18) implies that at each iteration of the
algorithm the uniform distribution on X (with probability 1−α) or the uniform distribution
on D_a(z) (with probability α) is sampled. Note that in two particular cases (for α=0 or
a=2) the algorithm at hand coincides with Algorithm 3.1.1, since the density (3.3.18)
becomes the uniform density on X. Note also that from the definition of b(a,z) it follows
that

a^{−n} ≤ b(a,z) ≤ 2ⁿ a^{−n};                    (3.3.20)

here the left-side relation becomes an equality when each coordinate of the point z lies in
the interval [a/2, 1−a/2], and the right-side inequality does so for the vertices of the cube X.
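The transition rule (3.3.18)-(3.3.19) and the bounds (3.3.20) can be checked directly in code (a sketch; the helper names are ours).

```python
import random

def sample_transition(z, a, alpha, n):
    """One draw from (3.3.18): uniform on X = [0,1]^n with probability
    1 - alpha, uniform on the window D_a(z) with probability alpha."""
    if random.random() < 1.0 - alpha:
        return [random.random() for _ in range(n)]
    return [random.uniform(max(0.0, zi - a / 2.0), min(1.0, zi + a / 2.0))
            for zi in z]

def b(a, z):
    """Normalizing constant of (3.3.19): the inverse volume of D_a(z),
    the intersection of X with the cube of side a centred at z."""
    vol = 1.0
    for zi in z:
        vol *= min(1.0, zi + a / 2.0) - max(0.0, zi - a / 2.0)
    return 1.0 / vol
```

For an interior point the window has full volume aⁿ, while at a vertex of the cube each side is halved, which is exactly where the two bounds in (3.3.20) are attained.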
Before studying the properties of the algorithm, we formulate an auxiliary assertion
well-known in the theory of Markov processes.

Lemma 3.3.1. Let x_1, x_2, ... be a homogeneous Markov chain, τ_C be the Markovian
moment of the first hitting of a set C ⊂ X, E_x τ_C be the expectation of τ_C provided x_1 = x,
A and B be two measurable subsets of X, and

Pr_x{x_{τ_B} ∈ dz}

be the conditional distribution of the vector x_{τ_B} provided x_1 = x. Then the relation

E_x τ_A ≤ E_x τ_B + ∫ Pr_x{x_{τ_B} ∈ dz} E_z τ_A                    (3.3.21)

is valid.
The inequality (3.3.21) shows that the mean time before reaching the set A by a
trajectory of the Markov chain does not exceed the mean time for reaching this set by those
trajectories which have to visit B before reaching A.
Let us take A = D_ε(x*), where x* is the global minimizer, and

B = D_{a−ε}(x*) ∩ {x ∈ X: f(x) ≤ inf_{z∈X∖D_{a−ε}(x*)} f(z)},

where ε (0 < ε < a) defines the required solution accuracy, and estimate the mean number of
iterations needed to reach A by Baba's algorithm with transition densities determined via
(3.3.18) and (3.3.19).
By virtue of the continuity of f in a neighbourhood of x*, there exists a constant β > 0
(depending on f, but independent of a, ε) such that μ_n(B) ≥ β(a−ε)ⁿ. Applying the inequalities

(3.3.20) and (3.3.21), together with the fact that, for each x ∈ B, the value f(x) does not
exceed the values of f outside the set D_{a−ε}(x*), we obtain

sup_{x∈X} E_x τ_B ≤ [(1−α) β (a−ε)ⁿ]^{−1},

sup_{z∈B} E_z τ_A ≤ (1 − α + α a^{−n})^{−1} b(ε,x*) ≤ 2ⁿ/[(1 − α + α a^{−n}) εⁿ].

Define

Φ(ε,a,α) = 1/[(1−α) β (a−ε)ⁿ] + 2ⁿ/[(1 − α + α a^{−n}) εⁿ];                    (3.3.22)

then the obtained estimate for the mean number of iterations to be performed for reaching
the set A = D_ε(x*) can be written as

sup_{x∈X} E_x τ_A ≤ Φ(ε,a,α).

Let us now investigate the rate of increase of Φ(ε,a,α) for ε → 0.
From (3.3.22) it follows that if a and α do not depend on ε, then the order of increase of
Φ equals ε^{−n}, ε → 0, and coincides with the order of increase of the quantity E τ_A for
Algorithm 3.1.1, for which

E_x τ_A ≍ ε^{−n}     for ε → 0.

Alternatively, selecting a, α in dependence on the desired solution accuracy ε, the order of
increase of Φ may be considerably decreased. Indeed, set

α(ε) = α = const, 0 < α < 1,     φ(ε) = a(ε) − ε,

φ(ε) → 0,     φ(ε)/ε → ∞     for ε → 0.

Then

Φ(ε,a(ε),α) = 1/[(1−α) β (φ(ε))ⁿ] + 2ⁿ/[(1−α) εⁿ + α (1 + φ(ε)/ε)^{−n}] ~

~ (φ(ε))^{−n}/[(1−α) β] + (2ⁿ/α) (φ(ε)/ε)ⁿ     for ε → 0.

Hence, for φ(ε) ≍ √ε, ε → 0, the order of increase of Φ(ε,a(ε),α) is ε^{−n/2}: this is optimal,
since for a(ε) → const > 0, for a(ε) → 0, for α(ε) → 1, as well as when the supposition
φ(ε)/ε → ∞ for ε → 0 is omitted, the order ε^{−n/2} gets worse.
Now set α(ε) = α, a(ε) = ε + d√ε and find the optimal values of the constants α and d.
Transforming Φ we obtain

Φ(ε,a(ε),α) ~ [1/((1−α) β dⁿ) + 2ⁿ dⁿ/α] ε^{−n/2}     for ε → 0.

Thus the problem of the (approximately) optimal selection of the constants α and d is
reduced to the minimization problem of the function

Ψ(α,y) = [(1−α) β y]^{−1} + 2ⁿ y/α,

where y = dⁿ, on the set α ∈ (0,1), y > 0. Setting the partial derivatives of the
function Ψ to zero, we get a unique solution of the resulting simultaneous equations:
α = 0.5, y = 2^{−n/2} β^{−1/2}. Hence it follows that the quasi-optimal values of the parameters
of the algorithm are α = 0.5,

a = a(ε) = ε + β^{−1/2n} √(ε/2);

under such a parametrization the estimate below is valid:

sup_{x∈X} E_x τ_{D_ε(x*)} ≤ Φ(ε,a(ε),α) ≈ 2^{n/2+2} β^{−1/2} ε^{−n/2},     ε → 0.
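The trade-off expressed by (3.3.22) is easy to explore numerically; the sketch below takes n = 2 and β = 1 (both assumptions) and compares fixed parameters with the quasi-optimal parametrization derived above.

```python
def phi(eps, a, alpha, n=2, beta=1.0):
    """The bound (3.3.22) on the mean number of iterations to reach
    D_eps(x*); beta is the constant with mu_n(B) >= beta * (a - eps)^n."""
    return (1.0 / ((1.0 - alpha) * beta * (a - eps) ** n)
            + 2.0 ** n / ((1.0 - alpha + alpha * a ** (-n)) * eps ** n))

def phi_opt(eps, n=2, beta=1.0):
    # quasi-optimal parameters: alpha = 0.5, a = eps + beta**(-1/(2n)) * sqrt(eps/2)
    a = eps + beta ** (-1.0 / (2 * n)) * (eps / 2.0) ** 0.5
    return phi(eps, a, 0.5, n, beta)
```

Shrinking the accuracy from ε = 10⁻⁴ to 10⁻⁶ multiplies the optimized bound by roughly (10⁻⁶/10⁻⁴)^{−n/2} = 100, against a factor 10⁴ for fixed parameters, in line with the ε^{−n/2} versus ε^{−n} orders.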

3.3.6 The case of high dimension

Assume that the dimension n of X is high. Although in this situation the usual versions of
random search algorithms are generally much simpler than the deterministic algorithms,
they are, nevertheless, rather laborious, because n-dimensional random vectors have to be
sampled at each iteration. The relative numerical demand of the algorithms of this section
for large n is very modest: only two random variables are to be sampled at each k-th
iteration (k > 1) of these algorithms.
The approach below can be regarded as a modification of the method considered in
Section 3.3.2 and consists of sampling a homogeneous Markov chain with a given
stationary density.
Choose a nonnegative function Ψ defined on X such that the univariate densities
proportional to any one-dimensional cross-section of the function are easily sampled, and
the transition probabilities Q_k(z,·) as follows:

Q_k(z,dx) = Q(z,dx) = c(z) ∫ T(z,dt) Ψ((x,t)) dx,                    (3.3.23)

where T(z,dt) for each z ∈ X is a probability distribution on the set of lines passing
through the point z, namely

T(z,dt) = ∑_{i=1}^n q_i o(z, dt − t_i),                    (3.3.24)

where

q_i > 0 (i = 1,...,n),     ∑_{i=1}^n q_i = 1,                    (3.3.25)

o(z, dt − t_i) is the distribution concentrated on the line passing through z and parallel to the
vector t_i; {t_1,...,t_n} is a set of n-dimensional linearly independent vectors; (x,t) denotes
the linear coordinate of a point x on the line t (the projection of x on t); and c(z) for each z
is a normalization constant. The transition probability (3.3.23) is sampled by applying the
superposition technique: first, a random line passing through z and parallel to one of the
vectors t_i is chosen by sampling from the distribution induced by the probabilities
(3.3.25), and then a one-dimensional distribution is sampled on the line, with density
proportional to Ψ.
Note that the most natural way of choosing the q_i is q_i = 1/n (i = 1,...,n), while the
vector set {t_1,...,t_n} can be chosen as the set of the coordinate vectors. Note also that the
above way of choosing the transition probability Q(z,·) draws on the idea of Turchin (1971).

Proposition 3.3.2. Let the function Ψ be non-negative and piecewise continuous, the
set X' = {x ∈ X: Ψ(x) > 0} be connected, Ψ(x_1) > 0, and let a homogeneous Markov chain
be sampled with the transition probability (3.3.23) for which (3.3.24) is satisfied. Then the
sampled Markov chain has a stationary distribution whose density is proportional to Ψ.

Proof. Under the above formulated conditions, the transition probability (3.3.23) meets the
Doeblin condition, i.e. there exist a natural number m, two real numbers ε_1 ∈ (0,1) and
ε_2 > 0, and a probability measure ν(dx) on (X,𝔅) such that for x ∈ X', ν(A) ≥ ε_1, A ⊂ X',
implies Q^m(x,A) ≥ ε_2. Indeed, if one chooses

ν(dx) = Ψ(x)dx / ∫ Ψ(z)dz,

then the Doeblin condition means that the density q^m(z,x) of the probability of transition
from any point z ∈ X' to any point x ∈ X' in m steps is positive, which is true (also) for
m = n. Since the Doeblin condition is met, the Markov chain under study has a stationary
distribution with density p(x) which is the unique solution (in the class of all
probability densities) of the integral equation

p(x) = ∫ p(y) q(y,x) dy.

The fact that the normalized Ψ is a solution of this equation follows from Turchin (1971):
this proves the proposition.
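The construction can be illustrated with a random-scan sampler on the unit square for the (assumed) function Ψ(u,v) = u + v, whose one-dimensional cross-sections are linear and hence invertible in closed form; the long-run average of the first coordinate should then approach its stationary mean ∫∫ u(u+v) du dv = 7/12 ≈ 0.583.

```python
import math, random

def line_sample(v0):
    """Inverse-transform sample from the 1-D density on [0, 1]
    proportional to the cross-section psi(u, v0) = u + v0."""
    r = random.random()
    return -v0 + math.sqrt(v0 * v0 + r * (1.0 + 2.0 * v0))

def chain(n_steps=20000, seed=5):
    """Sampler of the type (3.3.23)-(3.3.25) for psi(u, v) = u + v on
    the unit square: pick one of the two coordinate directions with
    q_i = 1/2, then sample the 1-D conditional density on that line."""
    random.seed(seed)
    u, v = 0.5, 0.5
    total = 0.0
    for _ in range(n_steps):
        if random.random() < 0.5:
            u = line_sample(v)
        else:
            v = line_sample(u)
        total += u
    return total / n_steps
```

Only one-dimensional distributions are ever sampled, which is the source of the modest per-iteration cost claimed above.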
Sampling of the one-dimensional densities will be especially simple if constant or
piecewise constant functions are used as Ψ. If exp{−λf(x)} or another function connected
with f is used as Ψ, then the stationary distribution density will be proportional to it, but
one may face a hard sampling problem for the one-dimensional distributions: if sampling
relies on the rejection technique with a constant majorant, then the efficiency of the
resulting optimization algorithm for f will not differ appreciably from that of Algorithm
3.1.1. If one succeeds in employing more efficient sampling methods (e.g. that of inverse
functions, if the cross-sections of Ψ along some directions have readily sampled forms, or
the rejection technique with good majorants), then the efficiency of the algorithm
derived may be significantly superior to that of Algorithm 3.1.1. These devices work only
when one knows the analytical form of f; otherwise the procedure described below can be
used. This procedure relies upon the fact that under certain conditions an appropriately
chosen subsequence of a stationary Markov sequence with stationary density proportional
to Ψ may be regarded as a stationary Markov chain with stationary density proportional to
w (exp{−λf} or any other function related to f may be used as w).
Before formulating the procedure we shall introduce several notations and prove an
auxiliary assertion. Let (X,𝔅) be a measurable space, P_1 be a probability distribution on
it, Q(z,dx) be a Markov transition probability, and P(z,dx) = (1 − g(z)) Q(z,dx), where
0 ≤ g ≤ 1 (g(z) is the probability of termination of the Markov chain with transition
probability P(z,dx) at a point z ∈ X); let τ be the termination moment of the Markov chain
x_1, x_2, ... with initial distribution P_1 and transition probability P(z,dx), i.e.

τ = min{m ≥ 1: u_m ≤ g(x_m)},                    (3.3.26)

where x_1 has distribution P_1, x_i (i = 2,3,...) has distribution Q(x_{i−1},·), and the u_j are
random numbers.
Main Concepts and Approaches 107

Below, the minimal positive solution of an integral equation, i.e. the one obtained by the method of successive approximations, is taken as the solution. For example, for the integral equation

G(dx) = ∫ G(dz)P(z,dx) + P_1(dx)                (3.3.27)

the solution is represented as the measure

G(dx) = Σ_{m=1}^∞ P_m(dx),

where for m = 2, 3, ...

P_m(dx) = ∫ P_1(dz)P^{m-1}(z,dx),

P^{m-1} denoting the (m-1)-step transition probability. Note that the minimal positive solution always exists and is unique.

Theorem 3.3.1. With the above notation, the following statements hold:

1. Pr{τ < ∞} = ∫ g(x)G(dx).

2. If Pr{τ < ∞} = 1, then F(dx) = g(x)G(dx) is the probability distribution of the random vector x_τ, where x_1, ..., x_τ are random vectors connected into the Markov chain with initial distribution P_1 and transition probability P.

3. If Pr{τ < ∞} = 1, then

Eτ = G(X).                (3.3.28)

4. If G(X) < ∞, then Pr{τ < ∞} = 1.
Proof.

1. We have

Pr{τ < ∞} = Σ_{m=1}^∞ p_m,

where

p_m = ∫ g(x)P_m(dx)                (3.3.29)

stands for the probability of termination of the Markov chain at the m-th step. Thus

Pr{τ < ∞} = Σ_{m=1}^∞ ∫ g(x)P_m(dx) = ∫ g(x) Σ_{m=1}^∞ P_m(dx) = ∫ g(x)G(dx).

2. For any A ∈ B the following holds:

P{x_τ ∈ A} = Σ_{m=1}^∞ Pr{g(x_1) < α_1, ..., g(x_{m-1}) < α_{m-1}, g(x_m) ≥ α_m, x_m ∈ A} =

= Σ_{m=1}^∞ ∫ (1 - g(x_1))···(1 - g(x_{m-1})) g(x_m) P_1(dx_1)Q(x_1,dx_2)···Q(x_{m-1},dx_m) 1_A(x_m) =

= Σ_{m=1}^∞ ∫ P_m(dx)g(x)1_A(x) = ∫_A g(x)G(dx) = F(A).

Since Pr{τ < ∞} = 1, it follows from the first assertion and from the fact that G is a measure that F is the probability distribution of the random vector x_τ.
3. By definition, Eτ = Σ_{m=1}^∞ m p_m, where p_m is defined by (3.3.29). Determine now the expression for b_m = P_m(X). Since b_1 = P_1(X) = 1, we have b_m = 1 - Σ_{k=1}^{m-1} p_k; but Pr{τ < ∞} = Σ_{k=1}^∞ p_k = 1, and therefore b_m = Σ_{k=m}^∞ p_k. Compute now

G(X) = ∫ G(dx) = Σ_{m=1}^∞ b_m = Σ_{m=1}^∞ Σ_{k=m}^∞ p_k = Σ_{k=1}^∞ Σ_{m=1}^k p_k = Σ_{k=1}^∞ k p_k = Eτ.
4. Integrating (3.3.27) we have

G(X) = ∫∫ G(dz)P(z,dx) + 1,

which is equivalent to

G(X) = ∫ G(dx) - ∫ g(x)G(dx) + 1,

whence, taking into consideration the first assertion and the fact that G(X) < ∞, we get Pr{τ < ∞} = 1. The theorem is proved.
Notably, if the Markov chain x_1, x_2, ... starts at a fixed point x ∈ X, i.e. if x_1 = x with probability one, then the measure G corresponds to the potential kernel of the chain in potential theory, cf. Revuz (1975). In this theory (3.3.28) was known for g(x) = 1_A(x), A ⊂ X.
Theorem 3.3.1 can serve (and did serve for the author) as a basis for the creation and
investigation of sampling algorithms for a wide range of probability distributions. The
approach is thoroughly described in Zhigljavsky (1985).
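The identities of Theorem 3.3.1 are easy to check by simulation on a toy chain (an illustrative sketch with choices of ours, not from the text): take X = [0,1], P_1 and Q(z,·) both uniform on [0,1], and g(x) = x. Then P_m(X) = 2^{-(m-1)}, so Eτ = G(X) = 2, and x_τ has density g(x)·2 = 2x with mean 2/3.

```python
import random

rng = random.Random(1)

def run_chain(rng):
    """One chain: x_1 ~ U(0,1); terminate at the first i with g(x_i) >= alpha_i,
    where g(x) = x; otherwise move according to Q(z, .) = U(0,1)."""
    tau = 0
    while True:
        tau += 1
        x = rng.random()        # x_1 ~ P_1 and every Q-step are all U(0,1) here
        if rng.random() <= x:   # termination test g(x_i) >= alpha_i
            return tau, x

taus, finals = [], []
for _ in range(40000):
    t, x = run_chain(rng)
    taus.append(t)
    finals.append(x)

mean_tau = sum(taus) / len(taus)    # theory (3.3.28): E tau = G(X) = 2
mean_x = sum(finals) / len(finals)  # theory: x_tau has density 2x, mean 2/3
```

Both sample averages settle on the values predicted by statements 2 and 3 of the theorem.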
Denote by ξ_1, ξ_2, ... a homogeneous Markov chain with an initial distribution F_0 and Markov transition probability Q(z,·). Set P(z,dx) = (1 - g(z))Q(z,dx), where g is a function on X, 0 ≤ g ≤ 1, τ_0 = 0, and let τ_k (k = 1, 2, ...) be the successive moments m at which g(ξ_m) ≥ α_m, where the α_j are random numbers. According to Theorem 3.3.1, the random vectors ξ_{τ_k} (k = 1, 2, ...) have distributions F_k representable as F_k(dx) = g(x)G_k(dx) in terms of G_k, the solution of the equation

G_k(dx) = ∫ G_k(dz)P(z,dx) + F_{k-1}(dx).                (3.3.30)

It is proved below that, under some assumptions, for any initial distribution F_0 the consecutive approximations (3.3.30) weakly converge to the solution of the equation

G(dx) = ∫ G(dz)P(z,dx) + F(dx),                (3.3.31)

where F(dx) = g(x)G(dx).
Denote by v_0 a bounded function that is continuous on X, and introduce a sequence of functions v_k (k = 1, 2, ...) induced by v_0, which are the solutions of the corresponding integral equations

v_k(z) = ∫ P(z,dx)v_k(x) + g(z)v_{k-1}(z).                (3.3.32)

Proposition 3.3.4. Let X be a compact subset of R^n; let, for some m ≥ 1, the m-step transition probability P^m(z,dx) be strictly positive (i.e. P^m(z,O) > 0 for any z ∈ X and each open hyperrectangle O ⊂ X); let the method of successive approximations for (3.3.32) converge in the metric of C(X) for any function v_{k-1} continuous on X; and let the family of functions {v_k} be equicontinuous. Then F(dx) = g(x)G(dx), where G is the solution of (3.3.31), is a probability distribution, and the sequence of distributions F_k(dx) = g(x)G_k(dx) (k = 1, 2, ..., the G_k being the solutions of (3.3.30)) weakly converges to F(dx).

Proof. Assume first that m = 1.
Let v_0 be an arbitrary continuous function on X. We shall show that M_0 = max v_0 ≥ max v_1 = M_1. Indeed, from (3.3.32) it follows that

v_1(z) ≤ M_1 ∫ P(z,dx) + g(z)v_0(z) = M_1(1 - g(z)) + g(z)v_0(z),

whence v_1(z) - M_1 ≤ g(z)(v_0(z) - M_1) for all z ∈ X. Passing to maxima, we obtain M_1 ≤ M_0 (in doing so, it is evident that g(x_1*) > 0, where x_1* is a maximizer of v_1). From the strict positiveness and the continuity of v_0 and v_1 we obtain that M_1 = M_0 if and only if v_1(x) ≡ M_1 (and, consequently, v_0(x) ≡ M_1).
In virtue of the Arzelà theorem there exists a subsequence v_{k_m} of the sequence v_k that uniformly converges to a continuous function u_0. Then v_{k_m+1} converges to u_1, the transformation (3.3.32) of u_0. Since the numerical sequence max v_k is monotone and bounded,

M = lim_{k→∞} max v_k

exists. Since the convergence to u_0 and u_1 is uniform, max u_0 = max u_1 = M and, therefore, u_0 ≡ u_1 ≡ M. The limit is thus independent of the choice of v_{k_m}; hence v_k → M uniformly as k → ∞.
Now we shall prove that

∫ v_{k-1}(x)F_k(dx) = ∫ v_k(x)F_{k-1}(dx)                (3.3.33)

is valid. Indeed, by definition we have the chain of relations

∫ v_{k-1}(x)F_k(dx) = ∫ v_{k-1}(x)g(x)G_k(dx) =

= ∫ [v_k(x) - ∫ P(x,dz)v_k(z)] G_k(dx) =

= Σ_{m=1}^∞ ∫∫ F_{k-1}(dz)P^{m-1}(z,dx)v_k(x) - Σ_{m=1}^∞ ∫∫ F_{k-1}(dz)P^m(z,dx)v_k(x) = ∫ v_k(x)F_{k-1}(dx).

Let now F_1 be an arbitrary probability distribution on (X,B) and v_1 an arbitrary continuous function on X. From (3.3.33) it follows that

∫ v_1(x)F_k(dx) = ∫ v_k(x)F_1(dx), k = 1, 2, ...;

but since v_k → M uniformly,

∫ v_k(x)F_1(dx) → M, k → ∞,

in virtue of the Lebesgue theorem on dominated convergence. Thus, we see that ∫ v_1(x)F_k(dx) converges as k → ∞ for any continuous function v_1, whence it follows by Feller (1966), Section 1 of Ch. 8, that there exists a probability measure F that is a weak limit of {F_k}. The validity of (3.3.31) now obviously follows from (3.3.30), and the proposition is proved for the case m = 1.
In order to pass to arbitrary m, apply the above assertion m times, substituting P^m(z,·) and F_i(dx) (i = 1, ..., m), respectively, for P(z,·) and F_1(dx). The limit of the sequences

{F_{mk+i}(dx)}, k = 0, 1, ...,

will be the same and, therefore, {F_k} also converges to this limit. The proof is complete.
Proposition 3.3.4 is a generalization of the ergodic theorem for Markov chains from Feller (1966) to the case m ≥ 1 and a function g that is not identically equal to 1.
Rewrite now (3.3.31) as

(1 - g(x))G(dx) = ∫ [(1 - g(z))G(dz)]Q(z,dx).

By assumption, the transition probability Q(z,·) is chosen so that all solutions of this equation have the form

(1 - g(x))G(dx) = cΨ(x)dx for x ∈ X, = 0 otherwise,                (3.3.34)

where c is an arbitrary positive constant.
Let us require that the stationary distribution F of the sequence ξ_{τ_k} be proportional to w, i.e. that the relation

F(dx) = g(x)G(dx) = w(x)dx / ∫ w(z)dz for x ∈ X, = 0 otherwise,                (3.3.35)

be satisfied. Then from (3.3.34) and (3.3.35) it follows that

g(x) = w(x)/[w(x) + bΨ(x)].                (3.3.36)

Although the constant

b = c ∫ w(z)dz

may be taken arbitrarily, numerical calculations indicate that

b ≈ ∫ w(z)dz / ∫ Ψ(z)dz                (3.3.37)

assures the most rapid convergence of ξ_{τ_k} to the stationary distribution F.


Assume now that exact equality holds in (3.3.37). Then the efficiency of the resulting optimization algorithm depends on how frequently the inequality g(ξ_i) ≥ α_i is fulfilled. The probability of this event asymptotically equals

p = ∫ g(x)Ψ(x)dx / ∫ Ψ(z)dz = ∫ ( [w(x)/∫w(z)dz] · [Ψ(x)/∫Ψ(z)dz] ) / ( [w(x)/∫w(z)dz] + [Ψ(x)/∫Ψ(z)dz] ) dx.

If the behaviour of Ψ resembles that of w, then p is close to one and the algorithmic efficiency is high. Otherwise (if Ψ is small where w is large, and large where w is small), both p and the efficiency are low. The profile of Ψ should thus be close to that of w. If prior information concerning the behaviour of f is missing, then the function should be estimated in the course of optimization.
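The construction (3.3.30)-(3.3.37) can be sketched in a few lines (all concrete choices below are ours, for illustration): take Ψ ≡ 1 on X = [0,1] and Q(z,·) uniform on [0,1], whose invariant density is indeed proportional to Ψ; let the target be w(x) = x, so that b = ∫w/∫Ψ = 1/2 and g(x) = x/(x + 1/2) by (3.3.36)-(3.3.37). Following the recursion (3.3.30), in which each new segment of the chain has the previously accepted point as its initial distribution, the sketch re-tests the accepted point before moving; under this reading the accepted points have stationary density proportional to w, here 2x on [0,1].

```python
import random

rng = random.Random(2)

def g(x, b=0.5):
    # g = w / (w + b * Psi) with w(x) = x, Psi = 1, b chosen by (3.3.37)
    return x / (x + b)

accepted = []
z = rng.random()                 # current point of the chain
for _ in range(60000):
    if rng.random() <= g(z):     # acceptance test g(xi_m) >= alpha_m
        accepted.append(z)       # record xi_{tau_k}; new segment starts at z
    else:
        z = rng.random()         # move by Q(z, .) = U(0,1), invariant ~ Psi

mean_w = sum(accepted) / len(accepted)  # should approach 2/3, the mean of 2x
```

The empirical mean of the accepted subsequence approaches 2/3, the mean of the density 2x proportional to w, as (3.3.35) predicts.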
The above algorithm can be generalized to the situation where evaluations of w(x) (in particular, we can use w(x) = -f(x) + const) are subject to a random noise η(x), with w + η ≥ 0 a.s. In this case the analogue of (3.3.36) is

g(x) = [w(x) + η(x)]/[w(x) + η(x) + Ψ(x)],

and the subsequence ξ_{τ_k} has a stationary distribution proportional to the function

h(x) = {E[(w(x) + η(x) + Ψ(x))^{-1}]}^{-1} - Ψ(x)

rather than to w(x). The behaviour of this function is related to some extent to that of w(x). Indeed, let the distribution of the random variable η(x) be independent of x, and let two points x, z ∈ X be chosen such that Ψ(x) = Ψ(z) and w(x) > w(z); then simple calculations give h(x) > h(z).
CHAPTER 4. STATISTICAL INFERENCE IN GLOBAL RANDOM
SEARCH

This chapter considers various approaches to the construction and investigation of global
random search algorithms based on mathematical statistics procedures. It appears that
many algorithms of this chapter are both practically efficient and theoretically justified.
Their main theoretical feature is that convergence of the algorithms studied does not hold
in the deterministic sense or even in probability, but is valid with some reliability level
only. Thus, the algorithms can miss a global optimizer, but the probability of this
unfavourable event is under control and may be guaranteed to be arbitrarily small.
The chapter contains six sections.
Section 4.1 is auxiliary and specifies the ways of applying mathematical statistics
procedures for constructing global random search algorithms.
Section 4.2 treats statistical inference concerning the optimal value of the function f on the basis of its values at independent random points; much attention is paid to the issue of increasing accuracy by using prior information about the objective function.
Section 4.3 describes a general class of global random search methods which use
statistical procedures; furthermore, it generalizes the well-known family of the branch and
bound methods, permitting inferences that are accurate only with a given reliability (rather
than exactly).
Section 4.4 modifies the statistical inference of Section 4.2 for the case in which the points where f is evaluated are elements of a stratified sample rather than an independent one. It also demonstrates the gains implied by stratification, and shows that decreasing randomness, where possible, generally leads to increased efficiency.
Section 4.5 presents statistical inference in the random multistart method, which was described in Section 2.1 and serves as a basis for a number of efficient global random search algorithms; the statistical inferences can be applied for its control and modification.
Finally, Section 4.6 describes statistical testing procedures for distribution quantiles based on the so-called distributions neutral to the right. They can optionally be applied for checking the accuracy in a number of global random search algorithms, as well as for the construction of particular algorithms of the branch and probability bound type.
As earlier, the feasible region X is supposed to be a compact subset of R^n having a sufficiently simple structure, and the objective function f can be evaluated at any point of X without noise. In contrast with the preceding chapters, the maximization problem for f is considered rather than the minimization one. (This decision serves for an easier application of the related results of mathematical statistics; note that the transcription between minimum and maximum problems is obvious.)

4.1 Some ways of applying statistical procedures to construct global random search algorithms

There is a wide scope of sophisticated results in mathematical statistics that are applicable to the construction of global random search algorithms. The number of works containing attractive mathematical results is comparatively small: most of them are related to the topics of Sections 4.2 - 4.5. The aim of this auxiliary section is to review some related results and applications in other fields.

4.1.1 Regression analysis and design

Among the possible applications of design and analysis of experiments in linear regression, of considerable interest are those dealing with the construction of local search algorithms in the case when evaluations of the objective function are subject to a random noise. They concern the theory of extremal experimental design and rely on the supposition that evaluations of the objective function f are always made in the vicinity of points where the function can be rather well approximated by a first- or second-order polynomial whose unknown parameters are estimated. This operation is aimed at determining the direction of further evaluations of f. Section 8.1 will deal with this problem.
Some attempts (for instance, see Chichinadze (1969), Hartley and Pfaffenberger (1971)) at using linear regression analysis methods for the construction of global search algorithms were based on the formal construction of linear relations between some characteristics (such as the values of the objective function at randomly selected points and the number of evaluations). The lack of a constructive description of the class of objective functions for which the postulated linear relation is valid with an acceptable accuracy is the basic obstacle to the theoretical investigation of these algorithms.
The methods of nonlinear regression analysis are of practical interest if the objective function is evaluated with a random noise and is acceptably approximated by a regression function that depends nonlinearly on some unknown parameters. In principle, these methods are applicable to the situation discussed above, when

M_Z = sup_{x∈Z} f(x)                (4.1.1)

is one of the unknown parameters of the nonlinear regression model describing the cumulative distribution function (c.d.f.)

F_Z(t) = ∫_{f(x)<t} P(dx),                (4.1.2)

where P is a probability distribution concentrated on the set Z ⊂ X. Here, practically unsolvable theoretical difficulties are related to the constructive description of proper objective function classes, similarly to the above case.
Applications of non-parametric regression analysis to global random search methodology are interesting in the presence of random noise when evaluating f. First, the non-parametric estimation of f or some related function(s) underlies the approach developed in Chapter 5. Second, non-parametric regression estimates of f can be used for its thorough investigation, with a view to determining points for subsequent evaluations of f. If, in doing so, the regression function is repeatedly estimated, then the regression should be estimated with higher accuracy in those subsets of X where the previous estimate of f (and, plausibly, f itself) is larger, i.e. where the probability of locating a global maximizer is larger. To do so, one may take as the loss function for regression estimation a quadratic function with weight proportional to exp{λf̂(x)}, where λ > 0 and f̂ is the preceding estimate of f.


Section 8.3 will be devoted to some aspects of the theory of non-parametric estimation
of a multivariate regression.

4.1.2 Cluster analysis and pattern recognition


The prominent role of cluster analysis in global random search methodology is connected with the so-called candidate points methods discussed in Section 2.1.3. These are variants of the famous multistart method that are practically efficient for a rather wide class of multiextremal problems and use cluster analysis algorithms to distinguish the regions of attraction of local optimizers.
Another idea underlying the use of pattern recognition and cluster analysis methods in the construction of global random search algorithms is the following. Suppose we have a sufficiently representative sample x_1, ..., x_N and the corresponding values f(x_i), i = 1, ..., N. By identifying clusters (or recognizing patterns) in the set X × f(X), we want to determine the subsets of X where the objective function f takes sufficiently high values. These algorithms are essentially intended for the identification of promising subsets (usually represented as spheres), where the estimate of the minimal, maximal, or mean value of the objective function is regarded as the prospectiveness criterion (see its precise definition in Section 4.2).

4.1.3 Estimation of the cumulative distribution function, its density, mode and level surfaces

The cumulative distribution functions (4.1.2) and the techniques for their estimation are of considerable importance in global random search theory. One should bear in mind that, since the behaviour of (4.1.2) is of primary interest for those values of t for which F(t) is close to one, it is unreasonable to estimate F(t) for all t. This fact determines the specific character of the problem: various solution approaches will be considered in Section 4.2 and in Chapter 7.
Kernel (non-parametric) density estimates are useful for global random search theory. Two reasons for this are: distributions whose densities are kernel estimates can be sampled without evaluating the density at points, and the properties of kernel estimates are well studied.
Using kernel estimates as an example, we shall demonstrate how non-parametric density estimates can support the construction of global random search algorithms in the case of possible presence of a random noise and of high labour consumption of the function evaluations. Assume that we are given a sample {x_1, ..., x_N} from a distribution with density p(x) and that y(x) = f(x) + ξ(x) ≥ c_1 ≥ 0 with probability 1, where Eξ = 0. Choose another density φ(x) on R^n and consider the densities (kernels)

φ_β(x) = β^{-n} φ(x/β), β > 0,

induced by it.
Let us demonstrate (for strict proofs see Section 5.2) that for large N the density
q_β(x) = Σ_{i=1}^N [ y(x_i) / Σ_{j=1}^N y(x_j) ] φ_β(x - x_i)                (4.1.3)

is a close estimate of the density

r_β(x) = ∫ r(z)φ_β(x - z)dz,                (4.1.4)

which is the smoothed version of the density

r(x) = f(x)p(x) / ∫ f(z)p(z)dz.                (4.1.5)

Indeed, for all x ∈ R^n and N → ∞ the following relations hold (in virtue of the law of large numbers):

(1/N) Σ_{i=1}^N f(x_i) → ∫ f(x)p(x)dx,

(1/N) Σ_{i=1}^N ξ(x_i) → 0,

(1/N) Σ_{i=1}^N f(x_i)φ_β(x - x_i) → ∫ f(z)φ_β(x - z)p(z)dz,

(1/N) Σ_{i=1}^N ξ(x_i)φ_β(x - x_i) → 0.

If the points x_i (i = 1, ..., N) with weights

y(x_i) / Σ_{j=1}^N y(x_j)

are regarded as the points z_i of a sample from the distribution with density (4.1.5), which is true asymptotically for N → ∞, then (4.1.3) is the kernel estimate
N^{-1} Σ_{i=1}^N φ_β(x - z_i)

of the density r. This estimate reflects the features of f. For example, if p is the density of the uniform distribution on X, then r(x) = f(x)/∫f(z)dz, i.e. r is proportional to f. The smoothed density (4.1.4) also has this property under a reasonable choice of β and sufficient smoothness of f. The choice of β is discussed in the corresponding literature (cf., for instance, Devroye and Györfi (1985)). Essentially, the smaller β is, the higher is the accuracy of the kernel estimate near the points x_i, and the less regular is the curve corresponding to this estimate. Roughly speaking, β should not be unduly small, and its choice should be coordinated with the choice of N.
Therefore the density (4.1.3) may be expected to reflect the basic behavioural aspects of f (under good behaviour of f, large N, and reasonable values of β), i.e. to be large wherever f itself is large and small wherever f is small. By studying (4.1.3) we are thus studying (approximately) the behaviour of the objective function. The density (4.1.3) can be studied not by evaluating it at points (this may be rather difficult because of the prohibitive number of points to be evaluated), but rather by drawing a suitable sample from the distribution with this density. The distribution with density (4.1.3) can be sampled by means of the superposition method and presents no difficulties in principle.
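The superposition (composition) method for the mixture (4.1.3) is immediate to sketch (the concrete choices here are ours: Gaussian kernel φ, uniform p on [0,1], noiseless y = f with f a bump at 0.7): pick component i with probability y(x_i)/Σ y(x_j), then perturb x_i by the kernel.

```python
import math
import random

rng = random.Random(3)

f = lambda x: math.exp(-20.0 * (x - 0.7) ** 2)   # illustrative objective, f >= 0

N = 3000
xs = [rng.random() for _ in range(N)]            # sample from p = U(0,1)
ys = [f(x) for x in xs]                          # noiseless evaluations y_i

beta = 0.05                                      # kernel width
draws = []
for _ in range(20000):
    # Superposition: component i with prob. y_i / sum_j y_j, then a phi_beta shift
    i = rng.choices(range(N), weights=ys)[0]
    draws.append(xs[i] + rng.gauss(0.0, beta))

mean_q = sum(draws) / len(draws)                 # mass concentrates near the bump at 0.7
```

No evaluation of q_β itself is needed: each draw costs one weighted choice and one kernel perturbation.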
Having a sample {x_1, ..., x_N} from a distribution with density p, one can estimate the mode (i.e. the maximizer of p) and the level surfaces of this density, i.e. the sets

{x ∈ X: p(x) = h}.                (4.1.6)

One of the most convenient mode estimation techniques is the following. Define a numerical sequence {m(N)} so that the asymptotic relations

m(N) = o(N), (N log N)^{1/2}/m(N) = o(1)

hold for N → ∞. Select the ball of least radius among those having their centres at the sample points and containing at least m(N) sample points. Then the mode of the distribution with density p is estimated by the vector of average coordinates of the points inside this ball. For estimating the mode domain Ω = {x ∈ X: p(x) ≥ h}, bounded by the surface (4.1.6), one should take in the above procedure the minimal-volume convex envelope of m(N) sample points instead of the minimal-radius ball. The above and similar procedures have been thoroughly investigated by Sager (1983) for the unimodal density case. It seems that some of the results can be generalized to the case when the density belongs to certain classes of multiextremal functions (in particular, small local maxima far from the global maximizer have no influence). To reduce possible errors and information losses, the above procedures should be applied simultaneously to several subdomains, say, by separating clusters at the beginning.
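The ball procedure above is a few lines of code in one dimension (an illustrative setup of ours: a Gaussian sample with mode 0.5 and m(N) = N^{3/4} rounded, which satisfies m(N) = o(N) and (N log N)^{1/2}/m(N) = o(1)):

```python
import random

rng = random.Random(4)

N = 800
xs = [rng.gauss(0.5, 0.1) for _ in range(N)]   # unimodal density, mode at 0.5
m = round(N ** 0.75)                           # m(N): o(N), but beats sqrt(N log N)

best_radius, best_ball = float("inf"), None
for c in xs:                                   # candidate centres are sample points
    dists = sorted((abs(x - c), x) for x in xs)
    radius = dists[m - 1][0]                   # smallest radius holding m points at c
    if radius < best_radius:
        best_radius = radius
        best_ball = [x for _, x in dists[:m]]  # the m points inside the chosen ball

mode_estimate = sum(best_ball) / m             # average coordinate of those points
```

The brute-force O(N^2 log N) scan is enough for a sketch; in higher dimensions one would replace `abs(x - c)` by a Euclidean distance and average coordinatewise.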

4.1.4 Statistical modelling (Monte Carlo method)

There are several basic connections between the theory and methods of statistical modelling and global random search. First, since global random search algorithms are based on sequential sampling from probability distributions, sampling algorithms are a substantial part of them: the theory of global random search generates in this case specific sampling problems. For example, the construction of random search algorithms for extremal problems with equality constraints is possible only if there exist suitable sampling algorithms for probability distributions over surfaces. The problem of sampling on surfaces will be discussed in Section 6.1. Other sampling problems arise, e.g., when optimizing in functional spaces, see Section 6.2.1.
Second, that part of statistical modelling theory which deals with the optimal organization of random sampling for the estimation of multidimensional integrals is of importance also in the theory of global random search. Of special interest in this case is the problem of simultaneous estimation of several integrals, which is exemplified as follows. Let the distribution sequence {P_k} weakly converge for k → ∞ to a probability measure concentrated at the global maximizer x* (the construction of such {P_k} is described in Section 2.3.3 and Chapter 5). For sufficiently large k, the estimates of ∫f(x)P_k(dx) and ∫x^{(j)}P_k(dx) (j = 1, ..., n) are, respectively, estimates of max f and x*^{(j)}, where x^{(j)} and x*^{(j)} are the j-th coordinates of the points x and x*. One more integral, the normalization constant of the distribution P_k, is added to the above n+1 integrals. Moreover, simultaneous estimation of several integrals using the same sample underlies a number of non-parametric regression estimation methods (see Section 8.3). Finally, many useful characteristics of the objective function (e.g. mean values on subsets of X) are representable as linear integral functionals of either the objective function itself or of functions/measures closely related to it (functionals of the form (5.3.6) or functionals of estimates of f).
Third, various procedures created within the framework of Monte Carlo theory in an effort to reduce the variance of linear integral functional estimates are used for the construction of optimization algorithms. For example, Section 4.4 considers stratified sampling, while Section 8.2 will analyse importance sampling. Let us demonstrate that dependent sampling procedures may also be used in search algorithms.
Let X ⊂ R, and let the objective function f be sufficiently smooth and subject to random noise: f(x) = Ef(x,ω), where f(x,ω) is the result of a simulation trial conducted at a point x ∈ X under the random occurrence ω that in practice is a collection of random numbers (i.e. realizations of the random variable uniformly distributed over the interval [0,1]). Let the derivative f'(x_0) at a point x_0 ∈ X be estimated via the approximations

f'(x_0) ≈ (f(x_0 + h) - f(x_0 - h))/(2h),                (4.1.7)

f(x_0 + h) ≈ (1/N) Σ_{i=1}^N f(x_0 + h, ω_i),                (4.1.8)

f(x_0 - h) ≈ (1/N) Σ_{i=1}^N f(x_0 - h, ω'_i),                (4.1.9)
where h is a small value and N is an integer (the step size and the sample size, respectively).
In ordinary simulation experiments, all random elements ω_i and ω'_i are different, and the error of the estimator of f'(x_0) arises from the error of the approximation (4.1.7) (which is of order h²|f'''(x_0)| for h → 0) and the error due to the randomness of (4.1.8) and (4.1.9). The latter error is estimated by the value γσh^{-1}N^{-1/2}, where γ is determined by the significance level of the estimate and σ² is the variance of the estimate f(x_0,ω) of f(x_0). Thus, one can refer to

h²|f'''(x_0)| + γσh^{-1}N^{-1/2}                (4.1.10)

as the error of the approximation (4.1.7) - (4.1.9) under different random occurrences ω_i and ω'_i. If N → ∞, h = h(N) → 0 and h(N)N^{1/2} → ∞, then (4.1.10) approaches zero as N → ∞, and the optimal sequence h_opt(N) minimizing (4.1.10) is of order N^{-1/6}. The minimum of (4.1.10) is cN^{-1/3}, where c is a constant determined by γ, σ and |f'''(x_0)|.
When using the dependent sample procedure, one sets ω'_i = ω_i for i = 1, ..., N in (4.1.8), (4.1.9), i.e. the simulations for x_0+h and x_0-h are accomplished under the same collection of realizations of random numbers. Choosing h arbitrarily small, one then obtains an error of order γσ_1 N^{-1/2} instead of (4.1.10), where σ_1² is the variance of the corresponding difference quotient. In this way, the order of convergence of the approximation as N → ∞ is higher for the dependent sample. Analogous results are valid for the multidimensional case and for higher derivatives as well. Recently, an improvement of the dependent sampling technique was

created for optimization of discrete simulation models of particular classes, for references
see e.g. Suri (1987).
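The gain from dependent sampling is easy to see numerically (the model below is ours, chosen only for illustration): let f(x,ω) = x²(1 + 0.3ε(ω)) with ε uniform on [-1,1], x_0 = 1 and h = 0.01. With independent occurrences the difference quotient has standard deviation of order 1/h; with common random numbers the noise largely cancels and the deviation stays bounded as h shrinks.

```python
import random
import statistics

rng = random.Random(5)

def f_sim(x, u):
    # Simulation response: x^2 distorted by multiplicative noise 0.3 * (2u - 1)
    return x * x * (1.0 + 0.3 * (2.0 * u - 1.0))

x0, h = 1.0, 0.01
indep, crn = [], []
for _ in range(2000):
    u1, u2 = rng.random(), rng.random()
    # Different occurrences omega_i, omega_i' in (4.1.8)-(4.1.9)
    indep.append((f_sim(x0 + h, u1) - f_sim(x0 - h, u2)) / (2 * h))
    # Dependent sampling: omega_i' = omega_i, the same u in both evaluations
    crn.append((f_sim(x0 + h, u1) - f_sim(x0 - h, u1)) / (2 * h))

std_indep = statistics.pstdev(indep)   # of order sigma / h: large for small h
std_crn = statistics.pstdev(crn)       # bounded in h; true derivative is 2
```

For this multiplicative-noise model the common-random-numbers quotient equals 2(1 + ε) exactly, so its mean is the true derivative and its spread does not blow up as h → 0.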
Note that some of the approaches to global optimization (e.g. see Section 6.2.1) need the drawing of a random function g from a parametric functional set G: setting randomness is equivalent to setting a probability measure on a parameter set. The prior information is often insufficient for a reasonable choice of this probability measure, and frequently a uniform distribution on the parameter set is used. Such a choice of randomness on the parameter set might well lead to exotic random functions from G. Kolesnikova et al. (1987) proposed to construct the probability distribution by defining uniformity properties on the set of realizations of the random functions from G; see also Zhigljavsky (1985).
It should also be noted that global random search theory can be regarded as a part of the theory of the Monte Carlo method, if the latter concept is interpreted in a broad sense.

4.1.5 Design of experiments

The following basic kinds of experiments are distinguished (see Ermakov and Zhigljavsky (1987)): regression, factorial, extremal, simulation (i.e. Monte Carlo), and discrimination (including screening) for hypothesis testing. In the above subsections, possible applications of procedures for the design of regression, extremal and simulation experiments have been discussed. The theory of screening experiments is also important for global search theory: it consists in determining the basic potential for the construction of algorithms, in the construction itself, and in separating from many factors the several ones that define the relation at issue. Some applications of screening experiment theory in global search were investigated by Saltenis (1989); see also Section 2.3.2.

4.2 Statistical inference concerning the maximum of a function

This is one of the principal sections of this chapter (and often referred to later): its results can be widely used in global random search theory and its applications.
Subsection 4.2.1 states the problem of drawing statistical inferences for M = max f and presents the main conditions applied in the section. Subsection 4.2.2 outlines the ways of constructing statistical inferences for M. Subsections 4.2.3 and 4.2.4 describe statistical inferences for M, borrowing them to some extent from Chapter 7. Subsection 4.2.5 deals with the problem of estimating the c.d.f. F(t) defined by (4.1.2). Subsection 4.2.6 contains a number of results concerning the prior determination of the value of the tail index of the extreme value distribution: these results are of great importance for global random search theory. Finally, Subsection 4.2.7 deduces from the previous parts of the section a result on the exponential complexity of the uniform random sampling algorithm.
Due to their significance for global random search theory, the main findings of the section are formulated not only for the maximization problem, but also for the minimization one.

4.2.1 Statement of the problem and a survey of the approaches to its solution

At each step of most global random search algorithms, there is a sample of points from subsets Z ⊂ X together with the values of the objective function f at these points; the distributions for subsequent points are constructed after drawing certain statistical inferences concerning the behaviour of the objective function. The parameter

M_Z = sup_{x∈Z} f(x)

and the behaviour of the c.d.f. (4.1.2) in the vicinity of M_Z (i.e. for t such that F_Z(t) ≈ 1), which is unambiguously related to the behaviour of the objective function near the point

x*_Z = arg max_{x∈Z} f(x),

are of primary importance in making a decision about the prospectiveness of Z ⊂ X for further search.
Since for all sets Z and at various iterations statistical inferences are drawn in a similar manner, they will be drawn only for M = M_X = max f and F = F_X, through an independent sample Ξ = {x_1, ..., x_N} from a distribution P on X satisfying the condition P(B(ε)) > 0 for all ε > 0. It will also be assumed that X ⊂ R^n and that f is evaluated without noise. The elements x_1, ..., x_N of the independent sample Ξ are mutually independent random vectors (a way of generating them can either consist of the direct sampling of a distribution or include iterations of a local ascent starting at random points, see Section 3.1.6). Some generalizations to the cases of random noise and of dependent elements of Ξ can be found in Sections 4.2.8 and 4.4.

Having a sample Ξ, pass to the sample

{y_1, ..., y_N},                (4.2.1)

where y_i = f(x_i) for i = 1, ..., N are independent realizations of the random variable η with c.d.f.

F(t) = ∫_{f(x)<t} P(dx) = P({x ∈ X: f(x) < t}),                (4.2.2)

and to the order statistics η_1 ≤ η_2 ≤ ... ≤ η_N corresponding to the sample (4.2.1). The parameter M = max f is the upper bound of the random variable η (M = vrai sup η), i.e.

Pr{η ≤ M - ε} < 1

holds for any ε > 0. Now, the problem of drawing statistical inferences for M = max f is stated as follows: given the sample (4.2.1) of independent realizations of the random variable η with c.d.f. (4.2.2), statistical inferences for the upper bound

M = vrai sup η                (4.2.3)

are to be made. The statistical inferences described below can both provide supplementary information on the objective function at each iteration of numerous global random search algorithms (in particular, in order to construct suitable stopping rules) and support the construction of various branch and probability bound methods (see later Section 4.3).
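The need for such inferences is easy to see numerically (an illustrative sketch with a problem of our choosing): for N uniform points the record value η_N = max y_i always falls short of M, so M has to be extrapolated from the tail of the sample rather than read off directly.

```python
import random

rng = random.Random(6)

# Illustrative problem: maximize f on the unit square; M = 0, x* = (0.3, 0.6)
def f(x1, x2):
    return -((x1 - 0.3) ** 2 + (x2 - 0.6) ** 2)

N = 1000
ys = sorted(f(rng.random(), rng.random()) for _ in range(N))  # sample (4.2.1), ordered
eta_N = ys[-1]                  # largest order statistic: a lower estimate of M

gap = 0.0 - eta_N               # positive a.s.: the record never reaches M
```

The gap shrinks as N grows but never vanishes, which is why the section builds estimators and confidence bounds for M from the joint behaviour of the top order statistics.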
For convenience, the main conditions to be applied in Section 4.2 are collected below.
(a) The function

V(v) = 1 − F(M − 1/v),   v > 0,

is regularly varying at infinity with some exponent (−α), 0 < α < ∞, i.e.

lim_{v→∞} V(tv)/V(v) = t^{−α}   (4.2.4)

for all t > 0.


(b) The representation

F(t) = 1 − c0(M − t)^α + o((M − t)^α),   t ↑ M,   (4.2.5)

is valid, where c0 > 0 is an appropriate constant.


(c) The global maximizer x* = arg max f is unique and f is continuous in a vicinity of
x*.

(d) There is an ε0 > 0 such that the sets

A(ε) = {x ∈ 𝒳 : |f(x) − f(x*)| ≤ ε}

are connected for all ε, 0 < ε < ε0.


(e) There is a c1 > 0 such that the relation

lim_{ε→0} ε^{−n} P(B(ε)) = c1   (4.2.6)

holds.

(f) In a vicinity of x* the representation

|f(x) − f(x*)| = w(‖x − x*‖) H(x − x*) + o(‖x − x*‖^β),   ‖x − x*‖ → 0,   (4.2.7)

is valid, where H is a positive homogeneous function on ℝⁿ\{0} of order β > 0 (for H the
relation H(λz) = λ^β H(z) holds for all λ > 0, z ∈ ℝⁿ) and w is a positive, continuous, one-
dimensional function.
(g) There exists a function U: (0,∞) → (0,∞) such that for any x ∈ ℝⁿ, x ≠ 0, the limit

lim_{t↓0} |f(x*) − f(x* + tx)| U(t)   (4.2.8)

exists and is positive.


(h) There exist positive numbers ε1, c1 and c2 such that for all ε, 0 < ε ≤ ε1, the two-
sided inequality

c1 εⁿ ≤ P(B(ε)) ≤ c2 εⁿ   (4.2.9)

is valid.
(i) There exist positive numbers ε2, β, c3 and c4 such that for all x ∈ B(ε2) the
inequality

c3 ‖x − x*‖^β ≤ |f(x) − f(x*)| ≤ c4 ‖x − x*‖^β   (4.2.10)

holds.
If the minimization problem for f is considered, then one has to deal with the lower
bound vrai inf η = min f of the random variable η, make the evident substitutions in the
definitions of x* and M, and change the conditions (a) and (b) into:

(a') the function U(u) = F(M − 1/u), where u < 0 and M = vrai inf η > −∞, is regularly
varying at infinity with some exponent (−α), i.e.

lim_{u→−∞} U(tu)/U(u) = t^{−α}

for all t > 0;

(b') the representation

F(t) = c0(t − M)^α + o((t − M)^α),   t ↓ M,

is valid, where c0 > 0 is a constant.

All other conditions are unchanged.

4.2.2 Statistical inference construction for estimating M

Several basic approaches to estimating the optimum value M = max f will be outlined below.
An approach involving the construction of a parametric (e.g. quadratic in x)
regression/approximation model of f(x) in the vicinity of a global maximizer is often used
in the case when evaluations of f are subject to a random noise and it is likely that a
vicinity of x* has been reached where the objective function behaves well (for example, f is
locally unimodal and twice continuously differentiable), see later Section 8.1.
Another approach (see Mockus (1967), Hartley and Pfaffenberger (1971),
Chichinadze (1967, 1969), Biase and Frontini (1978)) is based on the assumption that the
c.d.f. (4.2.2) is determined (identified) up to a certain number of parameters that are
estimated via (4.2.1) by means of standard mathematical statistics tools. Essentially, the
results of Betro (1983, 1984) (described later in Section 4.6) can also be classified under this
general approach. It has three main drawbacks:
(i) for many commonly used classes of objective functions, the parametric models are not
adequate, and there do not yet exist constructive descriptions of functional classes for
which these models have acceptable accuracy;
(ii) the presence of redundant parameters decreases the estimation accuracy of M and
increases the computational effort; and
(iii) construction of a confidence interval for M and testing statistical hypotheses about M
are frequently impossible, although these are of primary importance in optimization problems
(an exception being the method of Hartley and Pfaffenberger (1971), designed especially
for confidence interval construction).
Certainly, a non-parametric approach for estimating the c.d.f. (4.2.2) together with M can
be used, but high efficiency can hardly be expected of it. Generally, the semi-parametric
approach described below is more efficient than both of those mentioned. Another of its
advantages is that it is theoretically justified (if the sample size N is large enough). This semi-
parametric approach is based on the following classical result of the asymptotic theory of
extreme order statistics, see e.g. Galambos (1978).

Theorem 4.2.1. Let M = vrai sup η < ∞, where η is a random variable with c.d.f. F(t), and
let the condition (a) of Section 4.2.1 be fulfilled. Then

Pr{(ηN − M)/XN ≤ z} → Φα(z)   for N → ∞,   (4.2.11)

where

XN = M − aN + o(1)   for N → ∞,   (4.2.12)

aN = inf{v : 1 − F(v) ≤ N^{−1}},   (4.2.13)

Φα(z) = exp{−(−z)^α}   for z < 0,
Φα(z) = 1               for z ≥ 0.   (4.2.14)

The distribution having the c.d.f. (4.2.14) will be called an extreme value distribution and
α its tail index. The c.d.f. Φα(z), including the limit case Φ∞(z) = exp{−e^{−z}}, is the
unique nondegenerate limit of the c.d.f. sequences of (ηN − aN)/bN in the case M < ∞, where
{aN} and {bN} are arbitrary numerical sequences. The condition (a) is a regularity
condition for the c.d.f. F in the vicinity of M: only some exotic functions fail to satisfy it,
see Gumbel (1958), Galambos (1978). In particular, it is fulfilled if the natural condition
(b) is met.
The asymptotic representation (4.2.11) means that the sequence of random variables
(ηN − M)/XN, where ηN is the maximal order statistic of the sample (4.2.1), converges in
distribution to the random variable with c.d.f. (4.2.14).
As shown below, the convergence rate of F(t) to 1 for t ↑ M is mainly determined by the
tail index α of the extreme value distribution. The main difficulty in applying the results
of extreme order statistics theory is that α may not be known in
practice. For this reason, e.g. Clough (1969) suggested taking α = 1, which is, of course,
generally false. Dannenbring (1977), Makarov and Radashevich (1981) and some others
suggested, following Gumbel's (1958) recommendation, to estimate α, M, and aN jointly by
taking a sample of size N of independent maximal order statistics corresponding to samples of
the r.v. η. This approach requires of the order of N² evaluations of f and has the drawbacks
(ii) and (iii) detailed above.
The approach below also relies on the theory of extreme order statistics, but is free of
the drawbacks mentioned. It is based on some recent advances of mathematical statistics
described later in Chapter 7, enabling a prior determination of the value of the tail
index α for a wide range of objective function classes.
The present approach was developed by Zhigljavsky (1981, 1985, 1987),
independently of other authors (cited below) who obtained similar results. The main
similarity is between Theorem 4.2.2 and the outstanding result of de Haan (1981), which will
be properly discussed.
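Theorem 4.2.1 can be checked by simulation. The sketch below is a minimal illustration, not taken from the text: the test c.d.f. F(t) = 1 − (M − t)^α and the normalization XN = N^(−1/α) are our own choices for this example. It draws N points, takes the record ηN, and checks that W = ((M − ηN)/XN)^α is approximately standard exponential, as the Weibull-type limit (4.2.14) predicts.

```python
import random

def record_deviation(N, alpha, M=1.0):
    """Draw N points with c.d.f. F(t) = 1 - (M - t)**alpha near M
    (via eta = M - U**(1/alpha), U uniform on [0,1]) and return
    W = ((M - eta_N) / X_N)**alpha with X_N = N**(-1/alpha);
    by the extreme value limit, W is asymptotically Exp(1)."""
    eta_N = max(M - random.random() ** (1.0 / alpha) for _ in range(N))
    return N * (M - eta_N) ** alpha

random.seed(0)
alpha, N, reps = 2.0, 500, 2000
sample = [record_deviation(N, alpha) for _ in range(reps)]
mean_W = sum(sample) / reps                        # E[Exp(1)] = 1
frac_below = sum(w <= 1.0 for w in sample) / reps  # P(Exp(1) <= 1) ~ 0.632
print(round(mean_W, 2), round(frac_below, 2))
```

With α = 2 and N = 500 the empirical mean of W is close to 1 and roughly 63% of its values fall below 1, matching the Exp(1) limit.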

4.2.3 Statistical inference for M, when the value of the tail index α is known

Let us cite from Chapter 7 some results concerning statistical inference for M: the
corresponding proofs, references, comments as well as additional details and information
will be given in Chapter 7.

We shall suppose below that the condition (a) defined in Section 4.2.1 is satisfied and that
the value of the tail index α ≥ 1 of the extreme value distribution is known. (This is a
realistic assumption for many global random search problems, as will be seen in Section 4.2.6.)
First, let us consider some estimators of M. Linear estimators have the form

M_{N,k} = Σ_{i=0}^{k} a_i η_{N−i},   (4.2.15)

where k is an integer (k may depend on N, i.e. k = k(N), with k²(N)/N → 0 as N → ∞) and
the a_i (i = 0,1,...,k) are real coefficients. The estimator (4.2.15) is consistent for
N → ∞ if and only if

a'λ = Σ_{i=0}^{k} a_i = 1,   (4.2.16)

where a = (a0, a1,...,ak)', λ = (1,1,...,1)'.


The quality of an estimator (4.2.15) satisfying (4.2.16) is measured by the mean
squared error, which is asymptotically represented as

E(M_{N,k} − M)² ~ (c0 N)^{−2/α} a'Λa,   N → ∞,   (4.2.17)

where Λ is the symmetric matrix of order (k+1)×(k+1) with elements

Λij = Γ(2/α + i + 1)Γ(1/α + j + 1)/[Γ(1/α + i + 1)Γ(j + 1)],   j ≥ i.

The optimal coefficient vector

a* = arg min_{a: a'λ = 1} a'Λa   (λ = (1,...,1)')

corresponds to the asymptotically optimal linear estimator M*_{N,k} and equals

a* = Λ^{−1}λ/(λ'Λ^{−1}λ).   (4.2.18)

The explicit form of the components a_i* (i = 0,1,...,k) of the vector a* is as follows:

a0* = c(α + 1)/Γ(2/α + 1),

aj* = c(α − 1)Γ(j + 1)/Γ(2/α + j + 1)   for j = 1,...,k − 1,

ak* = −c(kα + 1)Γ(k + 1)/Γ(2/α + k + 1),

where

c = [αΓ(k + 2)/((α − 2)Γ(2/α + k + 1)) − 2/((α − 2)Γ(2/α + 1))]^{−1}   (α ≠ 2)

is the normalization constant (c can also be computed from (4.2.16)).


Under some supplementary conditions (namely, α > 2, k = k(N) → ∞, k²(N)/N → 0 for
N → ∞) the estimator M*_{N,k}, defined by (4.2.15) and (4.2.18), is asymptotically Gaussian
and asymptotically optimal. Its mean squared error is asymptotically equal to

(4.2.19)

for N → ∞, where c0 is a constant that can be estimated by the estimator

(4.2.20)

which is consistent and asymptotically unbiased.

Some other linear estimators are also asymptotically Gaussian and asymptotically
optimal (an analogue of (4.2.19) is valid for them). For instance, the estimator (4.2.15)
with coefficients

a_i = (k + 1)^{(2−α)/α}(α − 1)[(i + 1)^{1−2/α} − i^{1−2/α}]   for i = 0,1,...,k − 1,

a_k = (k + 1)^{(2−α)/α}{(α − 1)[(k + 1)^{1−2/α} − k^{1−2/α}] − (α − 2)}   (4.2.21)

has these properties.


The same property is possessed by the maximum likelihood estimator M̂, which is the
solution of the equation

(α − 1) Σ_{j=0}^{k−1} β_j(M̂) = k + 1   (4.2.22)

under the condition M̂ ≥ ηN; here the functions β_j(M) are given by

(4.2.23)

The maximum likelihood estimator M̂ is nonlinear and cannot be represented in explicit
form.
Consider now ways of constructing confidence intervals for M with asymptotic
confidence level 1 − γ, where a small positive number γ is fixed.
If k is large, then to construct a confidence interval one may use the asymptotic
normality property mentioned above and the expression (4.2.19), for the optimal linear
estimator, for the linear estimator defined by (4.2.15) and (4.2.21), and for the maximum
likelihood estimator. If N is not sufficiently large, say N < 500, then k cannot be chosen
sufficiently large and the suggested method is inapplicable.
Instead of the asymptotic normality property, one may also use the Chebyshev
inequality, which gives for linear estimators the approximate relation

(4.2.24)

for N → ∞, where c0 can be estimated by (4.2.20). This approach leads to less precise
confidence intervals than the preceding one, but is applicable for any k ≥ 1.
The one-sided confidence interval

[ηN, ηN + r_{k,1−γ}(ηN − η_{N−k})],   (4.2.25)

where

r_{k,1−γ} = [(1 − γ^{1/k})^{−1/α} − 1]^{−1},

is usually narrower than the one-sided interval based on (4.2.24). It should be noted that
the one-sided confidence intervals for M, with ηN as a natural lower bound, are of more
interest for global random search theory than the usual two-sided ones.
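The interval (4.2.25) is easy to exercise in simulation. The sketch below is illustrative only: it assumes the interval has the form [ηN, ηN + r_{k,1−γ}(ηN − η_{N−k})] with r_{k,1−γ} as written above, and it uses the test c.d.f. F(t) = 1 − (M − t)^α, which is our own choice. It checks that the empirical coverage of M is close to the nominal level 1 − γ.

```python
import random

def one_sided_ci(ys, k, alpha, gamma):
    """One-sided interval for M = vrai sup eta in the form (4.2.25):
    [eta_N, eta_N + r * (eta_N - eta_{N-k})], with
    r = ((1 - gamma**(1/k))**(-1/alpha) - 1)**(-1)."""
    ys = sorted(ys)
    eta_N, eta_Nk = ys[-1], ys[-1 - k]
    r = 1.0 / ((1.0 - gamma ** (1.0 / k)) ** (-1.0 / alpha) - 1.0)
    return eta_N, eta_N + r * (eta_N - eta_Nk)

# coverage check on the illustrative c.d.f. F(t) = 1 - (M - t)**alpha
random.seed(1)
M, alpha, N, k, gamma, reps = 1.0, 2.0, 1000, 10, 0.1, 2000
hits = 0
for _ in range(reps):
    ys = [M - random.random() ** (1.0 / alpha) for _ in range(N)]
    lo, hi = one_sided_ci(ys, k, alpha, gamma)
    hits += (lo <= M <= hi)
coverage = hits / reps
print(round(coverage, 3))   # close to 1 - gamma = 0.9
```

For this particular family of distributions the coverage is very nearly exact even at moderate N, since the ratio of the top spacings has a parameter-free distribution.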
Statistical hypothesis testing procedures for M are of great importance in the present
context, too. The standard situation here is as follows. Let there be a record value M0 and
an independent sample (4.2.1) of values of η = f(ξ), where ξ is a random vector distributed
on a subset Z of 𝒳. Suppose also that

ηN ≤ M0

and it is to be decided whether the function f can attain values in Z greater than M0 (i.e.
whether f can achieve its global maximum in Z). In other words, one has to test the
statistical hypothesis H0: MZ > M0 against the alternative H1: MZ ≤ M0.
For testing the hypothesis H0, one can use approaches similar to those stated for confidence
interval construction. According to the approach leading to (4.2.25), the rejection region
for H0 is

W = {Y : ηN + r_{k,1−γ}(ηN − η_{N−k}) < M0},   (4.2.26)

where r_{k,1−γ} is given by (4.2.25). In this test the power function

β_N(M, γ) = Pr{Y ∈ W}

is a decreasing function of M; for N → ∞ it admits an explicit asymptotic representation
in the form of an incomplete-gamma-type integral depending on k, α, r_{k,1−γ} and M0 − M.


A related numerical study shows that for moderate values of N (say N ≤ 100) it is
advantageous to choose k according to the rule N/20 ≤ k ≤ N/10; for large N (N > 100) it is
often sufficient to take k not greater than 10. For such a choice of k, the precision of the
statistical inferences (4.2.25) and (4.2.26) is almost as good as for the asymptotically
optimal choice k(N) → ∞, k²(N)/N → 0 for N → ∞.
If the minimization problem for f is stated, then the condition (a') is to be substituted for
(a) and, in the suitable places of the formulas of this subsection, η_{i+1} is to be substituted for
η_{N−i} for all i = 0,1,...,k. Thus, one has to test the hypothesis

H0: MZ < M0

instead of H0: MZ > M0, and should use the formulas

M_{N,k} = Σ_{i=0}^{k} a_i η_{i+1},   (4.2.27)

(4.2.28)

and

(4.2.29)

replacing (4.2.15), (4.2.20), (4.2.23), (4.2.25) and (4.2.26), respectively; the other
formulas of the subsection given above are unchanged.
We turn now to another approach to constructing confidence intervals for M, which may
be termed the improved Hartley-Pfaffenberger method. Let us start by quoting an
auxiliary result from Hartley and Pfaffenberger (1971).
It is well known that ti = F(ηi), i = 1,...,N, form the order statistics corresponding to an
independent sample from the uniform distribution on [0,1], and their means and
covariances are E ti = μi, cov(ti, tj) = vij (for i,j = 1,...,N), where

μi = i/(N + 1),   vij = vji,   vij = μi(1 − μj)/(N + 2)   for i ≤ j.

Note that vij = vN−j+1,N−i+1 for i ≤ j. Define the k-vectors

t = (tN−k+1,...,tN)',   μ = (μN−k+1,...,μN)'.

The covariance matrix V of t is symmetric, of order k×k, with elements vij for
1 ≤ i ≤ j ≤ k. According to Hartley and Pfaffenberger (1971), the quadratic form

s²(k,N) = (t − μ)'V^{−1}(t − μ)

can be written as

s²(k,N) = (N + 1)(N + 2)[t²_{N−k+1}/(N − k + 1) + Σ_{i=N−k+2}^{N+1} (t_i − t_{i−1})²] − (N + 2),

where t_{N+1} = 1. The exact distribution of s²(k,N) is rather complicated: a table of
numerically evaluated γ-quantiles s_γ of this distribution is given in the work mentioned.
The dependence of s_γ on N is rather mild: for γ = 0.05, 0.01 and k = 5, 10, 15, 20 one may
approximately take s_{0.05} = 15, 25, 33, 40 and s_{0.01} = 28, 40, 50, 55, respectively.
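Since s²(k,N) is the quadratic form (t − μ)'V^{−1}(t − μ), its expectation equals k (the trace of V^{−1}V). Assuming the spacings expression quoted above, this provides a quick numerical sanity check; the sketch below (Python, with illustrative parameter values) averages s²(k,N) over simulated uniform samples.

```python
import random

def s2(us, k):
    """Hartley-Pfaffenberger quadratic form s^2(k,N), computed from the
    spacings expression quoted above (with t_{N+1} = 1); us is a sample
    of N uniform [0,1] values.  E s^2 = k if the expression is exact."""
    N = len(us)
    t = sorted(us)
    t.append(1.0)                 # t_{N+1} = 1
    top = t[N - k:]               # t_{N-k+1}, ..., t_{N+1} (1-based indexing)
    val = top[0] ** 2 / (N - k + 1)
    val += sum((top[i] - top[i - 1]) ** 2 for i in range(1, k + 1))
    return (N + 1) * (N + 2) * val - (N + 2)

random.seed(4)
N, k, reps = 200, 10, 2000
mean_s2 = sum(s2([random.random() for _ in range(N)], k) for _ in range(reps)) / reps
print(round(mean_s2, 1))   # near k = 10
```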
Hartley and Pfaffenberger (1971) used the unrealistic supposition that the c.d.f.
F(t) can be represented by a Taylor expansion at t = M, rather than a condition of type (a). We
shall suppose instead that (b) is fulfilled and that k is such that the extreme order statistics
ηN,...,η_{N−k+1} fall into a vicinity of M, where the approximation

F(t) ≈ 1 − c0(M − t)^α

is valid with rather high accuracy. Thus we may set

t_{N−i} = F(η_{N−i}) = 1 − c0(M − η_{N−i})^α   for i = 0,1,...,k − 1

and put these into the expression for s²(k,N), which then depends on c0 and M as well as on
k and N. Denoting this function by s²(k,N,M,c0) and setting

s²(k,N,M) = min_{c0>0} s²(k,N,M,c0),

we come to the upper confidence bound M_{1−γ} for M of level 1 − γ, defined as the solution
of the equation s²(k,N,M_{1−γ}) = s_γ under the condition M_{1−γ} ≥ ηN. The corresponding
confidence interval for M is [ηN, M_{1−γ}].
Since s²(k,N,M,c0) as a function of c0 is a polynomial of degree two, the expression
for s²(k,N,M) is easily derived:

s²(k,N,M) = s²(k,N,M,c*),

where c* is the minimizer of this polynomial in c0.

In principle, instead of s²(k,N,M,c*) one may use s²(k,N,M,ĉ0) as s²(k,N,M), where ĉ0 is an
estimator of c0 (cf. (4.2.20)).
Roughly speaking, the use of c* corresponds to a max-min approach to confidence
interval construction, and the use of ĉ0 to a Bayesian one.
Note that the improved Hartley-Pfaffenberger approach described above is simpler
than the original one (and, unlike the original method, it is correct). The author knows
nothing, however, about the accuracy of the confidence intervals constructed by applying
the improved Hartley-Pfaffenberger approach.

4.2.4 Statistical inference, when the value of the tail index α is unknown

We continue to cite results from Chapter 7.


Suppose that the condition (a) of Section 4.2.1 holds, but the value of the parameter α
is unknown (in practice this is quite typical).
In this case, one may apply the statistical inferences for M presented above, using an
estimator of α instead of its true value.
Many estimators of α are known: for instance, the estimator

α̂ = log(k/m) / log[(ηN − η_{N−k})/(ηN − η_{N−m})],   (4.2.30)

in which m < k, can be used. If k → ∞, m → ∞, k/N → 0 (for N → ∞), then α̂ is consistent
and asymptotically unbiased. For the practically advantageous case k = 5m, there holds

lim_{k→∞} k E(α̂ − α)² = 1.544 α².
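The estimator (4.2.30) is straightforward to implement. The sketch below (Python with NumPy; the test c.d.f. F(t) = 1 − (M − t)^α is an illustrative choice, not from the text) computes α̂ from the top order statistics. For moderate k and m the estimator carries a noticeable bias, consistently with the requirement k → ∞, m → ∞ above.

```python
import numpy as np

def tail_index_hat(ys, k, m):
    """Estimator (4.2.30): alpha_hat = log(k/m) /
    log((eta_N - eta_{N-k}) / (eta_N - eta_{N-m})), with m < k."""
    top = np.sort(ys)[-(k + 1):]      # eta_{N-k}, ..., eta_N
    eta_N = top[-1]
    return np.log(k / m) / np.log((eta_N - top[0]) / (eta_N - top[-(m + 1)]))

# sanity check on the illustrative c.d.f. F(t) = 1 - (M - t)**alpha, alpha = 2
rng = np.random.default_rng(0)
alpha, M, N, k, m = 2.0, 1.0, 200_000, 500, 100
est = [tail_index_hat(M - rng.random(N) ** (1 / alpha), k, m) for _ in range(50)]
avg = float(np.mean(est))
print(round(avg, 2))   # roughly alpha = 2 (large k, m are needed for low bias)
```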

A more precise estimator than α̂ is the estimator ᾱ given by

(4.2.31)

where α̂ is determined by (4.2.30) and ψ(k) = Γ'(k)/Γ(k) is the ψ-function.
Define a c.d.f. F_k(u) (piecewise, for 0 ≤ u ≤ 1 and for u > 1) and let u_{k,δ} be such that
F_k(u_{k,δ}) = δ. Then the confidence region

(4.2.32)

for α has asymptotic confidence level 1 − γ.


The results of the next subsection show that procedures for testing the statistical
hypothesis H0: α = n/2 against the alternative H1: α = n are of importance in global random
search theory. To construct a procedure for testing this hypothesis, one may choose

(4.2.33)

as the rejection region; the power of the test admits an explicit asymptotic expression for
N → ∞.

In this case the maximum likelihood estimator M̂ for M is the solution of the equation

[Σ_{j=0}^{k−1} log(1 + β_j(M̂))]^{−1} − [Σ_{j=0}^{k−1} β_j(M̂)]^{−1} = 1/(k + 1)

under the condition M̂ ≥ ηN, where β_j(M) is determined by (4.2.23). Under some auxiliary
conditions (including α > 2, N → ∞, k → ∞, k/N → 0), this estimator is asymptotically
Gaussian with mean M and an explicitly computable asymptotic variance.
For large values of N and k, this result may be used for constructing confidence intervals
and statistical testing procedures for M.
If k cannot be large (the case of moderate values of N), then another way of
constructing confidence intervals and hypothesis tests for M is more attractive. Its
essence is the following result of de Haan (1981): if k → ∞, k/N → 0, N → ∞, then the test
statistic

(log k) log[(M − η_{N−1})/(M − ηN)] / log[(η_{N−2} − η_{N−k})/(η_{N−2} − η_{N−1})]   (4.2.34)

asymptotically follows the standard exponential distribution with density e^{−t}, t > 0.

If the minimization problem for f is stated, then the formulas (4.2.30) and (4.2.34) are
to be transformed into

α̂ = log(k/m) / log[(η_{k+1} − η1)/(η_{m+1} − η1)]   (4.2.35)

and

(log k) log[(η2 − M)/(η1 − M)] / log[(η_{k+1} − η3)/(η3 − η2)].

All other formulas are unchanged.

4.2.5 Estimation of F(t)

Estimators of the c.d.f. F(t) defined by (4.2.2) can be widely used in global random
search algorithms for the prediction of further search and the construction of stopping rules
or rules for switching to local search (if such a switch is designed). Again, the behavior
of F(t) in the vicinity of M = vrai sup η (i.e. for t such that F(t) ≈ 1) is of most interest.
The ways of estimating F(t) for t ≈ M are based on the asymptotic representation
(4.2.11), which follows from supposition (a) and can be rewritten in the form

F(t) ≈ exp{−N^{−1}((M − t)/XN)^α}   for t < M, N → ∞.   (4.2.36)

This asymptotic representation is valid for all t < M, but the closer t is to M, the closer
the right- and left-hand sides of (4.2.36) are to each other.
The simplest way of using (4.2.36) for estimating F(t) consists of replacing M, XN
and α by their estimators. Thus, if one uses (4.2.15), (4.2.30) and M_{N,k} − ηN as an
estimator of XN, the estimator

F̂(t) = exp{−N^{−1}((M_{N,k} − t)/(M_{N,k} − ηN))^{α̂}}   (4.2.37)

is obtained for t ≤ M_{N,k}.


If the condition (b) is met, then

aN = M − (c0 N)^{−1/α}(1 + o(1))   for N → ∞,

which implies

XN ~ (c0 N)^{−1/α}   for N → ∞.   (4.2.38)

So one may use

X̂N = (ĉ0 N)^{−1/α}   (4.2.39)

or substitute α̂ for α here if α is unknown.
Combining (4.2.36) and (4.2.39) one obtains

F(t) ≈ exp{−c0(M − t)^α}   for t ↑ M   (4.2.40)

and the corresponding estimator

F̂(t) = exp{−ĉ0(M_{N,k} − t)^{α̂}},   (4.2.41)

where ĉ0 is determined from (4.2.20).

Of course, if an exact value of the tail index α can be determined (see Section 4.2.6),
then this value is to be used in (4.2.37) and (4.2.41) instead of α̂.
If the minimization problem is considered, then the conditions (a') and (b') are to be
substituted for (a) and (b), and the formulas (4.2.36), (4.2.37), (4.2.40) and (4.2.41) are
to be modified accordingly; in particular,

F(t) ≈ 1 − exp{−N^{−1}((t − f*)/XN)^α}   for t > f* = min f, N → ∞,

F(t) ≈ 1 − exp{−c0(t − f*)^α},

F̂(t) = 1 − exp{−ĉ0(t − M_{N,k})^{α̂}},

where XN is such that

XN ~ (c0 N)^{−1/α}   (N → ∞),

the estimator M_{N,k} of f* is determined by (4.2.27), and α̂ is computed using (4.2.35); the
other formulas and notions are the same as above.

4.2.6 Prior determination of the value of the tail index α

Sometimes the value of the tail index α of the extreme value distribution can be
determined, due to the following obvious result: if a c.d.f. F(t) is sufficiently smooth in
a vicinity of M = vrai sup η < ∞ and there exists an integer β such that F^{(i)}(M) = 0 for 0 < i < β
and 0 < |F^{(β)}(M)| < ∞, then (a) is fulfilled and α = β, see Gumbel (1958). This
sufficient condition for (a) is a particular case of (b) for integer β. If one applies the
concepts of fractional differentiation (see Ariyawansa and Templeton (1983) for
references), then the integrality requirement on β can be omitted and the above condition
coincides with (b). Note in passing that the condition (b) is somewhat stricter than (a),
because the latter permits c0 from (4.2.5) to be a slowly varying function.
The distinguishing feature of using the above described statistical procedures in global
random search lies in the fact that the c.d.f. F(t) has the form (4.2.2). Using this fact
and prior knowledge about the behavior of the objective function in the vicinity of a global
maximizer, the tail index α of the extreme value distribution can often be determined
exactly, as is demonstrated below. The basic result in this direction is as follows.

Theorem 4.2.2. Let the conditions (c) through (f) (see Section 4.2.1) be satisfied. Then
the condition (a) holds and α = n/β.

Proof. In terms of the notation used, the condition (a) is represented as follows: the limit
of P(A(tε))/P(A(ε)) for ε ↓ 0 exists for all t > 0 and equals t^α. It is well known (see de Haan
(1970)) that it suffices to require the existence of the limit only for t ∈ [0.5, 2].
Put z = x − x*,

g(z) = f(x*) − f(x* + z) − w(‖z‖)H(z),

g_ε = sup_{0.5≤t≤2} sup_{x∈A(tε)} g(z)/w(‖z‖).

The assumptions (c), (d) and (f), together with the continuity and positivity of w, imply
g_ε/ε → 0 for ε → 0. Further, for each t ∈ [0.5, 2] and ε > 0 there holds

P(A(tε)) = P({x ∈ 𝒳 : H(z) + g(z)/w(‖z‖) ≤ tε}) ≤ P({x : H(z) ≤ tε + g_ε}).

Analogously,

P(A(tε)) ≥ P({x : H(z) ≤ tε − g_ε}),

P(A(ε)) ≤ P({x : H(z) ≤ ε + g_ε}),
P(A(ε)) ≥ P({x : H(z) ≤ ε − g_ε}).

The homogeneity of H gives

μn({z : H(z) ≤ u}) = c u^{n/β}

for all u > 0, where c = μn({z : H(z) ≤ 1}).


From (e) and the relations obtained above, one can deduce for fixed t ∈ [0.5, 2]:

lim sup_{ε→0} P(A(tε))/P(A(ε)) ≤

≤ lim sup_{ε→0} P({x : H(z) ≤ tε + g_ε})/P({x : H(z) ≤ ε − g_ε}) =

= lim sup_{ε→0} μn({z ∈ ℝⁿ : H(z) ≤ tε + g_ε})/μn({z ∈ ℝⁿ : H(z) ≤ ε − g_ε}) =

= lim sup_{ε→0} (tε + g_ε)^{n/β}/(ε − g_ε)^{n/β} = t^{n/β}.

Analogously,

lim inf_{ε→0} P(A(tε))/P(A(ε)) ≥

≥ lim inf_{ε→0} P({x : H(z) ≤ tε − g_ε})/P({x : H(z) ≤ ε + g_ε}) =

= lim inf_{ε→0} (tε − g_ε)^{n/β}/(ε + g_ε)^{n/β} = t^{n/β}.

These relations give the assertion of the theorem.


Let us make some comments on the conditions of the theorem. Conditions (c) and (d)
are typical in the study of global random search algorithms. The uniqueness requirement
for x* and the corresponding condition (d) are imposed only for convenience and may be
relaxed (see Theorem 4.2.4 below).
The condition (e) is satisfied if the measures P and μn are equivalent in a vicinity of x*,
which is either an interior point of 𝒳 or a boundary point situated outside an appendix of
the set 𝒳 of zero Lebesgue measure μn; i.e. the following should be satisfied:

μn(B(ε))/μn({z ∈ ℝⁿ : ‖z‖ ≤ ε}) = const + o(1),   ε → 0.

The condition (f) characterizes the behavior of the objective function near x*. There
are two important special cases of (f). Assume that all the components of ∇f(x*) (the
gradient of f at x*) are finite and non-zero, which usually happens if the extremum is
attained on the boundary. Then H(z) = z'∇f(x*), w = 1, and therefore β = 1 and α = n. If it is
assumed that f is twice continuously differentiable in some vicinity of x*, ∇f(x*) = 0, and
the Hessian ∇²f(x*) is nondegenerate, then w = 1, H(z) = −z'[∇²f(x*)]z, β = 2, and α = n/2.
These two special cases (under somewhat stricter conditions on P and x*) were known
mainly due to de Haan (1981), who deduced them (as well as (a)) from the condition (g)
rather than (f). The former case was also investigated by Patel and Smith (1983) for the
problem of concave minimization under linear constraints.
In cases where it is not evident in advance that (4.2.7) or (4.2.8) holds, but the
growth rate of f near x* is known, the following assertion may be useful.
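The special case α = n/2 can be illustrated numerically: sampling uniformly on [0,1]² and evaluating a function with a smooth interior maximum, the tail index estimated by (4.2.30) should be close to n/2 = 1. The sketch below (Python with NumPy; the quadratic test function and all parameter values are illustrative choices) does this.

```python
import numpy as np

def tail_index_hat(ys, k, m):
    # estimator (4.2.30) applied to the top order statistics of the sample
    top = np.sort(ys)[-(k + 1):]
    eta_N = top[-1]
    return np.log(k / m) / np.log((eta_N - top[0]) / (eta_N - top[-(m + 1)]))

# f has a unique interior maximum at (0.5, 0.5) with nondegenerate Hessian,
# so for uniform sampling in n = 2 dimensions the theory gives alpha = n/2 = 1
rng = np.random.default_rng(1)
N, k, m = 200_000, 500, 100
est = []
for _ in range(50):
    x = rng.random((N, 2))
    ys = -np.sum((x - 0.5) ** 2, axis=1)   # f(x) = -||x - c||^2, M = 0
    est.append(tail_index_hat(ys, k, m))
avg = float(np.mean(est))
print(round(avg, 2))   # near n / beta = 1
```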
Theorem 4.2.3. Let the conditions (a), (c), (d), (h) and (i) of Section 4.2.1 be met.
Then α = n/β.

Proof. Put ε3 = min{ε1, ε2}.

It follows from (c) that ε3 > 0, and from (h) and (i) that for all u, 0 < u ≤ ε3, and
t ∈ [0.5, 2], a two-sided bound on P(A(tu))/P(A(u)) holds. Hence,

(4.2.42)

By virtue of (a), the limit of P(A(tu))/P(A(u)) exists for u ↓ 0 and any t ∈ [0.5, 2], and it
equals t^α. It follows from (4.2.42) that any value of α different from α = n/β is
inadmissible. The proof is complete.
The conditions of Theorem 4.2.3 are slightly weaker than those of Theorem 4.2.2,
except for the requirement that (a) be met; but Theorem 4.2.2 not only infers the
expression for the tail index α, it also substantiates the fact that convergence to the
extreme value distribution takes place.
140 Chapter 4

The following statement is aimed at relaxing the uniqueness requirement for the global
maximizer.

Theorem 4.2.4. Let P be equivalent to the Lebesgue measure μn on 𝒳 and let the
function f attain its global maximum at a finite number of points x_i* (i = 1,2,...,ℓ), in
whose vicinities the tail indexes α_i can be estimated or determined. Then the condition (a)
for the c.d.f. F(t) is met and α = min{α1,...,αℓ}.
The assertion follows from the fact that under our assumptions f is represented as

f = Σ_{i=1}^{ℓ} f_i,

where

supp f_i = 𝒳_i ∈ 𝔅,   𝒳_i ∩ 𝒳_j = ∅   for i, j = 1,...,ℓ, i ≠ j,

and each f_i (i = 1,...,ℓ) is a measurable function attaining the global maximum M at the
unique point x_i*, as well as from the two following lemmas.

Lemma 4.2.1. If

𝒳_i = supp f_i,   𝒳_i ∩ 𝒳_j = ∅   for i ≠ j (i, j = 1,...,ℓ),

then

F(t) = Σ_{i=1}^{ℓ} p_i F_i(t),

where F_i is the c.d.f. (4.2.2) corresponding to the probability measure P_i,

p_i = P(𝒳_i),   Σ_{i=1}^{ℓ} p_i = 1,   P_i(A) = P(A ∩ 𝒳_i)/p_i   for A ∈ 𝔅.

The proof is evident:

F(t) = ∫ 1_{[f(x)<t]} P(dx) = Σ_{i=1}^{ℓ} ∫_{𝒳_i} 1_{[f(x)<t]} P(dx) = Σ_{i=1}^{ℓ} p_i F_i(t).

Lemma 4.2.2. If the condition (a) with parameter α = α_i is fulfilled for the c.d.f. F_i(t),
i = 1,...,ℓ, then it is also met for the c.d.f.

F(t) = Σ_{i=1}^{ℓ} p_i F_i(t)   (where p_i > 0 for i = 1,...,ℓ, Σ_{i=1}^{ℓ} p_i = 1)

with the value α_m = min{α1,...,αℓ} of the tail index.

Proof. Represent the functions

V_i(v) = 1 − F_i(M − 1/v)   (v > 0, i = 1,...,ℓ)

as

V_i(v) = L_i(v) v^{−α_i},

where the L_i(v) are slowly varying functions, i.e. functions such that the limit of
L_i(tv)/L_i(v) exists and equals 1 for any t > 0 as v → ∞.
It suffices to demonstrate that as v → ∞ the ratio

A(t,v) = t^{α_m}(1 − F(M − 1/(tv)))/(1 − F(M − 1/v))

has limit 1. We have

A(t,v) = t^{α_m} Σ_{i=1}^{ℓ} p_i V_i(tv) / Σ_{j=1}^{ℓ} p_j V_j(v) = Σ_{i=1}^{ℓ} p_i B_i(t,v),

where

B_i(t,v) = t^{α_m} V_i(tv) / Σ_{j=1}^{ℓ} p_j V_j(v).

If α_i ≠ α_m, then B_i(t,v) → 0 (v → ∞) by virtue of the properties of slowly varying
functions (see de Haan (1970)). If α_i = α_m and α_j > α_i for all j ≠ i, then B_i(t,v) → 1/p_i
(v → ∞). The existence of the limits in these cases is obvious. It remains to consider the
case where several α_i's are simultaneously equal to α_m.
We shall prove the following: if the c.d.f.'s F_j(t), j = 1,...,r, meet the condition (a)
with index α, then the c.d.f.

F_0(t) = Σ_{j=1}^{r} q_j F_j(t)   (where q_j > 0 for j = 1,...,r, Σ_{j=1}^{r} q_j = 1)

meets (a) with index α as well. If this assertion is true, then by grouping together the terms
with α_j = α_m we reach the case where α_j > α_m for all j except one.

The desired assertion will be proved for r = 2; by induction it can be extended to
arbitrary r.
Through a transformation similar to the above ones, we obtain

A_0(t,v) = t^α[1 − F_0(M − 1/(tv))]/[1 − F_0(M − 1/v)] =

= q1[q1 L1(v)/L1(tv) + q2 L2(v)/L1(tv)]^{−1} + q2[q1 L1(v)/L2(tv) + q2 L2(v)/L2(tv)]^{−1}.

Putting

c = lim_{v→∞} L1(v)/L2(v)

(the limit exists, because V1 and V2 are monotone), we obtain

lim_{v→∞} L1(v)/L2(tv) = lim_{v→∞} L1(v)/L2(v) · lim_{v→∞} L2(v)/L2(tv) = c.

Similarly,

lim_{v→∞} L2(v)/L1(tv) = 1/c.

Therefore, the limit of A_0(t,v) for v → ∞ exists and equals

q1/(q1 + q2/c) + q2/(q1 c + q2) = 1.

This completes the proof of the lemma.


Let us present some generalizations of the above results, permitting us to extend the
domain of applicability of the apparatus of this subsection. We shall omit detailed
proofs and discussions, restricting ourselves to short comments. Note that an alternative
approach to the exact evaluation of the tail index α was treated in Dorea
(1987), mainly for the univariate case (i.e. n = 1).

Proposition 4.2.1. Assume that the set of global maximizers 𝒳* = {arg max f} is a
continuously differentiable m-dimensional manifold (0 ≤ m ≤ n−1), f is continuous in a
vicinity of the set 𝒳*, and the conditions (d), (e) and (f) are satisfied for all x* ∈ 𝒳*.
Then (a) is satisfied and α = (n − m)/β.
The proof is similar to that of Theorem 4.2.2.
From Proposition 4.2.1 and Lemma 4.2.2 the next proposition follows directly.

Proposition 4.2.2. Assume that

𝒳* = ∪_{j=1}^{ℓ} 𝒳_j*,

where 𝒳_j* is a continuously differentiable m_j-dimensional manifold (0 ≤ m_j ≤ n−1,
j = 1,...,ℓ); f is continuous in a vicinity of 𝒳*, and (d), (e) and (f) are satisfied for all
x* ∈ 𝒳_j* (j = 1,...,ℓ) if c1(j) and β_j are substituted, respectively, for c1 and β. Then (a) is
fulfilled and

α = min_{1≤j≤ℓ} (n − m_j)/β_j.

Another generalization of Theorem 4.2.2 is the following.

Proposition 4.2.3. Let all the conditions of Theorem 4.2.2 be fulfilled, but let H(z), rather
than being a homogeneous function of order β, be representable as

H(z) = H1(z(1)) + H2(z(2)) + ... + Hℓ(z(ℓ)),

where the H_j: ℝ^{n_j} → ℝ are homogeneous functions of orders β_j > 0,
z = (z(1),...,z(ℓ)) ∈ ℝⁿ, the z(j) are disjoint groups of n_j variables of the vector z, and
n1 + ... + nℓ = n. Then (a) is satisfied and

α = Σ_{j=1}^{ℓ} n_j/β_j.
For the simplest case of n = 2, ℓ = 2, the proof differs from that of Theorem 4.2.2 in the
following: for u → 0 there holds

P({z : H1(z(1)) + H2(z(2)) ≤ u}) = P({z : H1(u^{−1/β1} z(1)) + H2(u^{−1/β2} z(2)) ≤ 1}) ≈

≈ c1 μ2({z : H1(u^{−1/β1} z(1)) + H2(u^{−1/β2} z(2)) ≤ 1}) = c1 u^{1/β1 + 1/β2} ∬_{x+y≤1} D(x,y) dx dy,

where D stands for the Jacobian corresponding to the change of variables x = H1(z(1)),
y = H2(z(2)). For the general case the proof is analogous.
The case of ℓ = 2, β1 = 1, β2 = 2 typifies situations where the above result proves useful:
this case corresponds to non-zero first derivatives of f at x* with respect to the variables
z(1), and to zero first derivatives and a non-degenerate Hessian matrix with respect to the
variables z(2).
A further generalization of the situations where Theorem 4.2.2 and similar
assertions can be used is based on the possibility of exactly determining the tail
index α in the case where the objective function values f(x) are evaluated with a random
noise whose distribution does not depend on the location of x ∈ 𝒳 and is concentrated on a
finite interval. The generalization relies upon the following assertion, which can readily be
proved.

Proposition 4.2.4. Let the condition (a) be satisfied for the distributions F1, F2 and
F1*F2, the tail index of the extreme value distribution for F_i being α_i (i = 1,2). Then for
the distribution F1*F2 the tail index is α = max{α1, α2}.
It follows from Proposition 4.2.4 that if the tail index corresponding to the distribution
of the random variable f(ξ) is α1 and that of the noise ζ distribution is α2, then
max f + vrai sup ζ is the upper bound of the random variable y = f(ξ) + ζ, and only
max{α1, α2} can be the tail index for y.
A further generalization of the situations to which the above apparatus is applicable is
possible if one abandons the independence assumption for the realizations x_i of the random
vector ξ having distribution P(dx). To this end, it should be noted that the apparatus of
statistical inference for the bounds of random variables is based on the fact that the
distributions of the random variables (ηN − M)/(M − aN) converge to the extreme value
distribution for N → ∞. This fact was shown (see Resnick (1987), Lindgren and Rootzen
(1987)) to be true not only in the case of independent random variables y1, y2,..., but to be
generalizable to cases where this sequence is a homogeneous Markov chain converging
exponentially to a stationary distribution, or a strictly stationary m-dependent sequence, or a
sequence of symmetrically dependent random variables. The apparatus described in this
section is thus applicable to the majority of the algorithms presented in Part 2. Section 4.4
will investigate another case of a dependent sample Y = {y1,...,yN} for which the above
apparatus is generalizable.

4.2.7 Exponential complexity of the uniform random sampling algorithm

As a consequence of the results of the present section, we shall obtain a result on the exponential complexity of Algorithm 3.1.1. The mean length of the one-sided confidence interval (4.2.25) for M = max f of fixed level 1 − γ is taken as the measure of algorithm accuracy; we shall study the growth rate of the number N of evaluations of f required for reaching a given accuracy as n → ∞.

Theorem 4.2.5. Let assumption (b) and the conditions of Theorem 4.2.2 or 4.2.3 be fulfilled. Then, in order to make the asymptotic mean length of the confidence interval (4.2.25) equal to ε, the number N of objective function evaluations in the algorithm of uniform random sampling of points in X has to grow at the rate N ~ c_0 c^n (n → ∞), where the parameter c_0 has the same sense as in (b) and

c = [(Ψ(k + 1) − Ψ(1))/(−ε log(1 − (1−γ)^{1/k}))]^{1/β},   Ψ(u) = Γ'(u)/Γ(u)

is the psi-function. The theorem's assertion follows from

Lemma 4.2.3. Let the conditions of Theorem 4.2.5 be fulfilled. Then the mean length of the one-sided confidence interval (4.2.25) is asymptotically equal (for N → ∞, n → ∞) to

(M − θ_N)(Ψ(k + 1) − Ψ(1))/(−log(1 − (1−γ)^{1/k})).

Proof. It follows from Lemma 7.1.2 (given later) that the mean length of the confidence interval (4.2.25) for N → ∞ equals Δ_1 Δ_2 Δ_3, where

Δ_1 = α[Γ(1/α + k + 1)/Γ(k + 1) − Γ(1/α + 1)],

Δ_2 = 1/[α((1 − (1−γ)^{1/k})^{−1/α} − 1)],   Δ_3 = M − θ_N.

If the conditions of Theorem 4.2.2 or 4.2.3 are satisfied, then α = n/β and therefore for n → ∞

Δ_1 = (n/β)[(Γ(β/n + k + 1) − Γ(k + 1))/Γ(k + 1) − (Γ(β/n + 1) − Γ(1))/Γ(1)] → Ψ(k + 1) − Ψ(1),

Δ_2 → 1/(−log(1 − (1−γ)^{1/k})).

It follows from the condition (b) that the representation (4.2.38) applies with α = n/β and thus

Δ_3 = M − θ_N ~ (N/c_0)^{−β/n},   N → ∞.

Combining the limit relations for Δ_1, Δ_2 and Δ_3, one obtains the desired result; this completes the proof of the lemma (and hence also of Theorem 4.2.5).
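For integer k, the difference Ψ(k + 1) − Ψ(1) equals the harmonic number H_k = 1 + 1/2 + ... + 1/k, so the constant c of Theorem 4.2.5 is easy to evaluate. The sketch below (an illustrative helper, not from the book; it assumes the form of c as reconstructed above) confirms that c > 1 for typical parameter values, so that N indeed grows exponentially with the dimension n.

```python
import math

def growth_base(k, gamma, eps, beta):
    """Base c in N ~ c0 * c**n (Theorem 4.2.5):
    c = [(Psi(k+1) - Psi(1)) / (-eps * log(1 - (1-gamma)**(1/k)))]**(1/beta).
    For integer k, Psi(k+1) - Psi(1) is the harmonic number H_k."""
    h_k = sum(1.0 / i for i in range(1, k + 1))   # Psi(k+1) - Psi(1)
    denom = -eps * math.log(1.0 - (1.0 - gamma) ** (1.0 / k))
    return (h_k / denom) ** (1.0 / beta)

# confidence level 1 - gamma = 0.95, k = 5 upper order statistics,
# target mean interval length eps = 0.1, quadratic behaviour at x* (beta = 2)
c = growth_base(5, 0.05, 0.1, 2.0)   # c > 1: exponential growth of N in n
```

Since c exceeds 1 for any reasonable choice of k, γ, ε, the uniform sampling algorithm is exponentially hard in the dimension.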

4.3 Branch and probability bound methods

This section describes a class of global random search algorithms applying the mathematical apparatus developed in Chapter 7 and in the preceding section. These algorithms are closely related, in terms of their philosophy, to the branch and bound methods reviewed in Section 2.3, and the key concept in their description is that of prospectiveness.

4.3.1 Prospectiveness criteria

A sub-additive set function φ: B → R, resulting from processing the outcomes of previous evaluations of the objective function and reflecting the possibility of locating the global optimizer in subsets, is referred to as a prospectiveness criterion. If for two sets Z_1, Z_2 ∈ B the inequality φ(Z_1) ≥ φ(Z_2) holds, then the location of the global optimizer in Z_1 is at least as probable as in Z_2, according to the prospectiveness criterion φ.
A number of set functions φ (assuming Z ∈ B) can be used as prospectiveness criteria; examples are:
a) φ(Z) is an estimate of the maximum

M_Z = max_{x∈Z} f(x);

b) φ(Z) is an estimate of the mean value

∫_Z f(x) ν(dx);

c) φ(Z) is an estimate of the minimum

min_{x∈Z} f(x);

d) φ(Z) is an upper confidence bound of a fixed confidence level for M_Z;
e) φ(Z) is an estimate of the probability P{M_Z ≥ M_k*}, where M_k* is the maximal among the already determined values of f.
Chapter 7 and the preceding section consider the construction of prospectiveness criteria a), d) and e); criteria b) and c) are mentioned in Section 4.1. The author's order of preference is as follows: e), d), a), b), c); thus, criterion e) or d) should be used whenever possible. In complicated situations (for instance, in the presence of random noise), however, they usually cannot be constructed, and one has to rely on a b)-type criterion, which can be constructed in rather general situations.
The estimates listed above can be constructed either from the results of evaluating the objective function, or after investigating estimates or approximations of f. As a rule, the derived estimates are probabilistic; deterministic estimates can be constructed only for certain (e.g. Lipschitz-type) functional classes, and correspond to the standard branch and bound approach treated earlier in Section 2.3.4.

4.3.2 The essence of branch and bound procedures

Branch and bound methods, used to advantage in various extremal problems, consist in rejecting some of the subsets of X that cannot contain an optimal solution, and searching only among the remaining subsets, which are regarded as promising. Branch and bound methods may be summarized as the successive implementation, at each iteration, of the following three stages:
i) branching (decomposition) of one or several sets into a tree of subsets and evaluation of the objective function values at points from the subsets;
ii) estimation of functionals that characterize the objective function over the obtained subsets (evaluation of subset prospectiveness criteria); and
iii) selection of subsets that are promising for further branching.
In standard versions of branch and bound methods, deterministic upper bounds for the maximum of f on subsets are used as subset prospectiveness criteria. In this way, all subsets Z are rejected whose upper bounds for M_Z do not exceed the current optimum estimate. Prospectiveness criteria in these methods can thus take only the two values 1 and 0, which mean that branching of a subset should go on or be stopped, respectively.
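For a Lipschitz objective on an interval, the three stages take an elementary form; the sketch below is an illustration of the standard deterministic scheme (not an algorithm from the book). The deterministic upper bound for the maximum of f over an interval of half-width h centred at c is f(c) + Lh, and an interval is rejected once this bound fails to exceed the current record.

```python
def lipschitz_max(f, a, b, L, tol=1e-3):
    """Standard branch and bound for max f over [a, b], |f(x)-f(y)| <= L|x-y|.
    Returns a value within tol of the true maximum."""
    record = f((a + b) / 2)
    queue = [(a, b)]
    while queue:
        lo, hi = queue.pop()
        c, h = (lo + hi) / 2, (hi - lo) / 2
        fc = f(c)
        record = max(record, fc)          # stage ii): evaluate, update record
        if fc + L * h > record + tol:     # stage iii): still promising?
            queue.append((lo, c))         # stage i): branch into halves
            queue.append((c, hi))
    return record
```

A rejected interval satisfies max f ≤ f(c) + Lh ≤ record + tol, so the returned record underestimates the true maximum by at most tol; intervals halve at each branching, which guarantees termination.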
In the following, consideration will be given to non-standard variants of branch and bound methods that will be referred to as branch and probability bound methods. Their distinctive feature is that the maximum estimates on the subsets are probabilistic (i.e. these estimates are valid with a high probability) rather than deterministic.

4.3.3 Principal construction of branch and probability bound methods

The methods under consideration are distinguished by (i) the organization of set branching, (ii) the kinds of prospectiveness criteria, and (iii) the rules for rejecting unpromising subsets.
Set branching depends on the structure of X and on the available software and computer resources. If X is a hyperrectangle, then it is natural to choose the same form for the branching sets Z_kj, where k is the iteration count and j is a set index. In the general case, simplicial sets, spheres, hyperrectangles and, sometimes, ellipsoids can naturally be chosen as the Z_kj. Two conditions are imposed on the choice of the Z_kj: their union should cover the domain of search, and the number of points where f is evaluated should be, in each set, sufficiently large for drawing proper statistical inference. There is no need for the sets Z_kj to be mutually disjoint for any fixed k.
Branching (decomposition) of the search domain can be carried out either a priori (i.e. independently of the values of f) or a posteriori. Numerical experience indicates that the second approach provides more economical algorithms. For example, the following branching technique has proved to be advantageous. At each iteration k, first select in the search domain X_k a subdomain Z_k1 with centre at the current optimum estimate. The point corresponding to the record of f over X_k \ Z_k1 is the centre of the subdomain Z_k2.
Similar subdomains Z_kj (j = 1, ..., L) are isolated until either X_k is covered, or the hypothesis that the global maximum can occur in the residual set

X_k \ ∪_{j=1}^{L} Z_kj

is rejected (the hypothesis can be verified by the procedure described in Section 4.2.3). In this way, the search domain X_{k+1} of the (k+1)-th iteration is either

Z^{(k)} = ∪_{j=1}^{L} Z_kj,

or a hyperrectangle covering Z^{(k)}, or a union of disjoint hyperrectangles covering Z^{(k)}. In the multidimensional case, the latter two ways induce more conveniently realizable variants of the branch and probability bound method than the first one.
In contrast to the standard branch and bound algorithms, in these methods the prospectiveness criterion may take any real value (e.g. for some criteria, the interval [0,1] can be the set of values) rather than only two values, say 0 and 1. For example, an estimate of M_Z can be used as the prospectiveness criterion; it is natural to take the upper confidence bound for M_Z of a fixed level 1 − γ as the prospectiveness criterion of Z (this bound can be computed through (4.2.25)). The prospectiveness criterion that will be discussed at the beginning of the following subsection (relying upon the procedure for testing a statistical hypothesis about M_Z) is both natural and easily computable.
Rules for rejecting unpromising sets may also be diverse. Under a reasonable organization of branching and use of the apparatus of the preceding section for constructing the prospectiveness criterion, there is no absolutely unpromising set Z. Narrowing of the search domain (i.e. rejection of subsets) may therefore occur only if a lower prospectiveness bound δ is defined: if φ(Z) ≤ δ, then the set Z may be regarded as unpromising and be rejected. Further, it is intuitively evident that the more promising a subset is, the greater the number of sample points that should be located in it. This can be ensured by taking, for example, the number of points in a subset to be directly proportional to the value of the prospectiveness criterion. Note that the extremal approach, in which all points are always located in the most promising set, is not very good and basically does not differ much from Algorithm 5.1.2. The following remark can also be taken into consideration when constructing the rejection rule. A reasonable prospectiveness criterion may depend not only on the function values over a given subset but also on those over the whole set X. A subset of medium prospectiveness can, therefore, become an unpromising one (and be rejected) due to the increase of the prospectiveness of other subsets. The following organization of the search algorithm is thus natural: if a subset Z was recognized at the k-th iteration as being of medium prospectiveness (i.e. δ < φ(Z) ≤ δ*, for a suitable pair δ* > δ), then one does not evaluate values of f in Z at several subsequent iterations and waits; perhaps Z will become unpromising and be rejected.
4.3.4 Typical variants of the branch and probability bound method


Below we shall consider one of the most natural and readily computable prospectiveness criteria.
Let M_k* be the largest value of f obtained so far, and let the search domain X_k be covered at each iteration k by the subsets Z_kj (j = 1, ..., L_k):

X_k ⊂ ∪_j Z_kj.

The prospectiveness criterion value on Z_kj is defined as follows:

(4.3.1)

where η_m and η_{m−i} are the respective elements of the order statistics corresponding to the sample {y_t = f(ξ_t), t = 1, ..., m}, i is much smaller than m, and the ξ_t are the independent realizations of a random vector on X that fall into Z_kj. The value π_kj can be treated in two ways: in the asymptotically worst case (for i = const, m → ∞) it is greater than or equal to, first, the probability that

M_{Z_kj} = sup_{x∈Z_kj} f(x) ≥ M_k*,

and, second, the probability of accepting the hypothesis that M_{Z_kj} ≥ M_k*, under the alternative that M_{Z_kj} < M_k*, provided that the hypothesis is true and that the hypothesis testing procedure is that of Section 7.3. In order to obtain (4.3.1) from (4.2.25) and (4.2.26), in which m, i, and M_k* are substituted for N, k, and M_0, it suffices to solve with respect to 1 − γ the inequality which is basic in those formulas, and to replace it by an equality.


In the algorithm below, the number of points at the k-th iteration in each promising subset Z_kj is (in the probabilistic sense) proportional to the value of the prospectiveness criterion φ_k(Z_kj); further branching is performed over those sets Z_kj whose values (4.3.1) are not less than given numbers δ_k.

Algorithm 4.3.1.
1. Set k = 1, X_1 = X, M_0* = −∞. Choose a distribution P_1.
2. Sampling the distribution P_k N_k times, obtain a random sample

Ξ_k = {x_1^{(k)}, ..., x_{N_k}^{(k)}}.

3. Evaluate f at the points of Ξ_k and put

M_k* = max{M_{k−1}*, max_{1≤j≤N_k} f(x_j^{(k)})}.

4. Organize the branching of X_k by representing this set as

X_k = ∪_j Z_kj,

where the Z_kj are measurable subsets of X having a sufficient number of points from Ξ_k for drawing statistical inference.
5. For each subset Z_kj, compute (not necessarily through (4.3.1)) the values of the prospectiveness criterion φ_k(Z_kj).
6. Put

X_{k+1} = ∪_j Z_kj*,

where

Z_kj* = Z_kj if φ_k(Z_kj) > δ_k,   Z_kj* = ∅ if φ_k(Z_kj) ≤ δ_k,

i.e. those subsets Z_kj for which φ_k(Z_kj) ≤ δ_k are rejected from the search domain X_k. Let L_k be the number of remaining subsets Z_kj.
7. Put

P_{k+1}(dx) = Σ_{j=1}^{L_k} p_j^{(k)} Q_kj(dx),   (4.3.2)

where p_j^{(k)} = 1/L_k and the Q_kj are the uniform distributions on the sets Z_kj (j = 1, ..., L_k). If φ_k is a nonnegative criterion, then we may also take the probabilities p_j^{(k)} proportional to the values φ_k(Z_kj).
8. Return to Step 2 (substituting k + 1 for k).


The closeness of M_k* to an estimate of M, or to the upper bound of the confidence interval (4.2.26), is a natural stopping rule for Algorithm 4.3.1. Another type of stopping rule is based on reaching a small (fixed) volume of the search domain X_k.
Of course, after terminating the algorithm, one may use a local ascent routine to refine the location of the global maximizer.
The distributions P_k in Algorithm 4.3.1 are sampled by means of the superposition method: first, the discrete distribution concentrated on {1, 2, ..., L_k} with probabilities p_j^{(k)} is sampled; this is followed by sampling the distribution Q_kτ, where τ is the realization obtained by sampling the discrete distribution. If the first procedure for choosing the probabilities p_j^{(k)} (i.e. p_j^{(k)} = 1/L_k) is used at Step 7 of Algorithm 4.3.1 and the sets Z_kj (j = 1, ..., L_k) are disjoint, then (4.3.2) is the uniform distribution over the set X_{k+1}.
All the points obtained at previous iterations that fall into the set X_k can be included into the collection Ξ_k at Step 2 of Algorithm 4.3.1: this improves the accuracy of the statistical procedures for determining the prospectiveness criterion value. If all the distributions P_k are uniform on X_k, then all the resulting samples are uniform on X_k as well; if P_{k+1} is of the form (4.3.2), then, in general, the distributions of the resulting samples are not uniform on X_{k+1}, but this fact is of no importance for drawing statistical inference (see Chapter 7). It is not necessary, of course, to store all the previous points and objective function values, because statistical inferences are made only using the points where f is relatively large; one does not even need to know the number of points within the domain.
After completion of Algorithm 4.3.1, one can apply (4.3.1) to the union of all rejected subsets in order to determine the probability of not missing the global maximum. The following should be taken into consideration. Let γ_kj = 1 − π_kj be the probabilities of missing the global maximum in the rejected sets Z_kj, as computed via (4.3.1). The total probability of missing the global maximum point, as determined by (4.3.1), is then at most max γ_kj. Indeed, let us take the set Z_{kj_0} which contains the point corresponding to the statistic η_m of the set

Z = ∪_{k,j} Z_kj.

Then η_{m−i} for Z is not less than η_{m−i} for Z_{kj_0}, and the overall record (which plays the role of M_k* for Z) is not less than M_k*. But π_kj defined via (4.3.1) is an increasing function of both η_{m−i} and M_k* for fixed α, i, and η_m.
The philosophy of constructing the branch and probability bound methods is closely related to that of the algorithms based on objective function estimation. However, in the methods under consideration only the functionals M_{Z_kj} are estimated, rather than the function f itself. If one assumes that in Algorithm 4.3.1 δ_k = −∞ for all k = 1, 2, ..., then one arrives at a variant of adaptive random search in which more promising domains are searched more thoroughly. Algorithm 4.3.1 is then a special case of Algorithm 5.2.1, in which f_k(x) for all the points of Z_kj is equal to φ_k(Z_kj) (if the subsets Z_kj are disjoint for fixed k).
If the evaluations of f are costly, then one should be extremely cautious in planning computations. After completing a certain number of evaluations of f, one should increase the amount of auxiliary computations with the aim of extracting and exploiting as much information about the objective function as possible. To extract this information one should: compute the probabilities (4.3.1); construct various estimates of M_{Z_kj}; test the hypothesis about the value of the tail index α of the extreme value distribution, which can indicate whether a vicinity of the global maximizer has been reached; estimate the c.d.f. F(t) for values of t close to M (this will enable one to draw a conclusion about the approximate effort required by the remaining computations); and, in addition, one can estimate f (and related functions of interest) in order to recheck and update the information. Decisions about the prospectiveness of subsets should be made by applying suitable statistical procedures. Naturally, these procedures can be taken into consideration only if the algorithms are realized in an interactive fashion.
Most of the assertions made in this section have a precise meaning only if the corresponding c.d.f. F(t) satisfies the condition (a) of Section 4.2 and the parameter α of the extreme value distribution is known. As for the condition (a), practice shows that it may always be regarded as met if the problem at hand is not too specific. In principle, statistical inference about α can be made via the procedure of Section 4.2.4, carried out successively as points are accumulated. However, it is advisable to use the results of Section 4.2.6 whenever possible, since the accuracy of the procedures for statistical inference about α is high only for large sample sizes. According to these results, in the case when X ⊂ R^n and the objective function f is twice continuously differentiable and approximated by a non-degenerate quadratic form in the vicinity of a global maximizer x*, one can always set α = n/2. In doing so, one may be confident that the statistical inferences are asymptotically true for subsets Z ⊂ X containing x* (together with subsets Z containing the maximizers

x_Z* = arg max_{x∈Z} f(x)

as interior points). The prospectiveness of other subsets Z may be lower than in the case of using their true values of α, but this will not crucially affect the methods in question.
Finally, let us describe a variant of the branch and probability bound method that is convenient for realization, uses most of the above recommendations, and has proved to be efficient for a wide range of practical problems.

Algorithm 4.3.2.
1. Set k = 1, X_1 = X, M_0* = −∞.
2. Sample N times the uniform distribution on the search domain X_k, obtaining Ξ_k = {x_1^{(k)}, ..., x_N^{(k)}}.
3. Evaluate f(x_j^{(k)}) for j = 1, ..., N. Put

M_k* = max{M_{k−1}*, max_{1≤j≤N} f(x_j^{(k)})}.

Check the stopping criterion (closeness of M_k* and the optimal linear estimator (4.2.15) for M with α = n/2).
4. Set Y_{k,0} = X_k, j = 1.
5. Set Y_{k,j} = Y_{k,j−1} \ Z_kj, where Z_kj is a cube (or a ball) of volume p μ_n(X_k) centred at the point having the maximal value of f among those points of Y_{k,j−1} at which the objective function has been evaluated.
6. If the number m of points in Y_{k,j} with known objective function values is insufficient for drawing statistical inference (i.e. m ≤ m_0), then set X_{k+1} = X_k and go to Step 9. If m > m_0, then compute φ_k(Y_{k,j}) by (4.3.1), where the order statistics η_m, η_{m−i} correspond to the values of f at the points of

Y_{k,j} ∩ ∪_{t=1}^{k} Ξ_t.

If φ_k(Y_{k,j}) ≤ δ, then go to Step 8.
7. Substitute j + 1 for j and go to Step 5.
8. Choose X_{k+1} as the union of disjoint hyperrectangles covering the set Z_k1 ∪ ... ∪ Z_kj.
9. Substitute k + 1 for k and return to Step 2.

Under the condition m_0 → ∞, Algorithm 4.3.2 converges with probability 1 − δ, i.e. it misses a global maximizer with probability not larger than δ: this follows from the results of Section 4.2.
Based on numerical experiments, the author proposes N = 100, m_0 = 15, p = 0.1, and i = min{5, [m/10]}, where [·] is the integer part operation, as the standard collection of parameters for Algorithm 4.3.2. (Other choices of the parameters, corresponding to the recommendations of Section 4.2, are possible as well.)
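The scheme of Algorithm 4.3.2 can be sketched as follows. Since criterion (4.3.1) requires the confidence-bound machinery of Section 4.2, the sketch below substitutes a crude stand-in prospectiveness estimate (the record over the residual set plus its gap to the i-th record value); the function names, the stand-in criterion, and the simplified covering at Step 8 (a bounding box of the removed cubes) are illustrative assumptions rather than the book's prescriptions.

```python
import random

def branch_probability_bound(f, bounds, n_iter=6, N=100, m0=15, p=0.1,
                             delta=0.0, seed=0):
    """Sketch of the iteration scheme of Algorithm 4.3.2 (maximization)."""
    rng = random.Random(seed)
    n = len(bounds)
    lo = [b[0] for b in bounds]
    hi = [b[1] for b in bounds]
    best_x, best_f = None, float("-inf")
    for _ in range(n_iter):
        # Steps 2-3: uniform sample on the current box, update the record M_k*
        pts = [[rng.uniform(lo[i], hi[i]) for i in range(n)] for _ in range(N)]
        vals = [f(x) for x in pts]
        for x, v in zip(pts, vals):
            if v > best_f:
                best_x, best_f = x, v
        # Steps 4-7: peel off cubes Z_kj of volume p*mu(X_k) around record points
        side = [(hi[i] - lo[i]) * p ** (1.0 / n) for i in range(n)]
        remaining = sorted(zip(vals, pts), key=lambda t: -t[0])
        kept = []
        while len(remaining) > m0:
            _, centre = remaining[0]
            kept.append(centre)
            remaining = [(v, x) for v, x in remaining
                         if any(abs(x[i] - centre[i]) > side[i] / 2
                                for i in range(n))]
            ordered = sorted(v for v, _ in remaining)
            i_top = min(5, len(ordered) // 10) or 1   # i = min{5, [m/10]}
            if len(ordered) <= i_top:
                break
            # stand-in for criterion (4.3.1): extrapolated residual maximum
            prospect = 2 * ordered[-1] - ordered[-1 - i_top]
            if prospect - best_f <= delta:            # residual set unpromising
                break
        # Step 8 (simplified): new search box = bounding box of removed cubes
        lo = [min(c[i] - side[i] / 2 for c in kept) for i in range(n)]
        hi = [max(c[i] + side[i] / 2 for c in kept) for i in range(n)]
    return best_x, best_f

x_best, f_best = branch_probability_bound(
    lambda z: -((z[0] - 0.3) ** 2 + (z[1] - 0.7) ** 2),
    [(0.0, 1.0), (0.0, 1.0)])
```

On this smooth test function the search box contracts around the maximizer and the record approaches the global maximum after a few iterations.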
4.4 Stratified sampling


As was pointed out in Section 3.1, global random search algorithms consist of iterations which involve random points sampled from some distributions. Independent random sampling is the simplest standard way of obtaining random points: naturally, it is equivalent to the use of an independent sample. The aim of the present section is to establish the possibility and the advantage of using a stratified sample, instead of an independent one, at each iteration of the algorithm in question.
The essence of stratified sampling is that the feasible region is divided into a number of disjoint subsets and random points are generated independently in each one. The construction of stratified samples is considered in Subsection 4.4.1, where the case of a hypercube feasible region is studied in detail.
Subsection 4.4.2 concerns statistical inference procedures for the maximum of a function based on its values at the points of a stratified sample. The procedures are essentially analogous to those considered in Section 4.2 for independent samples. The use of stratified sampling is a standard way of decreasing the variance of Monte Carlo estimators of integrals: for that problem, stratified sampling outperforms the corresponding independent sampling and is admissible for a wide range of functions. Subsection 4.4.3 establishes similar results for the global optimization problem.
The results of this section enable us to draw conclusions about a whole series of global random search algorithms. The main conclusion is that decreasing the randomness leads to improved efficiency (implying, of course, increased algorithmic complexity).
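The variance-reduction effect for integral estimation is easy to observe numerically; the following simulation sketch (illustrative, not from the book) compares the estimator of the integral of f over [0,1] based on m independent uniform points with the stratified estimator using one point per interval [j/m, (j+1)/m).

```python
import random

def mc_mean(f, pts):
    """Sample mean of f over the list of points pts."""
    return sum(f(x) for x in pts) / len(pts)

def compare_variances(f, m, reps=2000, seed=1):
    """Empirical variances of two estimators of integral_0^1 f(x) dx:
    m independent uniform points vs. a stratified sample, one point
    per interval [j/m, (j+1)/m)."""
    rng = random.Random(seed)
    est_ind, est_str = [], []
    for _ in range(reps):
        est_ind.append(mc_mean(f, [rng.random() for _ in range(m)]))
        est_str.append(mc_mean(f, [(j + rng.random()) / m for j in range(m)]))
    def var(xs):
        mu = sum(xs) / len(xs)
        return sum((x - mu) ** 2 for x in xs) / len(xs)
    return var(est_ind), var(est_str)

v_ind, v_str = compare_variances(lambda x: x * x, 10)
# for smooth f, stratification reduces the variance dramatically
```

For a smooth integrand the within-stratum variability is of a smaller order than the global variability, which is exactly the mechanism exploited for optimization later in this section.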

4.4.1 Organization of stratified sampling

Let P be a probability measure on (X, B) that is absolutely continuous with respect to the Lebesgue measure μ_n on X (in the important particular case P(dx) = μ_n(dx)/μ_n(X) it is the uniform distribution on X). Define Ω = X^N, Ξ = (x_1, ..., x_N), where x_i ∈ X for i = 1, ..., N. Assume that Ξ is a random vector on Ω, denote its distribution by Q(dΞ), and suppose that N = mℓ, where m and ℓ are integers.
Divide X into m disjoint subsets X_1, ..., X_m of the same P-measure:

X = ∪_{j=1}^{m} X_j,   X_j ∈ B,   P(X_j) = 1/m,   j = 1, ..., m.

The distribution P induces the distribution P_j on X_j defined by P_j(A) = mP(A ∩ X_j) for A ∈ B. These distributions are probability measures on (X, B), and if P is the uniform distribution on X, then P_j is the uniform distribution on X_j.
A stratified sample is a sample Ξ = {x_1, ..., x_N} that can be divided into m subsamples Ξ = Ξ_1 ∪ ... ∪ Ξ_m, where each Ξ_j (j = 1, ..., m) consists of ℓ independent realizations of the random vector with distribution P_j (i.e. Ξ_j is an independent sample from P_j). If the distributions P_j (j = 1, ..., m) are sampled sequentially, ℓ times each, then the distribution of the stratified sample Ξ is

Q(dΞ) = ∏_{j=1}^{m} P_j(dx_{ℓj−ℓ+1}) P_j(dx_{ℓj−ℓ+2}) ⋯ P_j(dx_{ℓj}).

In the particular case m = 1, the stratified sample is independent.
Sometimes it is convenient in practice to organize a stratified sample so that the arrangement of the elements of Ξ is random. Under sequential sampling of the random points x_i ∈ Ξ, this is attained by a uniform random choice of the distribution to be sampled among those of the distributions P_1, ..., P_m that have been sampled fewer than ℓ times.
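The construction described above can be sketched as follows for the unit cube divided into an axis-aligned grid of boxes (an illustrative helper; the names and the grid-based stratification are assumptions for the sketch):

```python
import random

def stratified_sample(m_per_axis, ell, seed=0):
    """Stratified sample on [0,1]^n: split the cube into m = m_1 * ... * m_n
    axis-aligned boxes of equal volume and draw `ell` independent uniform
    points in each box."""
    rng = random.Random(seed)
    n = len(m_per_axis)
    boxes = [[]]
    for m_i in m_per_axis:          # enumerate the boxes by grid coordinates
        boxes = [b + [j] for b in boxes for j in range(m_i)]
    sample = []
    for b in boxes:
        for _ in range(ell):        # ell uniform points inside box b
            sample.append([(b[i] + rng.random()) / m_per_axis[i]
                           for i in range(n)])
    return sample

pts = stratified_sample([4, 2], 3)  # m = 8 strata, ell = 3, N = m*ell = 24
```

Each of the m boxes receives exactly ℓ points, which is the defining property of the stratified sample Ξ = Ξ_1 ∪ ... ∪ Ξ_m.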
We shall now describe an economical way of organizing a stratified sample when the feasible region X is the cube X = [0,1]^n and the stratification consists of dividing X into hyperrectangles X_j.
Represent the number m as a product m = m_1 ⋯ m_n, where m_i is the number of intervals into which the cube X is divided along the i-th coordinate. Suppose that m_i = d^{p_i}, where d, p_1, ..., p_n are integers. In this case the intervals have length 1/m_i, and it is convenient to put them into correspondence with the ordered collections of p_i digits of the d-adic representation (i.e. the digits 0, 1, ..., d−1): this is connected with the fact that all points of each interval have the same first p_i digits. A random point from an interval is easily obtained by appending, on the right-hand side, further digits corresponding to the d-adic representation of a random number to the given collection of p_i digits.
The hyperrectangles X_j of volume 1/m correspond to ordered collections θ_j of p = p_1 + ... + p_n d-adic digits: each such collection θ = (u_1, ..., u_p) is naturally identified with the number θ = u_1 d^{p−1} + u_2 d^{p−2} + ... + u_p. Let us show how multiplicative random number generators should be used for obtaining pseudorandom numbers so that a stratified sample would possess the above-mentioned property of random arrangement of the X_j.
If m = 2^p, i.e. d = 2, then to get the numbers θ_j we may take, e.g., the generator

θ_j = A θ_{j−1} (mod 4m),

where A ≡ 5 (mod 8), θ_0 ≡ 1 (mod 4), and in each number θ_j only its first p binary digits are used. As follows from Ermakov (1975), p. 405, this generator has period length m. By means of this generator the sets X_1, ..., X_m are chosen pseudorandomly until all of them have been chosen. If ℓ > 1 and it is required to choose the sets X_1, ..., X_m pseudorandomly again, then the same generator can be used with a new initial value θ_0.
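In code, the d = 2 scheme looks as follows (an illustrative sketch): since every θ_j ≡ 1 (mod 4), keeping the first p binary digits of θ_j amounts to the shift θ_j >> 2, which maps the orbit of the generator bijectively onto {0, 1, ..., m − 1}.

```python
def box_order(p, A=5, theta0=1):
    """Pseudorandom enumeration of the m = 2**p boxes via the multiplicative
    generator theta_j = A * theta_{j-1} (mod 4m), with A = 5 (mod 8) and
    theta_0 = 1 (mod 4); the period is m and each box index appears once."""
    m = 2 ** p
    order, theta = [], theta0
    for _ in range(m):
        theta = (A * theta) % (4 * m)
        order.append(theta >> 2)     # first p binary digits of theta
    return order

order = box_order(4)                 # a pseudorandom permutation of 0..15
```

One full period enumerates every box exactly once, so assigning ℓ points per listed box yields a stratified sample with a pseudorandom arrangement of the strata.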
If m = 10^p, i.e. d = 10, then in order to obtain the numbers θ_j we may proceed as follows. Take a multiplicative generator v_j = A v_{j−1} (mod K), where A is a primitive root with respect to the modulus K, K ≥ m, K is the prime number nearest to m, and one can take v_0 = K − 1. This generator has period K − 1, i.e. it pseudorandomly produces K − 1 different numbers from the collection {1, 2, ..., K−1}. Those numbers v_j which exceed m should be omitted, and for the remaining ones we set θ_j = v_j − 1. The number of omitted values equals K − m − 1 and does not exceed m − 1 (as follows from the so-called Bertrand postulate).

4.4.2 Statistical inference for the maximum of a function based on its values at the points of a stratified sample

Let X be divided into m disjoint subsets X_jm (j = 1, ..., m) of equal volumes μ_n(X_jm) = μ_n(X)/m, and let there be given ℓ uniformly distributed points x_j1, ..., x_jℓ in each subset X_jm. Then the sample

Ξ_mℓ = {x_js: j = 1, ..., m; s = 1, ..., ℓ}   (4.4.1)

of size N = mℓ is stratified.
For the sake of simplicity we suppose that the measure P introduced above is uniform, the objective function f is continuous, and f attains its global maximum at a unique point x* = arg max f. Under these conditions we shall construct procedures of statistical inference for M = max f, knowing the values of f at the points of a stratified sample (4.4.1). The procedures are similar to those for independent samples given in Section 4.2.
Denote by X*_m the (unknown) set from the collection {X_1m, ..., X_mm} which contains the global maximizer x*; P*_m is the uniform distribution on X*_m; P_0 is the uniform distribution on X, i.e. P_0(dx) = μ_n(dx)/μ_n(X); x*_1, ..., x*_ℓ are the elements of Ξ_mℓ belonging to X*_m; η_{m,1} ≤ ... ≤ η_{m,ℓ} are the order statistics corresponding to the sample {y_{m,i} = f(x*_i), i = 1, ..., ℓ}; further on,

F*_m(t) = ∫_{f(x)<t} P*_m(dx)   (4.4.2)

is the c.d.f. of the random variables y_{m,i}, i = 1, ..., ℓ;

F(t) = ∫_{f(x)<t} P_0(dx)   (4.4.3)

is the c.d.f. (4.2.2) corresponding to the uniform distribution P_0 on X; θ_{m,ℓ} is the (1 − 1/ℓ)-quantile of F*_m(t), determined by the condition F*_m(θ_{m,ℓ}) = 1 − 1/ℓ; θ_{mℓ} is the (1 − 1/(mℓ))-quantile of the c.d.f. F(t), determined by F(θ_{mℓ}) = 1 − 1/(mℓ).
From the theoretical point of view, the case when the number m of strata of the set X tends to infinity while the number ℓ of points in each subset X_jm is constant is of most interest, and will be considered below.

The main asymptotic properties of the order statistics η_{m,ℓ−i}, i = 0, 1, ..., ℓ−1, are presented in the following.

Theorem 4.4.1. Let a functional set F consist of continuous functions f given on X such that the condition (a) of Section 4.2 holds for the c.d.f. (4.4.3), and the unique global maximizer x* has a certain distribution R on (X, B) equivalent to the Lebesgue measure μ_n on (X, B). Then for m → ∞, ℓ = const, with R-probability one, the following statements are valid:
a) the limit distribution of the sequence of random variables

(η_{m,ℓ} − M)/(M − θ_{mℓ})   (4.4.4)

has the c.d.f.

Φ_{α,ℓ}(u) = [max{0, 1 − (−u)^α/ℓ}]^ℓ for u < 0,   Φ_{α,ℓ}(u) = 1 for u ≥ 0;   (4.4.5)

b)

F*_m^{−1}(v) − M ~ −(M − θ_{mℓ})[ℓ(1 − v)]^{1/α}   (4.4.6)

for each v, 0 < v < 1;
c)

M − Eη_{m,ℓ−i} ~ Ψ(ℓ, 1/α)(M − θ_{mℓ}) b_i   (4.4.7)

for all i = 0, 1, ..., ℓ−1, where

b_i = Γ(1/α + i + 1)/Γ(i + 1),   (4.4.8)

Ψ(ℓ, u) = ℓ^u Γ(ℓ + 1)/Γ(ℓ + 1 + u),   u > 0;   (4.4.9)

d)

E(η_{m,ℓ−i} − M)(η_{m,ℓ−j} − M) ~ Ψ(ℓ, 2/α)(M − θ_{mℓ})² λ_{ij}   (4.4.10)

for all 0 ≤ j ≤ i ≤ ℓ − 1, where

λ_{ij} = Γ(2/α + i + 1) Γ(1/α + j + 1)/[Γ(1/α + i + 1) Γ(j + 1)],   i ≥ j.   (4.4.11)
IJ

Proof. It follows from the conditions listed above that, with R-probability one, x* does not fall on the boundary of the sets X_jm for any j ≤ m, m = 1, 2, ...; therefore we shall deal only with such events during the proof.
Consider the ℓ greatest values

y_(N) ≥ y_(N−1) ≥ ... ≥ y_(N−ℓ+1)

from the collection

{y_i = f(x_i), x_i ∈ Ξ_mℓ}.

Since f is continuous and its global maximum is attained at the unique point x*, for m → ∞ the points x_(N), ..., x_(N−ℓ+1) belong only to the set X*_m. Therefore, under the given conditions, the c.d.f. (4.4.2) and (4.4.3) are connected by the relationship

F*_m(t) = m[P_0{x: f(x) < t} − (m − 1)/m] = mF(t) − m + 1.   (4.4.12)

From this and from the definitions of the quantities θ_{m,ℓ} and θ_{mℓ} there follows

θ_{m,ℓ} = θ_{mℓ}.   (4.4.13)

Indeed, from the equality 1 − F*_m(θ_{m,ℓ}) = 1/ℓ and (4.4.12) we have

1 − mF(θ_{m,ℓ}) + m − 1 = 1/ℓ,   F(θ_{m,ℓ}) = 1 − 1/(mℓ),

which is equivalent to (4.4.13).


Since F(t) satisfies the condition (a) of Section 4.2, by virtue of Theorem 4.2.1 we have

F^{mℓ}(M + u(M − θ_{mℓ})) → exp{−(−u)^α} for u < 0.

Whence there follows the chain of relations

P{(η_{m,ℓ} − M)/(M − θ_{mℓ}) < u} = [F*_m(M + u(M − θ_{mℓ}))]^ℓ = [mF(M + u(M − θ_{mℓ})) − m + 1]^ℓ ≈

≈ [m exp{−(−u)^α/(ℓm)} − m + 1]^ℓ → [1 − (−u)^α/ℓ]^ℓ,   m → ∞.

The asymptotic relation obtained is equivalent to statement a) of Theorem 4.4.1. It may also be rewritten as

F*_m(t) ≈ 1 − ℓ^{−1}[(M − t)/(M − θ_{mℓ})]^α for t ≤ M.   (4.4.14)

But this implies (4.4.6), i.e. statement b) of Theorem 4.4.1.


Statement c) can be proved analogously to the proof of Lemma 7.1.2. The distribution density of the statistic η_{m,ℓ−i} equals

p_{m,ℓ−i}(t) = ℓ C_{ℓ−1}^{i} F*_m^{ℓ−i−1}(t) (1 − F*_m(t))^i F*_m'(t).

Thus

Eη_{m,ℓ−i} = ∫_{−∞}^{∞} t p_{m,ℓ−i}(t) dt = ℓ C_{ℓ−1}^{i} ∫_{−∞}^{∞} t F*_m^{ℓ−i−1}(t) (1 − F*_m(t))^i dF*_m(t).

Introducing the variable u = F*_m(t), we obtain

Eη_{m,ℓ−i} = ℓ C_{ℓ−1}^{i} ∫_0^1 F*_m^{−1}(u) u^{ℓ−i−1} (1 − u)^i du.

Furthermore, applying (4.4.6) we get

Eη_{m,ℓ−i} ~ ℓ C_{ℓ−1}^{i} [M ∫_0^1 u^{ℓ−i−1}(1 − u)^i du − (M − θ_{mℓ}) ℓ^{1/α} ∫_0^1 u^{ℓ−i−1}(1 − u)^{i+1/α} du] =

= M − Ψ(ℓ, 1/α)(M − θ_{mℓ}) b_i.

Statement c) has been proved. Finally, statement d) is proved in analogy with Lemma 7.1.3. The joint distribution density of the random variables η_{m,ℓ−i} and η_{m,ℓ−j} for i ≥ j equals

p_{m,ℓ−i,ℓ−j}(t, s) = A F*_m^{ℓ−i−1}(t) F*_m'(t) [F*_m(s) − F*_m(t)]^{i−j−1} F*_m'(s) (1 − F*_m(s))^j,   t ≤ s,

where for brevity

A = Γ(ℓ + 1)/[Γ(ℓ − i) Γ(i − j) Γ(j + 1)].

In this way,

E(η_{m,ℓ−i} − M)(η_{m,ℓ−j} − M) = ∫_{−∞}^{∞} ds ∫_{−∞}^{s} (t − M)(s − M) p_{m,ℓ−i,ℓ−j}(t, s) dt = I.

Changing variables by u = F*_m(t), v = F*_m(s) and applying (4.4.6) we get

I ~ A ℓ^{2/α} (M − θ_{mℓ})² I_1,   (4.4.15)

where

I_1 = ∫_0^1 dv ∫_0^v (1 − u)^{1/α} (1 − v)^{j+1/α} u^{ℓ−i−1} (v − u)^{i−j−1} du.

Reversing the order of integration in I_1 and using the beta-function property

∫_0^a t^{m−1}(a − t)^{k−1} dt = a^{m+k−1} B(m, k),   a, m, k > 0,

we have

I_1 = ∫_0^1 du (1 − u)^{1/α} u^{ℓ−i−1} ∫_u^1 (1 − v)^{j+1/α} (v − u)^{i−j−1} dv =

= ∫_0^1 du (1 − u)^{1/α} u^{ℓ−i−1} ∫_0^{1−u} ((1 − u) − r)^{j+1/α} r^{i−j−1} dr =

= B(j + 1 + 1/α, i − j) ∫_0^1 u^{ℓ−i−1} (1 − u)^{1/α} (1 − u)^{i+1/α} du =

= B(j + 1 + 1/α, i − j) B(i + 1 + 2/α, ℓ − i) =

= [Γ(j + 1 + 1/α) Γ(i − j)/Γ(i + 1 + 1/α)] · [Γ(i + 1 + 2/α) Γ(ℓ − i)/Γ(ℓ + 1 + 2/α)].

Substituting A and the expression obtained for I_1 into (4.4.15), we get (4.4.10). The theorem is proved completely.
It follows from the theorem that the objective function records corresponding to
stratified samples are asymptotically subject to probabilistic laws similar to those which
hold in the case of independent samples. It is easily seen that for ℓ → ∞ the mentioned
asymptotic laws coincide. In particular, Φ_{α,∞}(u) = Φ_α(u), where Φ_α and Φ_{α,ℓ} are
determined by (4.2.14) and (4.4.5), respectively. This implies that the limit distributions
of record values for independent and stratified samples coincide for m → ∞, ℓ → ∞.
The conditions of Theorem 4.4.1 are slightly stricter than condition (a) of Section
4.2: under condition (a), the record values for an independent sample are subject to
analogous asymptotic relations. The additional condition requires the existence of a
probability distribution R for x* which is equivalent to the Lebesgue measure on X. This
distribution can be regarded as a prior distribution for the maximizer, and thus the set 𝓕 of
objective functions is regarded as being stochastic. This requirement is needed to ensure
that only those events are considered in which x* does not hit the boundary of any set
X*_m. In practice, the number m is finite: thus the requirement is not very restrictive, the
more so as the distribution R does not occur in the formulas.
The availability condition for a prior distribution R for x* may be replaced by the more
explicit, but more restrictive, one:

   inf_{x ∈ X*_m} f(x) ≥ sup_{x ∈ X\X*_m} f(x).

In general, this condition can be satisfied only with some error. It is satisfied with high
accuracy in the case where Ξ is a Π_τ-grid (see Section 2.2.1). Although the hitting of
elements of a Π_τ-grid into the set X*_m cannot be considered as uniformly distributed in
the probabilistic sense, a stratified sample pattern is a good approximation for a Π_τ-grid.
To estimate the parameter M, we shall use linear estimators of the type

   M̂_{m,k} = Σ_{i=0}^{k} a_i η_{m,ℓ-i}                               (4.4.16)

where k ≤ ℓ - 1 and a_0, a_1, ..., a_k are real numbers defining the estimate.
From (4.4.7) it follows that under the conditions of Theorem 4.4.1 the relation

   E M̂_{m,k} - M·a′λ ~ -(M - ε_{mℓ}) Ψ(ℓ, 1/α) a′b,   m → ∞,        (4.4.17)

is valid, where

   λ = (1, 1, ..., 1)′,

the values b_i are defined by (4.4.8) and the function Ψ by (4.4.9). Now, it follows from
(4.4.10) that if the mentioned conditions are met, then

   (4.4.18)

holds, where Λ is a symmetric matrix of order (k+1)×(k+1) with elements λ_{ij} defined by
formula (4.4.11) for 0 ≤ j ≤ i ≤ k.
Since f is continuous, the c.d.f.'s F(t) and F*_m(t) are continuous, and therefore
ε_{mℓ} → M, M - ε_{mℓ} → 0 for m → ∞ and each integer ℓ. Applying now the Chebyshev
inequality

and (4.4.18), we can conclude that under the conditions of Theorem 4.4.1 the
estimators (4.4.16) converge to M·a′λ for m → ∞. Hence, the sequence of
estimators (4.4.16) is consistent if and only if the relation

   a′λ = Σ_{i=0}^{k} a_i = 1                                          (4.4.19)

coinciding with (4.2.16) is satisfied.


If the conditions of Theorem 4.4.1 are satisfied and α > 1, then for consistent estimators
M̂_{m,k} formulas (4.4.17) and (4.4.18) take the form

   E M̂_{m,k} - M ~ -(M - ε_{mℓ}) Ψ(ℓ, 1/α) a′b,                      (4.4.20)

(Note that (4.4.19) is called the consistency condition and

   a′b = Σ_{i=0}^{k} a_i Γ(1/α + i + 1)/Γ(i + 1) = 0                  (4.4.21)

is called the unbiasedness requirement.)
From (4.4.20) it follows that a natural criterion for the optimal selection of the parameters
a_i is the quantity a′Λa, which is to be minimized on the set of vectors a satisfying either
the restriction (4.4.19) alone or (4.4.19) together with (4.4.21). These optimization
problems are similar to those which will be treated in Section 7.1 for the case of
independent sampling. For instance, the vector a = a* determining the optimal consistent
linear estimator is determined by (4.2.18).
Confidence intervals for M can be constructed from consistent estimates of the type
(4.4.16) by applying the asymptotic inequality

   m → ∞,

which follows from (4.4.20) and the Chebyshev inequality. Another way of constructing
confidence intervals for M is based on the following statement.

Lemma 4.4.1. Let the conditions of Theorem 4.4.1 be fulfilled. Then the sequence of
statistics

   κ_m = (M - η_{m,ℓ}) / (η_{m,ℓ} - η_{m,ℓ-1})

converges in distribution, for m → ∞, to a random variable with the c.d.f.
F_1(u) = (u/(1+u))^α, u > 0.

Proof. For the order statistics η_{m,ℓ-i}, i = 0, 1, ..., ℓ-1, an analogue of the Rényi
representation

is valid. Here ξ_0, ..., ξ_i are independent random variables distributed exponentially with
the density e^{-u}, u > 0. Combining this with (4.4.6), we have

   M - η_{m,ℓ-i} ~ (M - ε_{mℓ}) [ℓ(1 - exp{-Σ_{j=0}^{i} ξ_j/(ℓ-j)})]^{1/α},   m → ∞.   (4.4.22)

For brevity write w = (1 + 1/u)^α and introduce

   D_2 = {(x_2, y_2): 0 < x_2 ≤ 1, 0 < y_2 ≤ 1, x_2(w - y_2) > w - 1},

   D_3 = {(x, y): 0 < x ≤ 1, 0 < y ≤ 1, y < wx + 1 - w}.

Taking into account that w > 1, applying the asymptotic equation (4.4.22) for i = 0, 1 and
changing the variables

   x_1 = x/ℓ,   y_1 = y/(ℓ-1),
   x_2 = e^{-x_1},   y_2 = e^{-y_1},
   x = x_2,   y = x_2 y_2,

we obtain for m → ∞ the relations

   lim Pr{κ_m < u} = ℓ(ℓ-1) ∫∫_{D_2} x_2^{ℓ-1} y_2^{ℓ-2} dx_2 dy_2 = ℓ(ℓ-1) ∫∫_{D_3} y^{ℓ-2} dx dy =

      = ℓ(ℓ-1) ∫_{1-1/w}^{1} ( ∫_0^{wx+1-w} y^{ℓ-2} dy ) dx =

      = ℓ ∫_{1-1/w}^{1} (wx + 1 - w)^{ℓ-1} dx = 1/w = (u/(u+1))^α.

The lemma is proved.


It follows from Lemma 4.4.1 that the way of constructing a confidence interval for M
from the two maximal order statistics is identical to that used in the case of an independent
sample. E.g., a one-sided confidence interval of confidence level 1-γ can be chosen as

   [ η_{m,ℓ},  η_{m,ℓ} + (η_{m,ℓ} - η_{m,ℓ-1}) / ((1-γ)^{-1/α} - 1) ].

It follows from (4.4.7) that the mean length of this confidence interval for m → ∞
asymptotically equals

   (4.4.23)

Note that this quantity is 1/Ψ(ℓ, 1/α) times smaller than the asymptotic mean length of the
analogous interval constructed from an independent sample. It also follows from
Lemma 4.4.1 that the statistical hypothesis testing procedures for M using the two maximal
order statistics coincide for stratified and independent samples. (We remark that an attempt
to generalize Lemma 4.4.1 to the case k > 1 failed; therefore the corresponding confidence
intervals and statistical hypothesis tests for k > 1 were not constructed for the case of
stratified sampling.)
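For orientation, the interval endpoint above is straightforward to compute. The sketch below (the function name and the example values are ours, not the book's) solves (u/(1+u))^α = 1-γ for the multiplier applied to the gap between the two largest order statistics:

```python
import math

# One-sided level-(1-gamma) confidence interval for M from the two largest
# order statistics, using the limit law F1(u) = (u/(1+u))**alpha of Lemma 4.4.1.
def confidence_interval_for_M(eta_top, eta_second, alpha, gamma):
    # u solves (u/(1+u))**alpha = 1-gamma, i.e. u = 1/((1-gamma)**(-1/alpha) - 1)
    u = 1.0 / ((1.0 - gamma) ** (-1.0 / alpha) - 1.0)
    return eta_top, eta_top + u * (eta_top - eta_second)
```

The multiplier u is exactly the (1-γ)-quantile of F_1, so the interval covers M with asymptotic probability 1-γ.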

4.4.3 Dominance of stratified over independent sampling

We shall show first that stratified sampling dominates the independent one when the
quality criterion is the record value (the current optimum estimate) of the objective function.
Let 𝓕 = C(X) be the set of continuous functions on X and P be a probability measure
on (X, 𝔅), absolutely continuous with respect to the Lebesgue measure. Define

   K[f] = max_{x_i ∈ Ξ} f(x_i).

Let N = mℓ, where m and ℓ are natural numbers, and let a vector Ξ be chosen in D at
random in accordance with a distribution Q(dΞ). We shall call an ordered pair π = (K[f], Q)
a global random search procedure for optimizing f ∈ 𝓕 on X.
168 Chapter 4

According to the general concept of domination, we shall say that a procedure π
dominates a procedure π_0 in 𝓕 if Φ_f(π) ≥ Φ_f(π_0) for all f ∈ 𝓕 and a function g ∈ 𝓕 exists
such that Φ_g(π) > Φ_g(π_0). Here Φ_f(π) is a quality criterion, chosen below as

   Φ_f(π) = Pr{K[f] ≥ t},   t ∈ (min f, max f).                       (4.4.24)

Note that this is a multiple criterion; the strict inequality

implies that in the probabilistic sense a record K[f] of a function f is closer to M = max f
when the procedure π is used than when π_0 is.
Denote by π_m = (K_m[·], Q_m) a random search procedure based on stratified sampling
with m strata X_1, ..., X_m and ℓ random points in each stratum. The procedure π_1
corresponds to independent sampling from X. Compare the procedures π_m and π_1,
applying the criterion Φ_f(π).

Theorem 4.4.2. The procedure π_m for m > 1 dominates the procedure π_1 in C(X), that
is

   (4.4.25)

for all f ∈ C(X), t ∈ (-∞, ∞), and there exists a function g ∈ C(X) such that

   for all t ∈ (min g, max g).

Proof. Denote by A_t = f^{-1}{(-∞, t)} the inverse image of the set (-∞, t) under the
mapping f, and let

   η_i^{(0)} = max_{x_j ∈ Ξ ∩ X_i} f(x_j)

be the record of f in X_i. We have



where

   β_i = P_i(A_t) = m P(A_t ∩ X_i)   for i = 1, ..., m,

and the P_i are defined as in Section 4.4.1.
Since {X_1, ..., X_m} is a complete set of events, we have

   P(A_t) = Σ_{i=1}^{m} P(A_t ∩ X_i).

Thus

The inequality (4.4.25) now follows from the classical inequality between the arithmetic
and geometric means,

   (1/m) Σ_{i=1}^{m} β_i ≥ ( Π_{i=1}^{m} β_i )^{1/m}.                 (4.4.26)

We know that (4.4.26) is valid for arbitrary nonnegative numbers β_1, ..., β_m and
becomes an equality only in the case β_1 = ... = β_m; we shall show that there exists a
function f ∈ C(X) for which not all β_1, ..., β_m are equal (consequently, the inequalities
(4.4.25) and (4.4.26) are strict). Choose a function f that is not equal to a constant, but
identically equals min f on the set X_1. For such a function f, for each t ∈ (min f, max f)
one has

   P(A_t) = Σ_{i=1}^{m} P(A_t ∩ X_i) = (1/m) Σ_{i=1}^{m} β_i < 1.

Therefore some of the quantities β_i differ and the inequality (4.4.25) is strict for each
t ∈ (min f, max f). The proof is completed.
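The dominance asserted by Theorem 4.4.2 is easy to observe numerically. The following sketch uses a toy setup of our own (X = [0, 1], f(x) = x, m = N strata with one point each, i.e. ℓ = 1) and compares mean record values; the theorem's criterion Pr{K[f] ≥ t} is stronger, but the mean record already shows the effect:

```python
import random

# Compare the record of N i.i.d. uniform points on [0, 1] with the record of a
# stratified sample: one uniform point in each stratum [i/N, (i+1)/N).
# The objective is simply f(x) = x, so the record is the largest sampled point.

def record_iid(N, rng):
    return max(rng.random() for _ in range(N))

def record_stratified(N, rng):
    return max((i + rng.random()) / N for i in range(N))

def compare(N=10, reps=20000, seed=0):
    rng = random.Random(seed)
    iid = sum(record_iid(N, rng) for _ in range(reps)) / reps
    strat = sum(record_stratified(N, rng) for _ in range(reps)) / reps
    return iid, strat
```

With these choices the exact means are N/(N+1) ≈ 0.909 for the independent sample and 1 - 1/(2N) = 0.95 for the stratified one, so the stratified record is systematically closer to M = 1.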
It should be noted that the above result does not use asymptotic extreme value theory
and is rather general, i.e. valid under nonrestrictive assumptions. Moreover, it follows
from the proof of the theorem that the set of continuous functions may be replaced by
other, narrower classes of functions, e.g. by

The following results on the domination of stratified over independent sampling are
based on Theorem 4.4.1, together with the next statement.

Lemma 4.4.2. The function Ψ(ℓ, v) determined by formula (4.4.9) is, for each v > 0,
strictly increasing in ℓ for ℓ ≥ 1; moreover, Ψ(1, v) < 1 and

   lim_{ℓ→∞} Ψ(ℓ, v) = 1.

Proof. We have

   Ψ(1, v) = 1/Γ(2 + v) < 1

and, by the Stirling formula,

   lim_{ℓ→∞} Ψ(ℓ, v) = lim_{ℓ→∞} ℓ^v Γ(ℓ + 1)/Γ(ℓ + 1 + v) = 1.

Let us show finally that the function Ψ(ℓ, v) is strictly increasing in ℓ, i.e. for all
v > 0, ℓ = 1, 2, ... the inequality

   Ψ(ℓ + 1, v)/Ψ(ℓ, v) > 1

holds. Indeed,

   Ψ(ℓ + 1, v)/Ψ(ℓ, v) = (1 + 1/ℓ)^v / (1 + v/(ℓ + 1)),

   ψ(v) = log [Ψ(ℓ + 1, v)/Ψ(ℓ, v)] = v log(1 + 1/ℓ) - log(1 + v/(ℓ + 1)),

   ψ′(v) = log(1 + 1/ℓ) - 1/(ℓ + 1 + v) > 0.

Therefore the function ψ(v) is monotone increasing for v > 0 and ψ(0) = 0, whence
ψ(v) > 0 for each v > 0. It now follows that

   Ψ(ℓ + 1, v)/Ψ(ℓ, v) > 1.

The lemma is proved.
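The function Ψ is easy to evaluate numerically. The closed form ℓ^v Γ(ℓ+1)/Γ(ℓ+1+v) used below is read off from the Stirling-limit step of the proof (formula (4.4.9) itself is not reproduced in this excerpt), so treat this as a sketch:

```python
import math

# Psi(l, v) = l**v * Gamma(l+1) / Gamma(l+1+v), computed in log scale to
# avoid overflow; Lemma 4.4.2 states Psi(1, v) = 1/Gamma(2+v) < 1, strict
# monotone increase in l, and Psi -> 1 as l -> infinity.
def Psi(l, v):
    return math.exp(v * math.log(l) + math.lgamma(l + 1) - math.lgamma(l + 1 + v))
```

For v = 1 this reduces to Ψ(ℓ, 1) = ℓ/(ℓ+1), which makes the monotone approach to 1 explicit.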


Based on Theorem 4.4.1 and Lemma 4.4.2, we shall show below that for m → ∞ the
random search procedure π_m using a stratified sample dominates the corresponding
procedure π_1 that uses an independent sample, with respect to a criterion Φ_f(π)
constructed on the basis of the k+1 (0 ≤ k ≤ ℓ-1) records of f.
Set

   K[f] = (K^{(0)}[f], ..., K^{(k)}[f])

where K^{(i)}[f] = η_{N-i} and η_1 ≤ ... ≤ η_N are the order statistics corresponding to the
sample Y = {y_i = f(x_i), x_i ∈ Ξ}.
For comparing global random search strategies we shall choose the vector criterion

   (4.4.27)

where 0 ≤ k ≤ ℓ-1.

Proposition 4.4.1. If the conditions of Theorem 4.4.1 hold, then the procedure π_m
dominates the procedure π_1 with R-probability one.

The proof consists in using Lemma 7.1.2 (proved in Chapter 7) and (4.4.7), from
which it follows that with R-probability one the asymptotic relation

   (M - E K_m^{(i)}[f]) / (M - E K_1^{(i)}[f]) ~ Ψ(ℓ, 1/α)            (4.4.28)

holds, and in applying Lemma 4.4.2, according to which the right-hand side of (4.4.28)
is less than 1.
From Proposition 4.4.1 it follows that with stratified sampling the records of
the objective function are closer to the value M = max f and, thus, more accurate
statistical inference for M can be constructed from them. As follows from (4.4.28), the
best sampling with respect to the criterion (4.4.27) is stratified sampling with ℓ = 1, i.e.
with the maximal degree of stratification. This finding suggests that decreasing
randomness, in general, improves the efficiency of a global random search procedure.
Other quality criteria for global random search procedures π are possible; let us give
three further such criteria and obtain for them results analogous to Proposition 4.4.1.
As K[f] we now choose a consistent linear estimate of the parameter M,
constructed on the basis of the k+1 (0 ≤ k ≤ ℓ-1) maximal order statistics η_N, ..., η_{N-k}
with fixed coefficients a_0, ..., a_k satisfying the condition (4.4.19), i.e.

   K[f] = Σ_{i=0}^{k} a_i η_{N-i}.                                    (4.4.29)

If we select the bias of (4.4.29) as Φ_f(π), i.e.

   (4.4.30)

then by virtue of Lemma 7.1.5 and (4.4.17), under the conditions of Theorem 4.4.1 and
the consistency condition (4.4.19), with R-probability 1 the asymptotic relation

   (M - E K_m[f]) / (M - E K_1[f]) ~ Ψ(ℓ, 1/α),   m → ∞,              (4.4.31)

holds, analogously to (4.4.28); its consequences are identical to those of Proposition
4.4.1, replacing the criterion (4.4.27) by (4.4.30).
Now let K[f] again be defined by (4.4.29) and let the criterion Φ be the mean square
error of the estimator (4.4.29), i.e.

   Φ_f(π) = E(K[f] - M)².

Then, by virtue of (4.2.17) and (4.4.20), under the conditions of Theorem 4.4.1, there
holds

   m → ∞,

with R-probability one. Due to Lemma 4.4.2, Ψ(ℓ, 2/α) < 1 and thus the procedure π_m
dominates the procedure π_1 according to this criterion as well.
Consider, finally, as K[f] the confidence interval of confidence level 1-γ

for M, and its mean length as Φ_f(π). By virtue of Lemma 7.1.2 and (4.4.23), if the
conditions of Theorem 4.4.1 are met, then with R-probability 1 the relation

   m → ∞,

is valid. Its consequences are identical to those formulated in Proposition 4.4.1 (with the
indicated replacement of the criterion Φ_f(π)).
Some results of this subsection (namely, Theorem 4.4.2 and relation (4.4.31)) were
formulated in Ermakov et al. (1988), which also contains the following formulation of a
result concerning the admissibility of stratified sampling.

Proposition 4.4.2. Let K[f], Φ_f(π) and 𝓕 be the same as in Theorem 4.4.2, and let 𝒬 be
the set of all probability measures on D = X^N, m = N, ℓ = 1. Then the procedure
π_N = (K[f], Q_N) corresponding to stratified sampling with the maximal stratification
number is admissible for the functional class 𝓕 in the set Π = {π = (K[f], Q), Q ∈ 𝒬} of all
global random search procedures; in other words, there is no procedure π ∈ Π such that π
dominates π_N.
The proposition states that the global random search methods using stratified sampling
cannot be improved for all functions f ∈ 𝓕 simultaneously.
This result is similar to that on the admissibility of Monte Carlo estimates of
integrals, see Ermakov (1975). For brevity, we shall not give a complete proof of the
above proposition but present only its main ideas.
The proof of Proposition 4.4.2 uses the proof of Theorem 4.4.2 and contains two
stages. The first shows that the distribution Q(dΞ) of a procedure π that possibly
dominates the procedure π_N has uniform marginal distributions of the components of the
vector Ξ, and that this distribution may be chosen symmetric. At the second stage, from
the supposition that there exists f ∈ 𝓕 such that for a symmetric Q(dΞ) with uniform
marginal distributions of the components the probability Pr{K[f] ≥ t} is strictly larger than
the corresponding probability for ℓ = 1, m = N and some t, the existence of g ∈ 𝓕 such
that for each t ∈ (min g, max g) the inverse strict inequality holds is deduced. For details,
see Ermakov et al. (1988).

4.5 Statistical inference in random multistart


Random multistart is the global optimization method consisting of several local searches
starting at random initial points. It is the best-known representative of the multistart
technique described in Section 2.2.2. Random multistart is inefficient in its pure form,
since it may waste much effort on repeated ascents (resp. descents). But some of its
modifications, using cluster analysis procedures to prevent repeated ascents, are rather
efficient; they were already discussed in Section 2.2.2. This section follows mainly
Zielinski (1981) and presents statistical inference in random multistart methods that can
be useful for controlling it as well as its modifications.

4.5.1 Problem statement

Let f be a continuous function on X ⊂ R^n, ℓ be the unknown number of local maximizers
x_1*, ..., x_ℓ* of f (which are supposed to be isolated), P be a probability measure on (X, 𝔅)
(e.g. P(dx) = μ_n(dx)/μ_n(X)), and 𝒜 be a local ascent algorithm. We shall write
𝒜(x) = x_i* for x ∈ X if, whenever started at the initial point x, the algorithm 𝒜 leads to
the local maximizer x_i*.
Put θ_i = P(X_i*) for i = 1, ..., ℓ, where

   X_i* = {x ∈ X: 𝒜(x) = x_i*}

is the region of attraction of x_i*. The value θ_i will be referred to as the share of the i-th
local maximizer x_i* (with respect to the measure P). It is evident that

   θ_i > 0 for i = 1, ..., ℓ   and   Σ_{i=1}^{ℓ} θ_i = 1.

A random multistart method is constructed as follows. An independent sample
Ξ = {x_1, ..., x_N} from the distribution P is generated and the local optimization algorithm
𝒜 is sequentially applied at each x_j ∈ Ξ. Let N_i be the number of points x_j belonging to
X_i* (i.e. N_i is the number of ascents to x_i* among 𝒜(x_j), j = 1, ..., N). By definition

   N_i ≥ 0,  i = 1, ..., ℓ,   Σ_{i=1}^{ℓ} N_i = N,

and the random vector (N_1, ..., N_ℓ) follows the multinomial distribution

   Pr{N_1 = n_1, ..., N_ℓ = n_ℓ} = (N; n_1, ..., n_ℓ) θ_1^{n_1} ⋯ θ_ℓ^{n_ℓ},

where

   (N; n_1, ..., n_ℓ) = N!/(n_1! ⋯ n_ℓ!),   Σ_{i=1}^{ℓ} n_i = N,   n_i ≥ 0  (i = 1, ..., ℓ).

The problem is to draw statistical inference on the number ℓ of local maximizers, the
parameter vector θ = (θ_1, ..., θ_ℓ), and the number N* of trials that guarantees with a given
probability that all local maximizers are found.
A main difficulty is that ℓ is usually not known. If an upper bound for ℓ is
known, then standard statistical methods can be applied; the opposite case is more difficult
and the Bayesian approach is applied.
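As an illustration of the scheme above (everything here — the double-well test function, the crude fixed-step ascent playing the role of 𝒜, and the rounding used to cluster limit points — is our own toy choice, not from the book), random multistart estimates the shares θ_i by the observed frequencies N_i/N:

```python
import random

def f_prime(x):
    # derivative of the toy objective f(x) = -(x**2 - 4)**2, which has two
    # local (and global) maximizers at x = -2 and x = 2
    return -4.0 * x * (x * x - 4.0)

def ascend(x, step=0.01, iters=300):
    # crude fixed-step gradient ascent standing in for the local algorithm A
    for _ in range(iters):
        x += step * f_prime(x)
    return round(x, 1)  # cluster the converged points

def multistart_shares(N=400, seed=7):
    # sample N starting points from P = uniform on [-5, 5], ascend from each,
    # and return the empirical shares N_i / N of the located maximizers
    rng = random.Random(seed)
    counts = {}
    for _ in range(N):
        m = ascend(rng.uniform(-5.0, 5.0))
        counts[m] = counts.get(m, 0) + 1
    return {m: c / N for m, c in counts.items()}
```

With P uniform on [-5, 5], the two regions of attraction are the negative and positive half-intervals, so both estimated shares should come out near 1/2.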

4.5.2 Bounded number of local maximizers

Let U be a known upper bound for ℓ and N ≥ U. Then (N_1/N, ..., N_ℓ/N) is the standard
minimum variance unbiased estimate of θ, where N_i/N is the estimator of the share of the
i-th local maximizer x_i*. Of course, for all N and ℓ > 1 it may happen that N_i = 0 but
θ_i > 0. So the above estimator nondegenerately estimates only those shares θ_i for which
N_i > 0.
Let W be the number of the N_i's which are strictly positive. Then for given ℓ and
θ = (θ_1, ..., θ_ℓ) we have

For instance, the probability that all local searches will lead to a single local maximizer is
equal to

   Pr{W = 1 | θ} = Σ_{i=1}^{ℓ} θ_i^N;

furthermore, the probability that all local maxima will be found equals

   Pr{W = ℓ | θ} = Σ_{n_1+⋯+n_ℓ=N, n_i>0} (N; n_1, ..., n_ℓ) θ_1^{n_1} ⋯ θ_ℓ^{n_ℓ}.   (4.5.1)

The probability (4.5.1) is small if (at least) one of the θ_i's is a small number, even for
large N. On the other hand, for any ℓ and θ we can find N* such that for a given q ∈ (0,1)
we will have Pr{W = ℓ | θ} ≥ q for all N ≥ N*. The problem of finding N* = N*(q, θ)
means finding the (minimal) number of points in Ξ such that the probability that each local
maximizer is found is at least q.
Set

and note that

Hence the problem of finding N*(q, θ) is reduced to that for N*(q, θ*), where
θ* = (ℓ^{-1}, ..., ℓ^{-1}). The latter is easy to solve for large N, as

   Pr{W = ℓ | θ*} = Σ_{i=0}^{ℓ} (-1)^i C_ℓ^i (1 - i/ℓ)^N ~ exp{-ℓ exp{-N/ℓ}},   N → ∞.

Solving the inequality exp(-ℓ exp(-N/ℓ)) ≥ q with respect to N, we find that

   N*(q, θ*) ≈ ℓ log(ℓ/log(1/q)).

For instance, for q = 0.9 and ℓ = 2, 5, 10, 100 and 1000 the values of N*(q, θ*) are equal
to 6, 19, 46, 686 and 9159, respectively. (Of course, N*(q, θ*) is greater than ℓ.)
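The asymptotic rule is immediate to compute; the ceiling rounding below is our reading of how the quoted values were obtained (for ℓ = 5 it yields 20, one more than the value quoted above):

```python
import math

# N*(q, theta*) from solving exp(-l*exp(-N/l)) >= q:  N >= l*log(l/log(1/q)).
def n_star(q, l):
    return math.ceil(l * math.log(l / math.log(1.0 / q)))
```

For example, n_star(0.9, 1000) reproduces the figure of roughly nine thousand trials needed to see all 1000 equally likely maximizers with probability 0.9.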

4.5.3 Bayesian approach

Let there be given prior probabilities α_j (j = 1, 2, ...) of the events that the number ℓ of
local maximizers of f equals j, together with conditional prior measures λ_j(dθ_j) for the
parameter vector θ_j = (θ_1, ..., θ_j) under the condition ℓ = j. We shall assume that the
measures λ_j(dθ_j) are uniform on the simplices

   Θ_j = {(θ_1, ..., θ_j): θ_i > 0,  Σ_{i=1}^{j} θ_i = 1}.

Thus the parameter set Θ, on which the unknown parameter vector θ = (θ_1, ..., θ_ℓ) can
take its values, has the form

   Θ = ∪_{j=1}^{∞} Θ_j,

and the prior measure λ(dθ) on Θ for θ equals

   (4.5.2)

It is natural to assume that λ is a probability measure.


Let d = d(N_1, ..., N_W) be an estimate of ℓ. The estimate

is called the optimal Bayesian estimate of ℓ; after some calculations it can be simplified to

   d* = arg max_{j ≥ W} α_j Q(j, W, N),                               (4.5.3)

where

   Q(j, W, N) = C_j^W Γ(j)/Γ(N + j).

Applying a quadratic loss function, the optimal Bayesian estimate for the total P-measure
of the domains of attraction of the hidden ℓ - W local maximizers (i.e. of the sum of the
θ_i's corresponding to the hidden maximizers) is given by

   Σ_{j=W}^{∞} [(j - W)/(N + j)] α_j Q(j, W, N) / Σ_{j=W}^{∞} α_j Q(j, W, N).

The optimal Bayesian procedure for testing the hypothesis H_0: ℓ = W against the
alternative H_1: ℓ > W is constructed in a similar way. According to it, H_0 is accepted if

otherwise H_0 is rejected. Here c_{01} is the loss arising from accepting H_0 when H_1 is
valid, and c_{10} is the analogous loss due to accepting the hypothesis H_1 when it is
false.
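In code, (4.5.3) is a one-line maximization once Q is computed in log scale; the geometric prior α_j ∝ 2^{-j} and the truncation at j_max below are our illustrative choices, not the book's:

```python
import math

def log_Q(j, w, n):
    # log of Q(j, W, N) = C(j, W) * Gamma(j) / Gamma(N + j)
    return (math.lgamma(j + 1) - math.lgamma(w + 1) - math.lgamma(j - w + 1)
            + math.lgamma(j) - math.lgamma(n + j))

def d_star(w, n, log_alpha=lambda j: -j * math.log(2.0), j_max=200):
    # optimal Bayesian estimate (4.5.3): argmax over j >= W of alpha_j * Q(j, W, N)
    return max(range(w, j_max + 1), key=lambda j: log_alpha(j) + log_Q(j, w, n))
```

With W = 3 maximizers observed in N = 100 local searches, the estimate is d* = 3 under this prior: the large sample makes additional hidden maximizers unlikely.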
The above statistical procedures were numerically investigated in Betró and Zielinski
(1987). In some works (see Zielinski (1981), Boender and Zielinski (1985), Boender and
Rinnooy Kan (1987)) the procedures were thoroughly investigated and modified also for
the case of equal prior probabilities α_j (i.e. for the case α_j = 1, j = 1, 2, ...). They are not
presented here, since the equal prior probabilities assumption contradicts the finiteness of
the measure (4.5.2) and is somewhat peculiar in the global optimization context (for
instance, according to it, the prior probabilities of f having 2 or 10^10 local maximizers
coincide). Instead of these, we shall present below the following result.

Proposition 4.5.1 (Snyman and Fatti (1987)). Let the condition

   θ* = max_{1 ≤ j ≤ ℓ} θ_j

hold, where θ* is the share of the global maximizer, and let the prior distribution for θ* be
a beta distribution. Then the inequality

   Pr{M_N* = M} ≥ 1 - Γ(N)Γ(2N - r + 1)/[Γ(2N)Γ(N - r + 1)]

holds for the probability of the event that the record value M_N* in the above described
random multistart method equals M = max f; here r is the number of sample points x_i ∈ Ξ
falling into the region of attraction of the maximizer with the function value M_N*.
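The bound is convenient to evaluate through log-gamma functions (the function name is ours); note that r = 1 gives the trivial bound 0, while r = N gives a bound very close to 1:

```python
import math

# Lower bound of Proposition 4.5.1:
# Pr{M_N* = M} >= 1 - Gamma(N)*Gamma(2N-r+1) / (Gamma(2N)*Gamma(N-r+1)),
# computed via lgamma to avoid factorial overflow.
def snyman_fatti_bound(n, r):
    return 1.0 - math.exp(math.lgamma(n) + math.lgamma(2 * n - r + 1)
                          - math.lgamma(2 * n) - math.lgamma(n - r + 1))
```

The bound increases in r: the more of the N sample points that ascend to the best maximizer found, the more confident one can be that it is the global one.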

4.6 An approach based on distributions neutral to the right


This section follows mainly Betró (1983, 1984) and describes statistical inference for
quantiles of the c.d.f. (4.2.2), i.e.

   F(t) = F_f(t) = ∫_{f(x)<t} P(dx),                                  (4.6.1)

under the supposition that this c.d.f. belongs to a subclass of the class of cumulative
distribution functions neutral to the right, with a view to applications in global random
search theory; essentially, the basis of the approach is a specific parametrization of the
c.d.f. (4.6.1).

4.6.1 Random distributions neutral to the right and their properties


A random c.d.f. F(t) is said to be neutral to the right if for each m > 1 and

   -∞ < t_1 < ... < t_m < ∞                                           (4.6.2)

there exist nonnegative independent random variables q_1, ..., q_m such that the random
vector (F(t_1), ..., F(t_m)) is distributed as

Some properties of c.d.f.'s neutral to the right are presented below without proofs.
P1. If F(t) is neutral to the right, then the normalized increments

are mutually independent for each collection (4.6.2) such that F(t_{m-1}) < 1.
P2. A wide class of c.d.f.'s can be approximated by c.d.f.'s neutral to the right.
P3. The posterior c.d.f. of a c.d.f. neutral to the right is neutral to the right.
P4. Most c.d.f.'s F(t) neutral to the right are such that the posterior c.d.f. F(t | Y),
given a sample Y = {y_1, ..., y_N} from F, depends not only on the number N_A of
y's that fall into a set A but also on where they fall within or outside A.
P5. A random c.d.f. F(t) is neutral to the right if and only if it has the same
probability distribution as 1 - exp{-ξ(t)} for some a.s. nondecreasing, right-
continuous, independent-increment random process ξ(t) with

   lim_{t→-∞} ξ(t) = 0 a.s.,   lim_{t→∞} ξ(t) = ∞ a.s.

Within the class of c.d.f.'s neutral to the right, apart from rather trivial cases, only the so-
called Dirichlet processes do not have the property P4. For this reason, the posterior
distributions of these processes are easy to handle and are more widely used in
applications (for instance, in some discrete problems generated by global random search,
see Betró and Vercellis (1986), Betró and Schoen (1987)). But the Dirichlet processes are
too simple and thus unable to approximate the c.d.f. (4.6.1) accurately enough. To this
end, another subclass T of the c.d.f.'s neutral to the right will be considered.
A c.d.f. F(t) neutral to the right is an element of T if the corresponding random
process

   ξ(t) = -log(1 - F(t))                                              (4.6.3)

is a gamma process, i.e. ξ(t) has gamma-distributed independent increments.


For a gamma process ξ(t) the moment generating function

   E exp{-v ξ(t)},   v > 0,                                           (4.6.4)

has the form

   (λ/(λ + v))^{γ(t)},                                                (4.6.5)

where λ is a positive number and γ(t) is a nondecreasing function satisfying

   lim_{t→-∞} γ(t) = 0,   lim_{t→∞} γ(t) = ∞.

The expressions (4.6.3) and (4.6.4) imply a representation

for the moments of a c.d.f. F(t) neutral to the right. For F ∈ T, together with (4.6.5), this
gives the expression

   E{[1 - F(t)]^m} = [λ/(λ + m)]^{γ(t)},   m = 1, 2, ....             (4.6.6)

Denoting the prior c.d.f. of F(t) by β(t) = EF(t), for F ∈ T we obtain

   1 - β(t) = (λ/(λ + 1))^{γ(t)}

from (4.6.6) with m = 1. This yields the representation

   γ(t) = log(1 - β(t)) / log(λ/(λ + 1))                              (4.6.7)

for γ(t).
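A quick Monte Carlo check of (4.6.6) is possible (all parameter values below are arbitrary choices of ours): if ξ(t) has the gamma distribution with shape γ(t) and rate λ implied by (4.6.5), then E[(1-F(t))^m] = E e^{-mξ(t)} = (λ/(λ+m))^{γ(t)}:

```python
import math, random

def moment_mc(gamma_t, lam, m, n=200000, seed=1):
    # average of (1 - F(t))**m = exp(-m*xi(t)) over simulated gamma draws;
    # random.gammavariate takes (shape, scale), so scale = 1/rate
    rng = random.Random(seed)
    return sum(math.exp(-m * rng.gammavariate(gamma_t, 1.0 / lam))
               for _ in range(n)) / n
```

For example, with γ(t) = 2, λ = 3 and m = 2 the simulated moment should be close to (3/5)² = 0.36.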
In order to see the role of the parameter λ, consider the variance of 1 - F(t), which is
represented as

   var(1 - F(t)) = (1 - β(t))^{g(λ)} - (1 - β(t))²                    (4.6.8)

by (4.6.6) for m = 1, 2 and (4.6.7); here

   g(λ) = log(λ/(λ + 2)) / log(λ/(λ + 1)).

This is an increasing function, g(λ) → 1 for λ → 0, and g(λ) → 2 for λ → ∞. Thus larger
values of λ correspond to smaller values of the variance (4.6.8), and smaller values of λ
correspond to larger values of (4.6.8), i.e. λ measures the prior strength of belief about
β(t).
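As a quick numerical sketch (the function name is ours), g indeed increases and interpolates between the two stated limits:

```python
import math

# g(lam) = log(lam/(lam+2)) / log(lam/(lam+1)) from (4.6.8);
# increasing in lam, with g -> 1 as lam -> 0 and g -> 2 as lam -> infinity
def g(lam):
    return math.log(lam / (lam + 2.0)) / math.log(lam / (lam + 1.0))
```

Since 0 < 1 - β(t) < 1, raising it to a larger exponent g(λ) shrinks the first term of (4.6.8), which is why large λ expresses strong prior confidence in β(t).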
The following proposition formulates an important feature of T, viz., for each F ∈ T
the characteristic function of the posterior distribution of ξ(t) = -log(1 - F(t)), given a
sample from F, can be represented in an analytical form.

Proposition 4.6.1. Let F ∈ T, and let η_1, ..., η_N be the order statistics corresponding to
an independent sample Y = {y_1, ..., y_N} from F. Set m_j = N + λ - j, η_0 = -∞,

If the moment generating function for the process (4.6.3) has the form (4.6.5) and γ(t) is
continuous at the points y_1, ..., y_N, then the moment generating function corresponding to
the posterior distribution is equal to

   (4.6.9)

Since for each t the characteristic function of the random variable ξ(t) is

   (4.6.10)

(4.6.9) thus presents the analytical form of the characteristic function of the posterior
distribution of the gamma process ξ(t) = -log(1 - F(t)). In this way one can obtain the
posterior distribution of F by numerical evaluation of a Fourier integral. Once the
posterior distribution of F(t), given the sample, is known, testing statistical hypotheses
about F can be performed in the framework of Bayesian statistics. We shall now describe
how statistical hypotheses about quantiles of F are tested.

4.6.2 Bayesian testing about quantiles of random distributions

It is shown here that in a natural setup the problem of testing a statistical hypothesis about
a random c.d.f. quantile is reduced to a single computation of the posterior probability.
Let F(t) be a random c.d.f., t_p be the p-th quantile of F, Y = {y_1, ..., y_N} be an
independent sample from F, and t* a given constant. The problem is to test the hypothesis
H_0: t_p ≤ t*, which can be rewritten as H_0: F(t*) ≥ p. Let d(Y) be a decision function
assuming two values d_0 and d_1, corresponding to acceptance and rejection of H_0, and let
the losses connected with d_0 and d_1 be

where c_0 and c_1 are given positive values. The posterior mean values of L(F, d_i) are

   E{L(F, d_0) | Y} = c_0 Pr{t_p > t* | Y},

   E{L(F, d_1) | Y} = c_1 Pr{t_p ≤ t* | Y}.

Thus the optimal decision function is

   d*(Y) = d_0  if  c_0 Pr{t_p > t* | Y} ≤ c_1 Pr{t_p ≤ t* | Y},   and d_1 otherwise.   (4.6.11)

To construct d* one needs to evaluate the posterior c.d.f. at t = t*.
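The rule itself is trivial to apply once the posterior probability is available; the function below and its default losses are our sketch of (4.6.11), with the hard part — computing Pr{t_p ≤ t* | Y} from the posterior — left as an input:

```python
# Bayesian decision rule in the spirit of (4.6.11): accept H0: t_p <= t* iff
# the posterior expected loss of accepting, c0 * Pr{t_p > t* | Y}, does not
# exceed that of rejecting, c1 * Pr{t_p <= t* | Y}.
def decide(post_prob_h0, c0=1.0, c1=1.0):
    accept_loss = c0 * (1.0 - post_prob_h0)  # loss if H0 is false but accepted
    reject_loss = c1 * post_prob_h0          # loss if H0 is true but rejected
    return "accept H0" if accept_loss <= reject_loss else "reject H0"
```

With c_0 = c_1 this reduces to accepting H_0 exactly when its posterior probability is at least 1/2.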

4.6.3 Application of distributions neutral to the right to the construction of global
random search algorithms

It is shown below how the results cited can be applied for controlling the precision of some
simple global random search algorithms (like Algorithm 3.1.1).
Let P be the uniform measure on X, Ξ = {x_1, ..., x_N} be an independent sample from
P, f be a bounded measurable function, M = vrai sup f, f* be the record value, and let the
maximization problem for f be stated in terms of obtaining a point from
A(ε) = {x: M - f(x) ≤ ε}, where ε is a given value. Note that

   Y = {y_1 = f(x_1), ..., y_N = f(x_N)}                              (4.6.12)

is an independent sample from the c.d.f. (4.6.1) and

The statistical problem is to test the hypothesis H_0: f* ∈ A(ε), which can be rewritten in
the forms H_0: F(f*) ≥ 1 - ε and H_0: f* ≥ f_{1-ε}, where f_{1-ε} is the (1-ε)-quantile of the
c.d.f. F(t) defined by (4.6.1).
Under the assumption that F is a random c.d.f., the decision rule (4.6.11) can be used
to test the hypothesis H_0 (it is natural to choose e.g. c_0 = c_1 = 1). In order to apply the
results of Section 4.6.1, we suppose that F ∈ T and consider the choice problem for the
parameter λ and the function γ(t) that determine the gamma process ξ(t) = -log(1 - F(t)).
Using (4.6.7) we obtain for each t ∈ R, δ ∈ (0,1)

   Pr{F(t) ≥ 1 - δ} = 1 - Γ(-λ log δ, -log(1 - β(t))/log(1 + 1/λ))    (4.6.13)

where β(t) = EF(t) is the prior c.d.f. corresponding to F(t) and

   Γ(u, v) = ∫_0^u t^{v-1} e^{-t} dt / Γ(v).

Before drawing statistical inference, we demand the following prior information: let for
some pair a, δ ∈ (0,1) a value f_{δ,a} be given such that

   (4.6.14)

It is natural to use e.g. a = 0.5, δ = ε.


Equalities (4.6.13) and (4.6.14) imply the equation below for λ, given β(t):

   1 - a = Γ(-λ log δ, -log(1 - β(f_{δ,a}))/log(1 + 1/λ)).            (4.6.15)

It may happen that (4.6.15) is unsolvable. In this case either (4.6.14) or β should be
modified. The properties of the incomplete gamma function imply that the equation
(4.6.15) has a solution if 1 - β(f_{δ,a}) ≥ δ(1 - a).
As for β(t), it is natural to set it in a parametric form, with subsequent estimation of the
parameters on the basis of the empirical c.d.f.
For instance, if the parameters of β are the mean μ and the variance σ², i.e.

   β(t) = β*((t - μ)/σ),                                              (4.6.16)

where β* is a given function, then

   μ̂ = N^{-1} Σ_{i=1}^{N} y_i,   σ̂² = N^{-1} Σ_{i=1}^{N} (y_i - μ̂)²   (4.6.17)

are natural estimates for μ and σ.
It should be noted that it is inappropriate to take the empirical c.d.f. F_N(t) as β(t),
since in this case F(t) = 1 for all t ≥ η_N and hence F(f*) = 1, i.e. f* ≥ f_{1-ε}, and it is
meaningless to test the hypothesis H_0.
Let us formulate a typical global random search algorithm based on statistical
hypothesis testing for the quantiles of the c.d.f. (4.6.1) under the assumption F ∈ T.

Algorithm 4.6.1.

1. Assume that the prior c.d.f. β(t) for F(t) has the form (4.6.16).
2. For some a, δ ∈ (0,1), take f_{δ,a} satisfying (4.6.14).
3. Obtain a sample Ξ = {x_1, ..., x_N} by sampling a distribution P.
4. Evaluate y_i = f(x_i) for i = 1, ..., N.
5. Estimate μ and σ by (4.6.17).
6. Find λ by solving (4.6.15). If the equation (4.6.15) has no solution, then increase
σ until it becomes solvable.
7. Determine γ(t) from (4.6.7).
8. Obtain f* (for instance, set f* = η_N = max{y_1, ..., y_N}).
9. Estimate the value Pr{F(f*) ≥ 1 - ε | Y} by numerical evaluation of a Fourier integral
of the characteristic function determined by (4.6.9) and (4.6.10).
10. Test the hypothesis H_0: f* ∈ A(ε) by (4.6.11).
11. If H_0 is accepted, then terminate the algorithm; otherwise sample the distribution
P another N_0 times and return to Step 4, substituting N + N_0 for N (N_0 is some fixed
natural number).

Let us comment on the essence and the applicability of Algorithm 4.6.1.

The essence of Algorithm 4.6.1 lies in a particular parametrization of the c.d.f. (4.6.1): it consists of using (4.6.5), (4.6.7), and a parametrization for Φ(t). Only a thorough theoretical and numerical study can show the quality of this parametrization for specific classes of multiextremal problems.
Statistical Inference 185

The complexity of Algorithm 4.6.1 and its modifications is rather high, since one repeatedly needs to solve the complicated equation (4.6.15) and to compute a value of the c.d.f. via the characteristic function. Hence, if f is easy to evaluate, the present approach is unprofitable.
A natural way of applying the above testing procedures for statistical hypotheses concerning quantiles of c.d.f.'s neutral to the right is to use them in branch and probability bound methods for evaluating a prospectiveness criterion. To construct the prospectiveness criterion, one can apply Steps 1-10 of Algorithm 4.6.1, substituting Z for X and the record value of f in Z for f*. Acceptance of the hypothesis H_0: f* ≥ f_{1−ε} for ε ≈ 0 corresponds to the decision that Z does not contain a global maximizer of f, i.e. Z is unpromising for further search.
CHAPTER 5. METHODS OF GENERATIONS

The methods studied in this chapter consist of sequential multiple sampling of probability distributions that asymptotically concentrate in a vicinity of the global optimizer. These methods form the essence of numerous heuristic global random search algorithms and can be regarded as a generalization of simulated annealing type methods, in the sense that groups of points are transformed into groups, rather than points into points.
The methods of generations are rather simple to implement, but are not very efficient for solving easy global optimization problems. Nevertheless, numerical results demonstrate that they can be applied even to very complicated problems (the author used them for solving some location problems in which the number of variables exceeded 100).
Many methods of generations are suitable for the case when the objective function is subject to random noise: this is the case generally considered in this chapter. For convenience, we shall suppose here (as in the preceding chapter) that the maximization problem for f is considered. Besides, the condition X ⊂ R^n is relaxed in this chapter: the feasible region X is supposed to be a compact metric space of arbitrary kind.
The theoretical study of the methods of generations is the main aim of the present chapter, which is divided into four sections. Section 5.1 describes some approaches to algorithm construction and formulates the basic model. Section 5.2 investigates the convergence and the rate of convergence of the sequences of probability measures generated by the basic model. Section 5.3 studies homogeneous variants of the methods; its results are closely connected with the theory of Monte Carlo methods. Finally, Section 5.4 modifies the main methods of generations in a sequential fashion and investigates the convergence of these modifications.

5.1 Description of algorithms and formulation of the basic probabilistic model

5.1.1 Algorithms
Beginning in the late 1960's, many authors suggested heuristic global random search algorithms based on the following three ideas: (i) new points at which to evaluate f are determined mostly not far from the best previous ones; (ii) the number of new points in the vicinity of a previously obtained point must depend on the function value at this point; (iii) it is natural to decrease the search span while approaching a global optimizer. Such
algorithms are described e.g. in McMurty and Fu (1966), Rastrigin (1968), Bekey and
Ung (1974), Ermakov and Mitioglova (1977), Ermakov and Zhigljavsky (1983),
Zhigljavsky (1981, 1987), Price (1983, 1987), Masri et al. (1980), Pronzato et al. (1984)
and in many other works.
A general algorithm relying upon the above ideas is as follows.
Algorithm 5.1.1.

1. Sample some number N_1 times a distribution P_1, obtaining points x_1^(1),...,x_{N_1}^(1); set k = 1.
2. From the points x_j^(i) (j = 1,...,N_i; i = 1,...,k) choose ℓ points x_1*(k),...,x_ℓ*(k) having the greatest values of f.
3. Determine the natural numbers n_kj (j = 1,...,ℓ), applying some rule.
4. Sample n_kj times the distributions Q_k(x_j*(k),dx) for all j = 1,2,...,ℓ, thus obtaining the points

x_1^(k+1),...,x_{N_{k+1}}^(k+1).   (5.1.1)

5. Substitute k+1 for k and go to Step 2.

Algorithm 5.1.1 becomes a special case of Algorithm 3.1.5 (i.e. of the general scheme of global random search algorithms) if n_kj of Algorithm 5.1.1 is used in the latter as N_k. The law of search span decrease is equivalent to defining the transition probabilities Q_k(z,·) so as to meet Q_{k+1}(z,B(z,ε)) ≥ Q_k(z,B(z,ε)) for all k ≥ 1, z ∈ X, ε > 0. The particular choice of the rate of this decrease depends on the prior information about f, on the magnitudes of N_k, and on the accuracy required in determining the extremum. As was established in Section 3.2, if one does not want to miss the global optimizer, then the span must decrease slowly. For the sake of simplicity, the sampling algorithm for Q_k(z,·) is often defined as follows: with a small probability p_k ≥ 0 a uniform distribution on X is sampled, and with probability 1 − p_k a uniform distribution on a ball or a cube (with volume depending on k) centred at z is sampled. In this case the condition

Σ_{k=1}^∞ p_k = ∞

is sufficient for the convergence of the algorithm.
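As an illustration, the mixture transition kernel just described can be sketched as follows; the box-shaped feasible region and the clipping of the local cube to X are our own simplifications, not prescribed by the text.

```python
import random

def sample_Q_k(z, p_k, radius_k, lower=0.0, upper=1.0):
    # With small probability p_k, sample the uniform distribution on the
    # box X = [lower, upper]^n; otherwise sample uniformly from a cube of
    # half-width radius_k (decreasing with k) centred at z, clipped to X.
    n = len(z)
    if random.random() < p_k:
        return [random.uniform(lower, upper) for _ in range(n)]
    return [min(upper, max(lower, z_i + random.uniform(-radius_k, radius_k)))
            for z_i in z]
```

Taking radius_k to shrink slowly while the global component keeps every region reachable reflects the trade-off discussed above.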


Although various versions of Algorithm 5.1.1 can be successfully used in practical applications, it is inconvenient for theoretical research because (i) it is not clear how to choose ℓ (the choice ℓ = 1 leads to the most well-known version of Algorithm 5.1.1), and (ii) much depends on the choice of the numbers n_kj, which is not formalized. Let us introduce randomization into the choice rule for n_kj with a view to overcoming these disadvantages.
Algorithm 5.1.2.

1. Sample N_1 times a distribution P_1, obtaining x_1^(1),...,x_{N_1}^(1); set k = 1.
2. Construct an auxiliary nonnegative function f_k using the results of evaluating f at the points x_j^(i) (j = 1,...,N_i; i = 1,...,k).
3. Sample the distribution

P_{k+1}(dx) = Σ_{j=1}^{N_k} p_j^(k) Q_k(x_j^(k), dx),   (5.1.2)

where

p_j^(k) = f_k(x_j^(k)) / Σ_{i=1}^{N_k} f_k(x_i^(k)),

thus obtaining the points (5.1.1) of the next iteration.
4. Go to Step 2, substituting k+1 for k.

The distribution (5.1.2) is sampled by the superposition method: first the discrete distribution

( x_1^(k), ..., x_{N_k}^(k) ;  p_1^(k), ..., p_{N_k}^(k) )   (5.1.3)

is sampled, and then the distribution Q_k(x_j^(k),·), where x_j^(k) is the realization obtained.
Since the functions f_k are arbitrary, they may be chosen in such a way that the mean values E n_kj of the numbers n_kj of Algorithm 5.1.2 correspond to the numbers n_kj of Algorithm 5.1.1. Allowing for this fact and for the procedure of determining quasi-random points with distribution (5.1.3), one can see that Algorithm 5.1.1 is a special case of Algorithm 5.1.2.
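A minimal sketch of one iteration of Algorithm 5.1.2, with the superposition sampling of (5.1.2) made explicit (the helper names are ours; Q_sample stands for any sampler of Q_k(x,dx)):

```python
import random

def generation_step(points, f_k, Q_sample, N_next):
    # Superposition sampling of (5.1.2): first draw a parent from the
    # discrete distribution (5.1.3) with probabilities p_j ~ f_k(x_j),
    # then spread it with the transition kernel Q_k.
    weights = [f_k(x) for x in points]   # nonnegative, not all zero
    new_points = []
    for _ in range(N_next):
        parent = random.choices(points, weights=weights)[0]
        new_points.append(Q_sample(parent))
    return new_points
```

With f_k concentrated on the ℓ best points and Q_sample a local perturbation, this reproduces the behaviour of Algorithm 5.1.1 on average.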
In theoretical studies of Algorithm 5.1.2 (more precisely, of its generalization, Algorithm 5.1.4) it will be assumed that the discrete distribution (5.1.3) is sampled in the standard way, i.e. independent realizations of a random variable with this distribution are generated. In practical calculations it is more advantageous to generate quasi-random points from this distribution by means of the following procedure, well known in regression design theory (see Ermakov and Zhigljavsky (1987)) as the procedure for constructing exact designs from approximate ones. (Note that this is also a simple variant of the main part extraction method used for reducing the variance of Monte Carlo estimates in the calculation of integrals, Ermakov (1975).) Set r_j^(k) equal to the greatest integer in N_k p_j^(k), and put

N^(k) = Σ_{j=1}^{N_k} r_j^(k).

Then the discrete distribution (5.1.3) splits into a deterministic part, which assigns r_j^(k) copies to the point x_j^(k), and the residual design

E_k^(2) = ( x_1^(k), ..., x_{N_k}^(k) ;  α_1^(k)/(N_k − N^(k)), ..., α_{N_k}^(k)/(N_k − N^(k)) ),

where α_j^(k) = N_k p_j^(k) − r_j^(k). Instead of the N_k-fold sampling of (5.1.3), one can sample E_k^(2) N_k − N^(k) times and choose r_j^(k) times the points x_j^(k) for j = 1,...,N_k. This procedure reduces the indeterminacy in the selection of the points x_j^(k) in whose vicinity the next iteration points are chosen according to Q_k(x_j^(k),·). When it is used, these points include some best points of the preceding iteration with probability one. Besides, the procedure is independent of the method of determining p_j^(k) (and is, therefore, usable when the evaluations of f are subject to random noise).
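The rounding procedure above admits a compact sketch (the names are ours; p holds the probabilities p_j^(k)): the integer parts r_j give deterministic copies, and only the fractional remainders are sampled.

```python
import math
import random

def quasi_random_indices(p, N):
    # Exact-design rounding: r_j = floor(N * p_j) deterministic copies of
    # index j, the remaining N - sum(r_j) indices drawn from the residual
    # probabilities alpha_j = N * p_j - r_j.
    r = [math.floor(N * pj) for pj in p]
    chosen = [j for j, rj in enumerate(r) for _ in range(rj)]
    rest = N - len(chosen)
    if rest > 0:
        alpha = [N * pj - rj for pj, rj in zip(p, r)]
        chosen += random.choices(range(len(p)), weights=alpha, k=rest)
    return chosen
```

For p = (0.5, 0.25, 0.25) and N = 4 the outcome is fully deterministic: indices (0, 0, 1, 2), so the best-weighted points are always retained.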
The quality of the variants of Algorithm 5.1.2 greatly depends on the choice of f_k, which should reflect the properties of f (e.g. be on average greater where f is great and smaller where f is small). Their construction should be based on some technique of objective function estimation, or on a technique of extracting and using information about the objective function during the search. Various estimates f̂_k of f can be used as f_k (in this case

P_{k+1}(dx) = f̂_k(x)μ(dx) / ∫ f̂_k(z)μ(dz)

may be used instead of (5.1.2), where μ is a finite measure on (X,B)), as well as prospectiveness criteria (see Section 4.3) and other functions related to f (e.g. solutions of (5.3.5)). A simple but practically important choice of f_k is f_k = f. The resulting algorithm is readily generalized to the situation when random noise is present. This generalization is given below under the supposition that the result of evaluating f at a point x ∈ X at iteration k is a random variable y_k(x) = f(x) + ξ_k(x) taking values in the set of nonnegative numbers.

Algorithm 5.1.3.

1. Choose a distribution P_1, set k = 1.
2. Sample N_k times the distribution P_k, obtaining points x_1^(k),...,x_{N_k}^(k). Evaluate f at these points. If

Σ_{j=1}^{N_k} y_k(x_j^(k)) = 0,

repeat the sampling.
3. Take the distribution P_{k+1} in the form (5.1.2), where

p_j^(k) = y_k(x_j^(k)) / Σ_{i=1}^{N_k} y_k(x_i^(k)).   (5.1.4)

4. Go to Step 2, substituting k+1 for k.


The heuristic meaning of Algorithm 5.1.3 will be discussed in the next section; its essence is the fact that for large N_k the unconditional distributions of the random elements x_j^(k) are close to the distributions

f^k(x)μ(dx) / ∫ f^k(z)μ(dz)

and, therefore, weakly converge for k → ∞ to the distribution concentrated at the global maximizer of f.
Stopping rules for these and similar algorithms may be constructed following the recommendations of Chapter 4 (termination takes place when the desired accuracy is reached); the simplest rule (execution of a prescribed number of iterations) can be chosen as well.
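Collecting the steps, here is a complete sketch of Algorithm 5.1.3 with f_k = f; the names and the peaked test problem are our own assumptions, and condition (b) of Section 5.2 (y_k(x) ≥ c_1 > 0) keeps the resampling weights from all vanishing.

```python
import random

def method_of_generations(f_noisy, sample_P1, Q_sample, N, iterations):
    # Each generation: evaluate the noisy objective, resample parents with
    # probabilities (5.1.4) proportional to y_k(x_j), spread them via Q_k;
    # the simplest stopping rule (fixed iteration count) is used.
    points = [sample_P1() for _ in range(N)]
    for _ in range(iterations):
        y = [f_noisy(x) for x in points]           # y_k(x) = f(x) + noise
        parents = random.choices(points, weights=y, k=N)
        points = [Q_sample(x) for x in parents]
    return points

# Hypothetical test problem: maximize a peaked f on [0, 1] under small noise.
f_noisy = lambda x: 1.0 / (1.0 + 100.0 * (x - 0.7) ** 2) + 0.01 * random.random()
sample_P1 = lambda: random.uniform(0.0, 1.0)
Q_sample = lambda z: min(1.0, max(0.0, z + random.gauss(0.0, 0.05)))
```

Running a few dozen generations drives the population mean toward the maximizer x* = 0.7, illustrating the concentration property discussed above.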
Algorithms 5.1.1 - 5.1.3 (as well as Algorithm 5.1.4 presented below) will be called methods of generations. This name originates from the fact that these algorithms are analogous to, or direct generalizations of, Algorithms 5.3.3 and 5.3.4, for which 'methods of generations' is the standard terminology in the theory of Monte Carlo methods.

5.1.2 The basic probabilistic model

The algorithm presented below is a generalization of Algorithm 5.1.3 to the case when the f_k are evaluated with random noise; it is a mathematical model of Algorithms 5.1.1 through 5.1.3 and their modifications.

Algorithm 5.1.4.

1. Choose a distribution P_1 on (X,B) and set k = 1.
2. Sample N_k times P_k, obtaining points x_1^(k),...,x_{N_k}^(k).
3. Evaluate the random variables y_k(x_j^(k)) at the points x_j^(k), where y_k(x) = f_k(x) + ξ_k(x) ≥ 0 with probability one. If

Σ_{j=1}^{N_k} y_k(x_j^(k)) = 0,

return to Step 2 (repeat the sampling).
4. Take the distribution P_{k+1} in the form (5.1.2), where the p_j^(k) are defined by (5.1.4).
5. Go to Step 2, substituting k+1 for k.

The measures P_{k+1}(dx), k = 1,2,..., defined through (5.1.2) are the distributions of the random points x_j^(k+1) conditioned on the results of the preceding evaluations of f. In this chapter we shall study their unconditional (average) distributions, which will be denoted by P(k+1,N_k;dx). Obviously, the unconditional distribution of x_j^(1) is P_1(dx) = P(1,N_0;dx).

5.2 Convergence of probability measure sequences generated by the basic model

5.2.1 Assumptions
For the sake of convenience, the assumptions used in this chapter, together with their explanations, are collected here. Assume that

(a) ξ_k(x) for any x ∈ X and k = 1,2,... are random variables with a zero-expectation distribution F_k(x,dξ) concentrated on a finite interval [−d,d]; the random variables ξ_{k_1}(x_1), ξ_{k_2}(x_2),... are mutually independent for any k_1,k_2,... and x_1,x_2,... from X;
(b) y_k(x) = f_k(x) + ξ_k(x) ≥ c_1 > 0 with probability one for all x ∈ X, k = 1,2,...;
(c) 0 < c_1 ≤ f_k(x) ≤ M_k = sup f_k(x) ≤ C < ∞ for all x ∈ X, k = 1,2,...;
(d) the sequence of functions f_k(x) converges to f(x) for k → ∞ uniformly in x;
(e) Q_k(z,dx) = q_k(z,x)μ(dx),

sup_{z,x ∈ X} q_k(z,x) ≤ L_k < ∞

for all k = 1,2,..., where μ is a probability measure on (X,B);
(f) the random elements X_1,...,X_N with a distribution R_N(dx_1,...,dx_N) defined on B^N = σ(B × ... × B) are symmetrically dependent;
(g) the probability distribution P_M(dx_1,...,dx_M) on B^M is expressed in terms of the distribution R_N(dx_1,...,dx_N) through

P_M(dx_1,...,dx_M) = ∫_{Z^N} R_N(dΘ_N) ∏_{j=1}^M [ a(Θ_N) Σ_{i=1}^N A(z_i,ξ_i,dx_j) ],   (5.2.1)

where Θ_N = (z_1,ξ_1,...,z_N,ξ_N),

Z = X × [−d,d],  a(Θ_N) = [ Σ_{i=1}^N (f(z_i) + ξ_i) ]^(-1),  A(z,ξ,dx) = (f(z) + ξ)Q(z,dx);

(h) the global maximizer x* of f is unique, and there exists ε > 0 such that f is continuous on the set B(x*,ε) = B(ε);
(i) μ is a probability measure on (X,B) such that μ(B(ε)) > 0 for any ε > 0;
(j) there exists ε_0 > 0 such that the sets A(ε) = {x ∈ X: f(x*) − f(x) ≤ ε} are connected for any ε, 0 < ε ≤ ε_0;
(k) for any x ∈ X and k → ∞ the sequence of probability measures Q_k(x,dz) weakly converges to ε_x(dz), the probability measure concentrated at the point x;
(l) for any x ∈ X and k → ∞, the sequence of probability measures R(k,N_k,x;dz) weakly converges to ε_x(dz);
(m) for any ε > 0 there are δ > 0 and a natural k_0 such that P_k(B(ε)) ≥ δ for all k ≥ k_0;
(n) for any ε > 0 there are δ > 0 and a natural k_0 such that P(k,N_{k−1};B(ε)) ≥ δ for all k ≥ k_0;
(o) the functions f_k (k = 1,2,...) are evaluated without random noise;
(p) the transition probabilities Q_k(x,·) are defined by

Q_k(x,A) = ∫ 1[z ∈ A, f_k(x) ≤ f_k(z)] T_k(x,dz) + 1_A(x) ∫ 1[f_k(z) < f_k(x)] T_k(x,dz),   (5.2.2)

where the T_k(x,dz) are transition probabilities weakly converging to ε_x(dz) for k → ∞ and all x ∈ X;
(q) P_1(B(x,ε)) > 0 for all ε > 0, x ∈ X;
(r) the transition probabilities Q_k(x,dz) are defined by

Q_k(x,dz) = c_k(x) β_k^(-n) φ((z − x)/β_k) dz,  z ∈ X,   (5.2.3)

where c_k(x) is the normalization constant, φ is a continuous symmetrical finite density in R^n, and

Σ_{k=1}^∞ β_k < ∞;

(s) f_k(x) = f(x), ξ_k(x) = ξ(x), Q_k(x,dz) = Q(x,dz) for each k = 1,2,...; and
(t) f_k(x) = f(x) for k = 1,2,...
Let us now comment on the assumptions formulated.
Condition (a) requires that the evaluation noises be independent and concentrated on a finite interval. The independence requirement does not seem to be of basic importance, although this issue has not been investigated. The requirement of noise finiteness, on the contrary, is necessary: if, for example, the noise at a point distant from x* is positive and very large, then with high probability all the evaluations will take place in the vicinity of this point, and the search process will therefore leave the vicinity of the global extremum even if it was already there.
Superficially, condition (b) seems restrictive, but the possibility of transforming the observed values enables one to set up the optimization problem in a form where (b) is fulfilled. Indeed, if (b) is not met for a regression function h, then one can determine a_k ≥ 0 in such a way that the probability of the event {sup_x |ξ_k(x)| ≤ a_k} is equal or almost equal to 1 and, instead of y_k(x), compute

ỹ_k(x) = f̃_k(x) + ξ̃_k(x),

where

f̃_k(x) = f_k(x) − y_k(x_0) + 2a_k  if y_k(x) − y_k(x_0) + 2a_k ≥ c_1,  and  f̃_k(x) = c_1 otherwise;

ξ̃_k(x) = ξ_k(x)  if y_k(x) − y_k(x_0) + 2a_k ≥ c_1,  and  ξ̃_k(x) = 0 otherwise,

and x_0 is an arbitrary point from X. In this case the regression function is not h(x), but a function that can be made arbitrarily close to max{c_1, h(x) + const} by an appropriate choice of a_k.
Assumption (c), whose major part is a corollary of (b), will be used for convenience in some formulations.
Assumptions (h), (i) and (j) are natural and non-restrictive. The uniqueness requirement concerning x* is imposed in order to simplify the formulations. This requirement may be relaxed: considering the results presented below, one can see that, in fact, one deals with convergence to some distribution concentrated on the set

∩_{ε>0} A(ε) ⊃ {arg max f(x)}

rather than with convergence to ε_{x*}(dx); therefore, if the unique maximizer requirement is dropped, then convergence can be understood in this sense.
Conditions (e), (k) and (l) formulate necessary requirements on the parameters of Algorithm 5.1.4 that can always be satisfied.
Assumptions (f), (g) and (s) are not requirements, but only auxiliary tools for formulating Lemma 5.2.1. For Theorem 5.2.1, a similar role is played by (m) and (n), which can also be regarded as conditions imposed on the parameters of Algorithm 5.1.4. They are not constructive, however, and therefore easily verifiable conditions sufficient for the validity of (m) or (n) are of interest. Conditions (p), (q) and (r) serve these aims for two widely used forms of transition probabilities. Sampling a realization y_k of the random element with the distribution Q_k(x,dy_k) defined through (5.2.2) means that one first determines a realization ζ_k of the random element with distribution T_k(x,dζ_k) and then sets

y_k = ζ_k if f_k(ζ_k) ≥ f_k(x), and y_k = x otherwise.

The transition probabilities Q_k(x,·) may be determined by (5.2.2) if the functions f_k (k = 1,2,...) are evaluated without random noise. In the presence of random noise, the choice through (5.2.3) is a natural way of determining transition probabilities for X ⊂ R^n. To obtain a realization y_k of the random vector with the distribution Q_k(x,dy_k) defined through (5.2.3), one obtains a realization ζ_k of the random vector distributed with the density φ, checks whether x + β_k ζ_k ∈ X (otherwise a new realization ζ_k is determined), and sets y_k = x + β_k ζ_k. Note also that in the case X ⊂ R^n the transition probabilities T_k(x,·) of (5.2.2) can be naturally chosen using (5.2.3).
Let us finally comment on condition (q), which contains requirements on both X and P_1. This condition is best understood in the case when X is a subset of R^n of non-zero Lebesgue measure, because then (q) means that the P_1-measure of any non-empty ball in R^n with centre in X is larger than zero, and that X has no appendices, i.e. parts for which there exist non-empty balls in R^n having centres in these parts and with zero Lebesgue measure of the intersection of these balls with X. The same simple interpretation of (q) is valid in other practically important cases, where X is a part of an m-surface in R^n, is a discrete set, or is a subset of a simply structured functional space (such as L_2 or C([0,1])).

5.2.2 Auxiliary statements

The two lemmas below are more than mere stages in the proof of Theorem 5.2.1: Lemma 5.2.1 will be used in Sections 5.3 and 5.4, and the statement of Lemma 5.2.2 contains very weak conditions sufficient for weak convergence of the distribution sequence (5.2.9) to ε*(·) = ε_{x*}(·), which are of independent interest.

Lemma 5.2.1. Let the assumptions (a), (b), (c), (e), (f), (g) and (s) be fulfilled. Then
1. the random variables with the distribution P_M(dx_1,...,dx_M) are symmetrically dependent;
2. the marginal distribution P_M(dx) = P_M(dx,X,...,X) is representable as

P_M(dx) = [∫ f(z)R_N(dz)]^(-1) ∫ R_N(dz) f(z)Q(z,dx) + Δ_N(dx),   (5.2.4)

where R_N(dz) = R_N(dz,X,...,X), and the signed measures Δ_N converge to zero in variation for N → ∞ with the rate N^(-1/2), i.e. var(Δ_N) = O(N^(-1/2)), N → ∞.

Proof. The first statement follows from (f) and (g) and from the definition of symmetrical dependence. Let us represent the marginal distribution P_M(dx) as follows:

P_M(dx) = ∫_{Z^N} R_N(dΘ_N) a(Θ_N) Σ_{i=1}^N A(z_i,ξ_i,dx)
        = Σ_{i=1}^N ∫_{Z^N} R_N(dΘ_N) a(Θ_N) A(z_i,ξ_i,dx)
        = ∫_{Z^N} R_N(dΘ_N) [N a(Θ_N)] A(z_1,ξ_1,dx).

The resulting relation is represented in the form (5.2.4) with

Δ_N(dx) = ∫ R_N(dΘ_N) A(z_1,ξ_1,dx) { N a(Θ_N) − [∫ f(z)R_N(dz)]^(-1) }.   (5.2.5)

We shall show that Δ_N → 0 in variation for N → ∞. With regard to (e), this convergence is equivalent to the fact that

sup_{x ∈ X} |ν_N(x)| → 0, N → ∞,

where

ν_N(x) = ∫ R_N(dΘ_N) (f(z_1) + ξ_1) q(z_1,x) { N a(Θ_N) − [∫ f(z)R_N(dz)]^(-1) }.

In order to prove this, we show that for any δ > 0 and x ∈ X there exists N* = N*(δ,x) such that

|ν_N(x)| ≤ δ for N ≥ N*.   (5.2.6)

The symmetrical dependence of the random variables Y(X_i) = f(X_i) + ξ(X_i) (i = 1,...,N) follows from the symmetrical dependence of the random elements X_1,...,X_N and condition (a). In virtue of the above and Loève (1963), Sec. 29.4, the random variables

β_N = N^(-1) Σ_{i=1}^N Y(X_i)

converge in mean for N → ∞ to some random variable β independent of all β_i, Y(X_i), i = 1,2,..., with Eβ = ∫ f(z)R_N(dz).

This can be formulated as follows: for any δ_1 > 0 there exists N* ≥ 1 such that E|β_N − β| < δ_1 for N ≥ N*. Denote Ψ = (f(X_1) + ξ(X_1))q(X_1,x). Exploiting the independence of β from β_N and Ψ, the conditions (a), (b), (c) and (e) for the case (s), and also the fact that

vrai sup Ψ ≤ (sup f(x) + d)L = c

(the constant L is from condition (e)), we obtain

|ν_N(x)| = |E(Ψ/β_N) − EΨ/Eβ| = |E(Ψβ/β_N) − EΨ|/Eβ ≤
≤ (Eβ)^(-1) |E(Ψβ/β_N − Ψ)| ≤ c_1^(-1) E(Ψ|β − β_N|/β_N) ≤ c c_1^(-2) E|β − β_N|.

Thus, if one takes δ_1 = δ c_1^2/c, then (5.2.6) will be met for N ≥ N*. Moreover, it follows from the last chain of inequalities that var(Δ_N) ≤ c c_1^(-2) E|β − β_N|. From the central limit theorem for symmetrically dependent random variables (see Blum et al. (1958)) and the inequality

E|β − β_N| ≤ [E(β − β_N)^2]^(1/2),

which is a special case of the inequality given in Loève (1963), Sec. 9.3, it follows that E|β − β_N| = O(N^(-1/2)), N → ∞. Consequently, var(Δ_N) = O(N^(-1/2)), N → ∞. The lemma is proved.
By substituting f_k, N_k, N_{k+1}, P(k,N_{k−1};·), P(k+1,N_k;·), P(k+1,N_k;dx) = P(k+1,N_k;dx,X,...,X) and Δ(k,N_k;·), respectively, for f, N, M, R_N(·), P_M(·), P_M(dx) and Δ_N(·), and applying Lemma 5.2.1, we obtain the following assertion.

Corollary 5.2.1. Let (a), (b), (c) and (e) be met. Then for any k = 1,2,... and N_k = 1,2,... the following equality holds for the unconditional distribution of the random elements x_j^(k):

P(k+1,N_k;dx) = [∫ P(k,N_{k−1};dz) f_k(z)]^(-1) ∫ P(k,N_{k−1};dz) f_k(z) R(k,N_k,z;dx),   (5.2.7)

where

R(k,N_k,z;dx) = Q_k(z,dx) + Δ(k,N_k;dx),

and for any k = 1,2,... the signed measures Δ(k,N_k;·) converge in variation to zero for N_k → ∞ with a rate of order N_k^(-1/2).
The next corollary follows from the above.

Corollary 5.2.2. Let (a), (b), (c) and (e) be met. Then for any k = 1,2,... the sequence of distributions P(k+1,N_k;·) converges in variation for N_k → ∞ to the limit distribution P_{k+1}(·), where

P_{k+1}(dx) = [∫ P_k(dz) f_k(z)]^(-1) ∫ P_k(dz) f_k(z) Q_k(z,dx).   (5.2.8)

Lemma 5.2.2. Let the conditions (c), (d), (h), (i) and (j) be met. Then the sequence of distributions

Q_m(dx) = f^m(x)μ(dx) / ∫ f^m(z)μ(dz)   (5.2.9)

weakly converges to ε*(dx) = ε_{x*}(dx) for m → ∞.

Proof. Set B_i = B(ε_i),

D_i = X \ B_i = {x ∈ X: ||x − x*|| ≥ ε_i}, i = 1,...,4, K_1 = sup_{x ∈ D_1} f(x).

Choose an arbitrary ε_1 > 0. It follows from (h) that there exists ε_2, 0 < ε_2 < ε_1, such that

K_2 = inf_{x ∈ B_2} f(x) > K_1.

For any m > 0 we have

∫_{B_1} (f(x)/K_1)^m μ(dx) ≥ ∫_{B_2} (f(x)/K_1)^m μ(dx) ≥ (K_2/K_1)^m μ(B_2).

By passing to the limit (for m → ∞) in the inequality

μ(D_1) / [(K_2/K_1)^m μ(B_2)] ≥ ∫_{D_1} f^m(x)μ(dx) / ∫_{B_1} f^m(x)μ(dx) ≥ 0,

we obtain

∫_{D_1} f^m(x)μ(dx) / ∫_{B_1} f^m(x)μ(dx) → 0,

whence

Q_m(D_1) → 0,   (5.2.10)

where m → ∞ and Q_m(dx) stands for the distribution (5.2.9).

Choose now an arbitrary function ψ(x) that is continuous on X. By the definition of weak convergence, it suffices to demonstrate that

∫ ψ(x)Q_m(dx) → ψ(x*), m → ∞.   (5.2.11)

For any δ > 0 there exists ε_3 > 0 such that |ψ(x) − ψ(x*)| < δ for all x ∈ B_3. Setting ε_1 = min{ε_1, ε_3}, we have

|∫ ψ(x)Q_m(dx) − ψ(x*)| ≤ ∫ |ψ(x) − ψ(x*)| Q_m(dx) ≤ δ + 2 Q_m(D_1) sup_x |ψ(x)|,

whence the validity of (5.2.11) follows in virtue of (5.2.10). The lemma is proved.
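The concentration effect in Lemma 5.2.2 is easy to observe numerically; the discrete carrier and the quadratic f below are our own illustration, not from the text.

```python
# Mass of the normalized measure f^m mu (mu uniform on a grid) near the
# unique maximizer x* = 0.3; it approaches 1 as the power m grows.
xs = [i / 100.0 for i in range(101)]
f = lambda x: 2.0 - (x - 0.3) ** 2

# Scale by the maximum of f to avoid floating-point overflow at large m.
fmax = max(f(x) for x in xs)

def mass_near(m, eps=0.1):
    w = [(f(x) / fmax) ** m for x in xs]
    return sum(wi for x, wi in zip(xs, w) if abs(x - 0.3) < eps) / sum(w)
```

Here mass_near(1) is about 0.2 (the measure is nearly uniform), while mass_near(2000) already exceeds 0.95, matching the weak convergence to ε*(dx).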

5.2.3 Convergence of the sequences (5.2.7) and (5.2.8) to ε*(dx)

Below, sufficient conditions are given for weak convergence of the distribution sequences (5.2.7) and (5.2.8) to ε*(dx) for k → ∞.

Theorem 5.2.1. Let the conditions (c) through (e) and (h) through (j) be satisfied, as well as (k) and (m), or (l) and (n). Then the distribution sequence determined through (5.2.8) (respectively, through (5.2.7)) weakly converges to ε*(dx) for k → ∞.

Proof. We consider only (5.2.8), because for (5.2.7) the proof is essentially the same, though the formulas are more tedious. Choose from (5.2.8) a weakly convergent subsequence P_{k_i}(dx) (this is possible in virtue of Prokhorov's theorem, see Billingsley (1968)) and denote the limit by ν(dx), where ν stands for a probability measure on (X,B). It follows from (5.2.8) that the subsequence P_{k_i+1}(dx) weakly converges to the distribution Q_1(dx) = c f(x)ν(dx), where c is the normalization constant, and, similarly, P_{k_i+m}(dx) converges to Q_m(dx) of the form (5.2.9). By means of diagonalization one can show that there exists a subsequence P_{k_j}(dx) that converges weakly to ε*(dx).
In virtue of Theorem 2.2 of Billingsley (1968), the set of all finite intersections of open balls with centres in a countable, everywhere dense subset of X and with rational radii is a countable convergence-determining class. Extract from this set a subset S consisting of sets of Q_1-continuity, and enumerate its elements: S = {A_j}, j = 1,2,... Fix a monotone sequence of numbers ε_m > 0, ε_m → 0 as m → ∞.
Since P_{k_i+m}(A) → Q_m(A) for any A ∈ S as i → ∞, there exists a subsequence

R_{1,m}(dx) = P_{k_{i_m}+m}(dx)

for which the inequality |R_{1,m}(A_1) − Q_m(A_1)| < ε_m is valid for every m = 1,2,... Extract in a similar manner from the sequence {R_{j−1,m}(dx)}, m = 1,2,... (j = 2,3,...), a subsequence {R_{j,m}(dx)}, m = 1,2,..., for which |R_{j,m}(A_j) − Q_m(A_j)| < ε_m.
For any A_i ∈ S, i ≤ m, the diagonal subsequence {R_{m,m}(dx)} has the property |R_{m,m}(A_i) − Q_m(A_i)| < ε_m. This subsequence weakly converges to ε*(dx): indeed, for all A_i ∈ S,

|R_{m,m}(A_i) − ε*(A_i)| ≤ |R_{m,m}(A_i) − Q_m(A_i)| + |Q_m(A_i) − ε*(A_i)|.

Here the first term does not exceed ε_m for m ≥ i and therefore approaches zero for m → ∞; the second term approaches zero in virtue of Lemma 5.2.2, where (m) plays the role of (i). Thus there is a subsequence P_{k_j}(dx) converging to ε*(dx). It follows from (5.2.8) that P_{k_j+1}(dx) converges to the same limit, and thus any subsequence of P_k(dx) converges to this limit. The same then holds for the sequence itself. The theorem is proved.
Note that all the conditions used above (with the exception of (m) and (n)) are evident and natural. It is therefore desirable to derive conditions implying (m) and (n). The corollaries of Theorem 5.2.1 presented below give sufficient conditions for convergence of the distributions to ε*(dx) for the two theoretically most important ways of choosing the transition probabilities Q_k(z,dx).

Corollary 5.2.3. Let the conditions (c), (d), (e), (h), (i), (j), (o), (p), (q), (t), and also (k) for the transition probabilities T_k(x,dz) of (5.2.2), be satisfied. Then the sequence of distributions determined by (5.2.8) weakly converges to ε*(dx) for k → ∞.

Proof. It follows from (q) and (h) that P_1(A(δ)) > 0 for any δ > 0, and from (5.2.2) that P_k(A(δ)) ≥ ... ≥ P_1(A(δ)) for any δ > 0, k = 2,3,...; therefore (m) is met. All conditions of Theorem 5.2.1 concerning the sequence (5.2.8) are satisfied, and the corollary is proved.
Corollary 5.2.4. Let the conditions (t), (e), (h), (i), (j), (q) and (r) be satisfied. Then the sequence of distributions determined through (5.2.8) weakly converges to ε*(dx) for k → ∞.

Proof. Under our assumptions the distributions (5.2.8) have continuous densities with respect to the Lebesgue measure; denote them by p_k(x), k = 1,2,... It follows from (5.2.3) that p_k(x) > 0 for any k ≥ 1 and those x ∈ X for which f(x) ≠ 0. Let us show that (m) is satisfied. Fix δ > 0. It follows from (5.2.8) and the finiteness of φ that for any k and ε > 0 the following inequality holds:

P_{k+1}(A(ε + ε_k)) ≥ P_k(A(ε)),   (5.2.12)

where the ε_k ≥ 0 are defined in terms of β_k and the size of the support of the density φ,

Σ_{k=1}^∞ ε_k = const · Σ_{k=1}^∞ β_k < ∞.

Choose k_0 so as to make

Σ_{k=k_0}^∞ ε_k < δ/2,

and let δ_1 = P_{k_0}(A(δ/2)) > 0. For any k ≥ k_0 we then have

P_k(A(δ)) ≥ P_{k_0}(A(δ/2)) = δ_1 > 0,   (5.2.13)

thus (m) is satisfied, and the corollary is proved.


Like Theorem 5.2.1, Corollaries 5.2.3 and 5.2.4 may be reformulated so as to assert convergence of (5.2.7) to ε*(dx). Let us reformulate Corollary 5.2.4, the less trivial of the two.

Corollary 5.2.5. Let the conditions formulated in Corollaries 5.2.1 and 5.2.4 be satisfied. Then there exists a sequence of natural numbers N_k (N_k → ∞ for k → ∞) such that the sequence of distributions P(k+1,N_k;dx) determined by (5.2.7) weakly converges to ε*(dx) for k → ∞.

Proof. Repeat the proof of Corollary 5.2.4, changing only (5.2.12) and (5.2.13). Require N_k to be so large that for every k the inequality

P(k+1,N_k;A(ε + ε_k)) ≥ (1 − δ_k) P(k,N_{k−1};A(ε))

is satisfied instead of (5.2.12), where

δ_k > 0 (k = 1,2,...),  Σ_{k=1}^∞ δ_k < ∞.   (5.2.14)

This is possible in virtue of Lemma 5.2.1. Instead of (5.2.13), we have

P(k+1,N_k;A(δ)) ≥ [ ∏_{i=k_0}^{k} (1 − δ_i) ] P(k_0,N_{k_0−1};A(δ/2)) ≥ δ_1 ∏_{i=k_0}^{∞} (1 − δ_i).

To complete the proof, it remains to exploit the fact that if (5.2.14) is satisfied, then ∏_{k=1}^∞ (1 − δ_k) > 0. The corollary is proved.



5.3 Methods of generations for eigen-measure functional estimation of linear integral operators

5.3.1 Eigen-measures of linear integral operators

Let us introduce some notation that will be used throughout this section.
Let X be a compact metric space; B the σ-algebra of Borel subsets of X; N the space of finite signed measures, i.e. regular countably additive functions on B of bounded variation; N+ the set of finite measures on B (N+ is a cone in the space N); 𝒫 the set of probability measures on B (𝒫 ⊂ N+); C+(X) the set of continuous non-negative functions on X (C+(X) is a cone in C(X), the space of continuous functions on X); C̊+(X) the set of continuous positive functions on X (C̊+(X) is the interior of the cone C+(X)); and let K: X×B → R be a function such that K(·,A) ∈ C+(X) for each A ∈ B and K(x,·) ∈ N+ for each x ∈ X. The analytical form of K may be unknown, but it is required that for any x ∈ X a method be known for evaluating realizations of a non-negative random variable y(x) such that

Ey(x) = f(x) = K(x,X),  var y(x) ≤ σ² < ∞,

and for sampling the probability measure Q(x,dz) = K(x,dz)/f(x) for all x ∈ {x ∈ X: f(x) ≠ 0}.
Denote by 𝒦 the linear integral operator from N to N given by

𝒦ν(·) = ∫ ν(dx)K(x,·).   (5.3.1)

The conjugate operator L = 𝒦*: C(X) → C(X) is defined by

Lh(·) = ∫ h(x)K(·,dx).   (5.3.2)

As is known from the general theory of linear operators (see Dunford and Schwartz (1958)), any bounded linear operator mapping a Banach space into C(X) is representable in the form (5.3.2), and ||L|| = ||𝒦|| = sup f(x). Moreover, the operators 𝒦 and L are completely continuous in virtue of the compactness of X and the continuity of K(·,A) for all A ∈ B.
As is known from the theory of linear operators in a space with a cone, a completely continuous and strictly positive operator L has an eigenvalue λ that is maximal in modulus, positive and simple, and at least one eigen-element belonging to the cone corresponds to it; the conjugate operator L* has the same properties.
In the present case the operator L is determined by (5.3.2). It is strictly positive provided that for any non-zero function h ∈ C+(X) there exists m = m(h) such that L^m h(·) ∈ C̊+(X), where L^m is the operator with the m-times iterated kernel of K.

Thus, if the operator L = 𝒦* is strictly positive (which is assumed to be the case), the maximal in modulus eigenvalue λ of 𝒦 is simple and positive; to it corresponds a unique eigen-measure P in 𝒫 defined by

λP(dx) = ∫ P(dz)K(z,dx),   (5.3.3)

and λ is expressed in terms of this measure as

λ = ∫ f(x)P(dx).   (5.3.4)

It is evident from (5.3.3) and (5.3.4) that if λ ≠ 0, then the necessary and sufficient condition for P to be the unique eigen-measure of 𝒦 in 𝒫 is the following: P is the unique solution in 𝒫 of the integral equation

P(dx) = [∫ f(z)P(dz)]^(-1) ∫ P(dz)K(z,dx).   (5.3.5)

Assume that for any x_1, x_2, … from X an algorithm is known for evaluating random variables η(x_1), η(x_2), … that are mutually independent and such that for any x ∈ X, E η(x) = h(x), var η(x) ≤ σ_1² < ∞, where h is some function from C(X).
In the following, algorithms will be constructed and studied for the estimation of the functional

J = (h,P) = ∫ h(x) P(dx)     (5.3.6)

of the probabilistic eigen-measure of the operator 𝒦. In virtue of (5.3.4), this problem includes the estimation of the maximal eigenvalue of the integral operator (5.3.1), known as the estimation of the critical parameter of a branching process, or the problem of critical system calculation (see Mikhailov (1966), Khairullin (1980), Kashtanov (1987)); the so-called method of generations with a constant number of particles was developed for solving this problem. It finds wide application in practical calculations, and a special technique has been developed for its study. This method is studied below by the apparatus of Section 5.2. Together with its modifications, it will go under the name generation method.
The connection between the problem under consideration and that of searching for the global extremum of f is two-fold: in addition to the interrelation between the methodology and technique of algorithm investigation mentioned above, it turns out that the extremal problems arise from the problems of estimating functionals of eigen-measures as limit problems. This is discussed in the next section.

5.3.2 Closeness of eigen-measures to ε*(dx)

In this section we demonstrate that, in a number of fairly general situations, the problem of determining the global maximizer of f can be regarded as the limit case of determining the eigen-measures P of integral operators (5.3.1) with kernels K_β(x,dz) = f(x)Q_β(x,dz), where the Markovian transition probabilities Q_β(x,dz) weakly converge to ε_x(dz) for β → 0.

In order to relieve the presentation of unnecessary details, assume that X = ℝ^n, μ = μ_n, and that the Q_β(x,dz) are chosen by (5.2.3) with β_k = β, i.e.

Q_β(x,dz) = φ_β(z − x) μ_n(dz),  where φ_β(x) = β^{−n} φ(x/β).     (5.3.7)

Lemma 5.3.1. Let the transition probability Q = Q_β have the form (5.3.7), where φ is a continuously differentiable distribution density on ℝ^n with

∫ ‖x‖ φ(x) μ_n(dx) < ∞;

let f be positive, satisfy the Lipschitz condition with a constant L, attain its global maximum at a unique point x*, and let f(x) → 0 for ‖x‖ → ∞. Then for any ε > 0 and δ > 0 there exists β > 0 such that P(B(δ)) ≥ 1 − ε, where P is the probabilistic solution of (5.3.5).

Proof. Multiply (5.3.5) by f, integrate with respect to X and let β approach zero. Exploiting the Lipschitz property of f and the inequality from Kantorovich and Akilov (1977), p. 318, we obtain that the variance of the random variable f(ξ_β) (where ξ_β is a random vector with distribution P = P_β) tends to zero for β → 0; therefore f(ξ_β) converges in probability to some constant M. To complete the proof, we shall show that M = f(x*). Assume the contrary; then there exist c, ε_1 > 0 such that P(D_β) > 0 and μ_n(D_β) ≥ c for all β > 0, where

D_β = {x: f_1(x) ≥ 1 + ε_1},  f_1(x) = f(x) / ∫ f(z) P(dz).

From (5.3.5) we obtain

P(D_β) = ∫ P(dz) f_1(z) Q(z,D_β) ≥ ∫_{D_β} P(dz) f_1(z) Q(z,D_β) ≥ (1 + ε_1) P(D_β)(1 − δ_β)

with δ_β → 0 for β → 0, which follows from (5.3.7); but this contradicts the assumption. The lemma is proved.
Heuristically, the statement of Lemma 5.3.1 can be illustrated by the following reasoning. In the case studied, P(dx) has a density p(x) that may be obtained as the limit (for k → ∞) of the recurrent approximations

p_{k+1}(x) = c_{k+1} ∫ p_k(z) f(z) φ_β(x − z) μ_n(dz),  c_{k+1} = [∫ p_k(z) f(z) μ_n(dz)]⁻¹,     (5.3.8)

where

φ_β(x) = β^{−n} φ(x/β).

(5.3.8) implies that p_{k+1} is a kernel estimator of the density c_{k+1} p_k f, where the parameter β is called the window width. One can anticipate that for a small β the asymptotic behaviour of the densities (5.3.8) should not differ very much from that of the distribution densities (5.2.9), which converge to ε*(dx) in virtue of Lemma 5.2.2.

Numerical calculations have revealed that in problems resembling realistic ones (for not too bad functions f) the eigen-measures P = P_β explicitly tend, for small β, to concentrate mostly within a small vicinity of the global maximizer x* (or the maximizers). Moreover, this tendency manifests itself already for not very small β (say, of the order of 0.2 to 0.3, under the unit covariance matrix of the distribution with density φ).
The following example illustrates, to some extent, the issue of closeness of P(dx) and ε*(dx).

Example. Let X = ℝ, f = N(a,σ²), i.e. f is the density of the normal distribution with mean a and variance σ²; let Q(x,dz) be chosen via (5.3.7), where φ = N(0,β²). One can then readily verify from (5.3.5) that the density of P(dx) is the normal density with mean a and variance

s² = β(β + (β² + 4σ²)^{1/2})/2,

the positive root of the stationarity equation s² = s²σ²/(s² + σ²) + β². A similar result holds for the multidimensional case, since the coordinates may be considered independently, following an orthogonal transformation.
Now let us establish the possibility of using the generation method of the next subsection for searching for the global maximum of f. The essence is that all search points in these algorithms asymptotically have the distribution P(dx), which can be brought near to a distribution concentrated on the set X* = {arg max f} of global maximizers by an appropriate choice of the transition probability Q(x,dz). If this is the case, then the majority of points determined by the generation methods are, in the limit, sufficiently close to X*, and this is highly desirable for (random) optimization algorithms. (After carrying out a sufficient number of evaluations of f in the vicinity of a point x* ∈ X*, its position can be determined more exactly by constructing a regression model, say, a second-order polynomial one.) This property is of special importance when f is evaluated with random noise. Another positive aspect of the generation methods described below, as global optimization algorithms, is their easy comparability in terms of closeness of the distributions P(dx) and ε*(dx). It is noteworthy that the algorithms of independent random sampling of points in X (Algorithms 3.1.1 and 3.1.2) can also be classified as belonging to the generation methods described below, if one assumes that Q(x,dz) = P_1(dz). For these algorithms P(dx) = P_1(dx) and the points generated by them, therefore, do not tend to concentrate in the vicinity of X*; from the viewpoint of asymptotic behaviour they are inferior to those generation methods whose distribution P(dx) is concentrated near X*. (In this respect, the situation here is quite similar to that concerning the simulated annealing method; see Section 3.3.2.)

5.3.3 Description of the generation methods

Let P_1(dx) be some probability distribution on (X,ℬ), usually taken to be uniform, and

Q(z,dx) = K(z,dx)/f(z)  if f(z) = K(z,X) ≠ 0,
Q(z,dx) = P_1(dx)       if f(z) = 0.

It is assumed in the description of Algorithms 5.3.1 through 5.3.3 that X ⊂ ℝ^n, P_1(dx) is the uniform distribution on X, and the random variable y(x) takes values in the set {0,1,…}.

The most straightforward algorithm, used for a long time for estimating λ, is based on N-fold sampling of the general branching process (see Harris (1963)) defined by K(z,dx) and consists in the following.

Algorithm 5.3.1.

1. Sample N_1 = N times P_1(dx), obtaining x_1^(1), …, x_{N_1}^(1); set k = 1.
2. Set i = 1, N_{k+1} = 0.
3. Sample the random variable y(x_i^(k)), obtaining a realization r_i^(k).
4. Sample r_i^(k) times Q(x_i^(k),dx), obtaining the points

x_{N_{k+1}+1}^(k+1), …, x_{N_{k+1}+r_i^(k)}^(k+1).

5. Put N_{k+1} = N_{k+1} + r_i^(k).
6. If i < N_k, put i = i+1 and go to Step 3.
7. If k < I, put k = k+1 and go to Step 2; otherwise the calculations are stopped.
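The branching scheme above can be sketched as follows. The one-dimensional domain X = [0,1], the objective f(x) = 1 + x, the integer-valued randomization of y(x) and the reflected Gaussian step playing the role of Q are all assumptions made for the sake of a runnable toy example.

```python
import random

random.seed(0)

def branching_generations(sample_P1, sample_y, sample_Q, N, I):
    """Algorithm 5.3.1: N-fold sampling of the branching process."""
    generations = [[sample_P1() for _ in range(N)]]        # Step 1
    for _ in range(I):                                     # Steps 2-7
        offspring = []
        for x in generations[-1]:
            r = sample_y(x)                                # realization r_i^(k)
            offspring.extend(sample_Q(x) for _ in range(r))
        if not offspring:
            break                                          # the process died out
        generations.append(offspring)
    return generations

def sample_P1():                  # P_1: uniform distribution on X = [0, 1]
    return random.random()

def sample_y(x):                  # integer-valued y(x) with E y(x) = f(x) = 1 + x
    base = int(1.0 + x)
    return base + (random.random() < (1.0 + x) - base)

def sample_Q(x):                  # Q(x, dz): Gaussian step clipped back into X
    return min(1.0, max(0.0, x + random.gauss(0.0, 0.1)))

gens = branching_generations(sample_P1, sample_y, sample_Q, N=20, I=5)
print([len(g) for g in gens])
```

Since E y(x) = f(x) > 1 everywhere on this toy instance, λ > 1 and the generation sizes tend to grow geometrically, which is precisely the storage inconvenience of Algorithm 5.3.1 discussed in this subsection.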

In this method and the subsequent algorithms, the number of iterations is defined by a number I.

Since the random vectors x_i^(k) asymptotically (for N_1 → ∞, k → ∞) follow the distribution P(dx) (see Harris (1963)), Algorithm 5.3.1 may be applied to the estimation of the functional (5.3.6). The estimator is constructed in the standard fashion used in Monte Carlo methods:

(5.3.9)

where 1 ≤ I_0 ≤ I. In particular, for I_0 = 1, I → ∞, one obtains the well-known estimator for λ: λ̂ = N_{I+1}/N_I.
From the computational point of view, Algorithm 5.3.1 is inconvenient in the sense that for λ < 1 the process rapidly degenerates (all the particles, i.e. points x_i^(k), die or leave X), while for λ > 1 the number of particles grows with k in geometric progression, so that their storage soon becomes impossible. Modifications of the algorithm have been made with the purpose of overcoming the latter inconvenience: they are called generation methods with a constant number of particles and are described below.

Algorithm 5.3.2.

If the number of descendants (i.e. points x_i^(k+1)) at the k-th step of Algorithm 5.3.1 is N_{k+1} > N_1, then N = N_1 particles (points) of the next generation are randomly chosen from them. If N_{k+1} < N, particles of the preceding generation are added in the same manner until their number in the new generation becomes N.

Algorithm 5.3.3.

The new generation is formed in Algorithm 5.3.1 by N-fold random choice with replacement from the N_{k+1} descendants of the particles of the previous generation. If N_{k+1} = 0, the sampling is repeated until N_{k+1} > 0.
With the use of the above algorithms, the distributions of the random vectors x_i^(k) tend to P(dx) (for k → ∞, N → ∞); therefore, with N_k = N, the estimator (5.3.9) of (5.3.6) is asymptotically accurate.

Obviously, the efficiency of Algorithm 5.3.2 still depends on λ; this algorithm is not Markovian and is thus difficult to study. Algorithm 5.3.3, whose rate of convergence can be investigated by a special technique (see Khairullin (1980)), is more attractive. Let us write Algorithm 5.3.3 in a slightly more general and convenient form. To this end, note that at the k-th iteration of Algorithm 5.3.3, the random choice with replacement from the set

{x_1^(k+1), …, x_{N_{k+1}}^(k+1)}
Methods a/Generations 209

is performed N times, which is equivalent to N-fold sampling of the discrete distribution concentrated on the set

{x_1^(k), …, x_N^(k)}     (5.3.10)

with the probabilities

p_i^(k) = r_i^(k) / Σ_{j=1}^N r_j^(k),  i = 1,…,N,     (5.3.11)

and subsequent sampling of the Markovian transition probability Q. This interpretation of Algorithm 5.3.3 eliminates the integrality requirement on the random variables y(x), while their non-negativity is still prescribed.

Algorithm 5.3.4.

1. Choose a probability distribution P_1(dx) on (X,ℬ); set k = 1.
2. Sample N times the distribution P_k(dx) to obtain the points (5.3.10).
3. Sample the random variables y(x_i^(k)) and obtain their realizations r_i^(k) (i = 1,2,…,N). If

Σ_{i=1}^N r_i^(k) = 0,

repeat the sampling until this sum differs from zero.
4. Set

P_{k+1}(dx) = Σ_{i=1}^N p_i^(k) Q(x_i^(k), dx),

where the p_i^(k) are defined through (5.3.11).
5. Put k = k+1; if k ≤ I, return to Step 2; otherwise stop.
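A compact sketch of Algorithm 5.3.4 used as a global random search method follows. The bimodal objective, the bounded noise in y(x), the Gaussian transition kernel of width beta (as in (5.3.7)) and all numerical parameter values are illustrative assumptions; the weighted resampling of Step 4 is implemented with `random.choices`.

```python
import math
import random

random.seed(1)

def f(x):                                  # global maximizer at x = 2
    return math.exp(-(x - 2.0) ** 2) + 0.6 * math.exp(-(x + 1.0) ** 2)

def sample_y(x):                           # noisy non-negative evaluation of f
    return max(0.0, f(x) + random.uniform(-0.05, 0.05))

def generation_method(N=200, I=60, beta=0.2):
    points = [random.uniform(-5.0, 5.0) for _ in range(N)]  # Step 1: P_1 uniform
    for _ in range(I):
        r = [sample_y(x) for x in points]                   # Step 3
        while sum(r) == 0.0:
            r = [sample_y(x) for x in points]
        # Steps 4 and 2: sample P_{k+1} = sum_i p_i^(k) Q(x_i^(k), dx) by
        # choosing parents with probabilities p_i^(k) = r_i / sum_j r_j and
        # then applying the Gaussian transition Q
        parents = random.choices(points, weights=r, k=N)
        points = [x + random.gauss(0.0, beta) for x in parents]
    return points

pts = generation_method()
print(sum(pts) / len(pts))                 # the particle cloud settles near 2
```

Because each generation reweights by a fresh noisy evaluation of f, the mass at the lower local maximum decays geometrically, illustrating the concentration of the eigen-measure discussed in Section 5.3.2.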

Although Algorithms 5.3.3 and 5.3.4 coincide in the probabilistic sense, their interpretations in terms of particles may differ; see Ermakov and Zhigljavsky (1985) (this work also describes some other approaches to the estimation problem for (5.3.6)).

5.3.4 Convergence and rate of convergence of the generation methods

This section deals with the generation method as formulated in the form of Algorithm 5.3.4. The analysis, like that of Section 5.2, relies upon Lemma 5.2.1, because Algorithm 5.3.4 is a special case of Algorithm 5.1.4 (in which N_k = N, f_k = f, y_k = y, Q_k(z,dx) = Q(z,dx)) and Lemma 5.2.1 establishes some fundamental properties of the latter algorithm.

First let us prove an auxiliary assertion.

Lemma 5.3.2. Let the operator L = 𝒦*, defined by (5.3.2), be strictly positive, let λ be the maximal eigenvalue of 𝒦*, and let P(dx) be the probabilistic eigen-measure corresponding to this eigenvalue. Then the operator U acting from 𝕄 to 𝕄 according to

Uν(·) = ν(·) + λ⁻¹ P(·) ∫ f(z) ν(dz) − λ⁻¹ ∫ ν(dz) f(z) Q(z,·)     (5.3.12)

has a continuous inverse operator.


Proof. In virtue of the Corollary in Kantorovich and Akilov (1977), p. 454, it suffices to demonstrate that the equation Uν = 0 has no non-trivial solution belonging to 𝕄. By Fredholm's alternative, this is equivalent to the statement that the conjugate equation U*u = 0, i.e.

u(·) + λ⁻¹ f(·) ∫ u(z) P(dz) − λ⁻¹ ∫ u(z) K(·,dz) = 0,     (5.3.13)

has no non-trivial solutions belonging to C(X). In order to show this, multiply (5.3.13) by P(dx) and integrate over X. If u satisfies (5.3.13), then it satisfies 𝒦*u = λu and ∫ u(z) P(dz) = 0; these relations can hold together only for a function that is identically equal to zero, since, in virtue of the property mentioned in Section 5.3.1, a non-zero eigen-function of 𝒦* corresponding to the eigenvalue λ is either strictly positive or strictly negative. That proves the lemma.
Theorem 5.3.1. Let the conditions (a), (b), (c), (e) and (g) of Section 5.2 be satisfied; assume that Q(z,dx) ≥ c_2 μ(dx) for μ-almost all z ∈ X, where c_2 > 0 and the probability measure μ is the same as in condition (e) of Section 5.2. Then:
1) for any N = 1,2,… the random elements a_k = (x_1^(k), …, x_N^(k)), k = 1,2,…, (as defined in Algorithm 5.3.4) constitute a homogeneous Markov chain with stationary distribution R_N(dx_1,…,dx_N), the random elements with this distribution being symmetrically dependent;
2) for any ε > 0 there exists N* ≥ 1 such that for N ≥ N* the marginal distribution

R^(N)(dx) = R_N(dx, X, …, X)

differs in variation from P(dx) by at most ε.

Proof. Consider Algorithm 5.3.4 as a procedure for sampling a homogeneous Markov chain in 𝔇 = X^N. Denote the elements of 𝔇 by

a_k = (x_1^(k), …, x_N^(k)).

The initial distribution of the chain is

Q_1(da) = P_1(dx_1) ⋯ P_1(dx_N).

The transition probability, for a = (x_1,…,x_N) and b = (z_1,…,z_N), is

Q(a,db) = ∫_{−d}^{d} ⋯ ∫_{−d}^{d} F(x_1,dξ_1) ⋯ F(x_N,dξ_N) Π_{j=1}^N [ Σ_{i=1}^N (f(x_i)+ξ_i) Q(x_i,dz_j) / Σ_{l=1}^N (f(x_l)+ξ_l) ].     (5.3.14)

Note that this transition probability is Markovian, because of the Markovian assumption concerning the transition probability Q(z,dx).
Let us prove that the recurrently defined distributions

Q_{k+1}(db) = ∫_𝔇 Q_k(da) Q(a,db)

converge in variation to a limit for k → ∞. Indeed, it follows from (5.3.14) and the conditions of the theorem that

Q(a,db) ≥ Π_{j=1}^N Σ_{i=1}^N [c_1/(c_1 + (N−1)(max f + d))] Q(x_i,dz_j) ≥
≥ Π_{j=1}^N N c_1 c_2 [c_1 + (N−1)(max f + d)]⁻¹ μ(dz_j) = c_3 μ_N(db),

where

μ_N(db) = μ(dz_1) ⋯ μ(dz_N)

is a probability measure on (𝔇,ℬ_N) and

c_3 = [N c_1 c_2 / (c_1 + (N−1)(max f + d))]^N

(obviously, 0 < c_3 < 1). Now it follows from the above and from Neveu (1964), Supplement to Section V.3, that

Δ_0 = sup_{a,b∈𝔇} sup_{B∈ℬ_N} {Q(a,B) − Q(b,B)} ≤ 1 − c_3.

In virtue of the exponential convergence criterion (see Loève (1963), Section 27.3), the distributions Q_k(da) converge in variation for k → ∞ to a distribution R_N(da) which is the unique positive solution of

R_N(da) = ∫_𝔇 R_N(db) Q(b,da);     (5.3.15)

moreover, the following relations are valid:

var(Q_k − R_N) ≤ 2(1 − c_3)^{k−1},  k = 1,2,…     (5.3.16)

Using the notation of assumption (g) of Section 5.2, (5.3.15) can be rewritten accordingly. Using Lemma 5.2.1, we obtain that the random elements a_k with distribution Q_k(da_k) are symmetrically dependent for all k = 1,2,…. Let us show that the random elements with distribution R_N(da) are symmetrically dependent as well. For any B = (B_1,…,B_N) ∈ ℬ_N set

ℱ(B) = {A = (B_{i_1}, B_{i_2}, …, B_{i_N}) ∈ ℬ_N : (i_1, i_2, …, i_N) is a permutation of (1,2,…,N)}.

Choose any two sets B ∈ ℬ_N, A ∈ ℱ(B). In virtue of the fact that Q_k(B) = Q_k(A) for all k = 1,2,… and that (5.3.16) is satisfied, we obtain

|R_N(B) − R_N(A)| ≤ |R_N(B) − Q_k(B)| + |Q_k(A) − R_N(A)|.

The left-hand side of this inequality is not influenced by k, while the right-hand side tends to zero for k → ∞. Therefore R_N(B) = R_N(A) for any B ∈ ℬ_N, A ∈ ℱ(B), which is equivalent to the symmetrical dependence of the random elements with probability distribution R_N(da).

Now let us again make use of Lemma 5.2.1 with M = N, P_N = R_N: it follows that the marginal distribution R^(N)(dx) is representable as

R^(N)(dx) = [∫ f(z) R^(N)(dz)]⁻¹ ∫ R^(N)(dz) f(z) Q(z,dx) + Δ_N(dx),     (5.3.17)
where Δ_N → 0 in variation for N → ∞ with a rate of the order N^{−1/2}.

Finally, let us consider the operator T mapping 𝕄×𝕄 into 𝕄 defined by

T(Δ,R)(dx) = R(dx) − [∫ f(z) R(dz)]⁻¹ ∫ R(dz) f(z) Q(z,dx) − Δ(dx).

T is Fréchet-differentiable with respect to the second argument at the point (0,P), the derivative being T_R′(0,P) = U, where the operator U is defined by (5.3.12). In virtue of Lemma 5.3.2, the inverse operator U⁻¹ exists and is continuous; therefore one can apply the implicit function theorem to (5.3.17). This completes the proof.

If the conditions of Theorem 5.3.1 are met, then one can estimate the rate of convergence of P(k,N,dx) to P(dx). Indeed, using (5.3.16) we obtain for all k = 1,2,…

var(P(k,N,·) − R^(N)(·)) ≤ 2(1 − c_3)^{k−1},

i.e. the distributions Q_k(da) converge at the rate of a geometric progression for k → ∞. On the other hand, it follows from Lemma 5.2.1 and the implicit function theorem (see e.g. Kantorovich and Akilov (1977), §4 of Ch. 17) that var(R^(N) − P) ≤ cN^{−1/2}, where c > 0 is a constant.

Thus, if the conditions of Theorem 5.3.1 are satisfied, then the distributions P(k,N,dx) of the random elements x_j^(k) (j = 1,…,N) are close (in variation) to P(dx) for sufficiently large N and k; therefore the estimator (5.3.9) is applicable to (5.3.6). In the case I_0 = 1, the estimate of the mean square error is readily derived. The first term on the right-hand side of the corresponding inequality (the systematic component) can be estimated by means of the above estimates; the order of the second term (the random component) is N⁻¹ for N → ∞.
Two facts that follow from the above results should be mentioned, since they may prove useful together with the discussion of Section 5.3.2 concerning the generation method as a global optimization algorithm.

First, if f_0 is close (in the norm of the space C(X)) to f, then the solutions of (5.3.5) corresponding to these two functions will be close. This follows from the implicit function theorem used in the proof of Theorem 5.3.1, from Lemma 5.3.2, and from the fact that the operator V acting from C(X)×𝕄 into 𝕄 according to

V(g,R)(dx) = R(dx) − [∫ g(z) R(dz)]⁻¹ ∫ R(dz) g(z) Q(z,dx)

has, at the point (f,P), the Fréchet derivative V_R′(f,P) = U with respect to its second argument, where U is defined by (5.3.12). This fact justifies the use of Algorithm 5.3.4 for optimization of an estimate f_0 of the function f, if evaluations of f itself are too expensive.
Assume now that the optimization problem is stated in terms of estimating the point x* on the basis of a fixed (but sufficiently large) number N_0 of evaluations of f (possibly with random noise). If one applies Algorithm 5.3.4 and chooses Q (under the assumption X ⊂ ℝ^n) according to (5.3.7), then the following may be recommended for choosing the algorithm parameters β, N and I: first, β is chosen so that

δ²(β) = var(P − ε*)

is small; then, using the convergence rate with respect to N and prior information about f (approximate number of local extrema, Lipschitz constant, etc.), N is chosen; finally, I is taken to be the integer part of N_0/N. The closeness of the distribution of the random vectors x_j^(I), obtained at the last step of Algorithm 5.3.4, to the distribution ε*(dx) can be estimated using the estimates of the convergence rate and δ²(β), whose value can be estimated by means of the results of Section 5.3.2. Indeed, one has

var(P(I,N,·) − ε*) ≤ δ²(β) + 2(1 − c_3)^{I−1} + cN^{−1/2}.     (5.3.18)

On the basis of this inequality, one can formulate the problem of the optimal choice of β, N and I as minimization of the right-hand side of (5.3.18) under the constraint NI ≤ N_0. The numerical solution of this problem encounters significant computational difficulties, due to lacking or incomplete knowledge of the constants involved in the estimate.
The investigation of the generation method, as described in this section, is based on the apparatus for analysing global search algorithms. That is why the results obtained are of fairly general character (in the sense of the techniques used), but need somewhat specific assumptions that are natural in constructing global random search algorithms. Let us remark that Mikhailov (1966), Khairullin (1980) and Kashtanov (1987) studied the convergence rate of the generation method as presented in the form of Algorithms 5.3.3 and 5.3.4. The convergence rate estimate with respect to N was proved to have the form O(N⁻¹), N → ∞, under assumptions that differ slightly from the above ones and are, generally speaking, more natural for this problem. The approach described here is nevertheless still sensible, because (i) it enables one to detect a number of qualitative features of eigen-measure behaviour and (ii) it may be used, in virtue of its generality, for the investigation of algorithms differing from the generation method (e.g. for the sequential algorithms described in the following section).

5.4 Sequential analogues of the methods of generations

The algorithms of this section are modifications of those described in Sections 5.2 and 5.3; the basic difference is the possibility of using points obtained at all earlier iterations, rather than only those obtained at the preceding one, for the determination of the subsequent points.

5.4.1 Functionals of eigen-measures


The two algorithms described below are modifications of Algorithm 5.3.4 and can be used, like that algorithm, for estimating functionals of the form (5.3.6), for sampling random elements whose distribution is the eigen-measure of (5.3.1), or for estimating the maximum of a function.

Algorithm 5.4.1.

1. Sample N_1 times a probability distribution P_1(dx), obtaining x_1,…,x_{N_1}. Evaluate y(x_i) at these points, where y(x) = f(x) + ξ(x) ≥ 0. If

Σ_{i=1}^{N_1} y(x_i) = 0,

then the sampling procedure is repeated.
2. Set k = N_1.
3. Set

P_{k+1}(dx) = Σ_{i=1}^k p_i Q(x_i,dx),  where p_i = y(x_i) / Σ_{j=1}^k y(x_j).

4. Obtain a point x_{k+1} by sampling the distribution P_{k+1}(dx); evaluate y(x_{k+1}).
5. If k ≥ I, the algorithm is terminated. Otherwise return to Step 3, substituting k+1 for k.

Similarly to the study of Algorithm 5.3.4, let us consider the asymptotic behaviour of the unconditional distributions P(k,dx) corresponding to P_k(dx). Note that P(k,dx) = P_1(dx) for k ≤ N_1.

Using the symbols of assumption (g) of Section 5.2, the distributions P(k+1,dx) for k ≥ N_1 can be represented as

P(k+1,dx) = ∫_Z Π_k(dΘ_k) [k a(Θ_k)]⁻¹ Σ_{i=1}^k A(z_i,ξ_i,dx),     (5.4.1)

where R_k(dx_1,…,dx_k) is the joint probability distribution of the random elements x_1,…,x_k and

R_k(X,…,X,dx_i,X,…,X) = P(i,dx_i)  for i = 1,…,k.

Theorem 5.4.1. Let the conditions (a), (b), (c), (e) and (g) of Section 5.2 be satisfied and let the operator 𝒦* be strictly positive. Then the distributions P(k,dx) defined by (5.4.1) converge weakly, for k → ∞ and N_1 → ∞, to the eigen-measure P(dx) of 𝒦 corresponding to the maximal eigenvalue λ.

Proof. The distributions P(k,dx) converge in variation for k → ∞ to some probability distribution S(dx): indeed, for any m ≥ 1, k ≥ N_1+m the following holds:

P_{k+m}(dx) = (1 − p_{k+m−1}) P_{k+m−1}(dx) + p_{k+m−1} Q(x_{k+m−1},dx) =
= Π_{i=k}^{k+m−1}(1 − p_i) P_k(dx) + Π_{i=k+1}^{k+m−1}(1 − p_i) p_k Q(x_k,dx) + … + p_{k+m−1} Q(x_{k+m−1},dx),

var(P(k+m,·) − P(k,·)) ≤ vrai sup var(P_{k+m} − P_k) ≤
≤ (m−1) vrai sup [max_{k≤i≤k+m−1} p_i] ≤ (m−1)(max f + d)/(c_1 k) → 0

for k → ∞, i.e. the sequence of distributions {P(k,·)} is fundamental in variation. Let us show that the limit S(dx) of the sequence coincides with P(dx). For any A ∈ ℬ we obtain

S(A) = lim_{k→∞} P(k+1,A) = lim_{k→∞} ∫_Z Π_k(dΘ_k) [k a(Θ_k)]⁻¹ Σ_{i=1}^k A(z_i,ξ_i,A).

The assertion will be proved if one shows that δ_{i,k} → 0 (in variation, for k → ∞) for any i = 1,2,…,k, where δ_{i,k} ∈ 𝕄 is defined by

δ_{i,k}(dx) = ∫_Z Π_k(dΘ_k) A(z_i,ξ_i,dx) {[k a(Θ_k)]⁻¹ − [∫ f(z) S(dz)]⁻¹}.

But this fact is proved by almost literally repeating the second part of the proof of Lemma 5.2.1, taking into account that, for uniformly bounded sequences of random variables, convergence in probability is equivalent to convergence in mean. The theorem is proved.

The following modification of Algorithm 5.4.1 is more convenient for practical purposes.

Algorithm 5.4.2.

In Algorithm 5.4.1 assume that for k > N (where N is a fixed number)

p_i = y(x_i) / Σ_{j=k−N}^{k} y(x_j)  for i = k−N,…,k,  and p_i = 0 for i < k−N;

in all other details, repeat the operations of Algorithm 5.4.1.


Unlike Algorithm 5.4.1, the present algorithm uses a constant number of points (N) at each k-th step for k ≥ N. Of course, the number N should be chosen much greater in Algorithm 5.4.2 than in Algorithms 5.3.2, 5.3.3 and 5.3.4.

It is worth noting that in Algorithms 5.4.1 and 5.4.2 the distributions P_{k+1}(dx) are expressed in terms of P_k(dx) in a recurrent manner: this facilitates the construction of sampling algorithms for these distributions. We also remark that, although further (both numerical and theoretical) studies of Algorithms 5.4.1 and 5.4.2 are required, the first experimental results are encouraging.

5.4.2 Sequential maximization algorithms


Everything said in Section 5.3.2 about the capabilities of Algorithm 5.3.4 as an optimization method applies completely to Algorithm 5.4.2. Moreover, the latter algorithm may be improved when the objective function f is evaluated without random noise.

Algorithm 5.4.3.

Perform the same operations as in Algorithm 5.4.2, but for k > N discard the point at which f takes the least value among the N stored points, rather than the point x_i, i = k−N.

The question of convergence in this case, for any set of parameters N_1, N, is solved in a simple manner: if f is continuous in the vicinity of at least one of its global maximizers and if Q(x,B(z,ε)) ≥ δ(ε) > 0 for all x,z ∈ X and ε > 0, then the sequence {f(x_k)} converges in probability to f(x*) for k → ∞.

As opposed to the parameters N_1 and N of Algorithm 5.4.2, their counterparts in Algorithm 5.4.3 may be chosen in an arbitrary manner. If it is a priori improbable that f attains a local maximum, far from the global one, with a value near max f, then even N_1 = 1 and N = 1 become acceptable. The resulting algorithm is Markovian (see Section 3.3), converges under the above conditions, and the limiting distribution of the points x_k is concentrated on the set of global maximizers.
Algorithm 5.4.3 is rather similar to the well-known controlled random search procedure of Price, whose essence is as follows (for more details, see Price (1983, 1987)). At the k-th iteration (k ≥ N > n), n+1 distinct points z_1,…,z_{n+1} are chosen from the collection Z_k consisting of the N points in store; these points define a simplex in ℝ^n. Here z_1 has the greatest objective function value evaluated so far, and the other n points are randomly drawn from the remaining N−1 points of Z_k. A point x_{k+1} is determined as the image of the simplex pole z_{n+1} with respect to the centroid z̄ of the other n points, i.e. x_{k+1} = 2z̄ − z_{n+1}. If x_{k+1} ∉ X, then the operation is repeated. If f(x_{k+1}) does not exceed the least objective function value over Z_k, then Z_{k+1} = Z_k; otherwise x_{k+1} is included into Z_{k+1} in place of the point of Z_k with the least objective function value. From the above it follows that Algorithm 5.4.3 and Price's algorithm differ only in the way of choosing the next trial points.
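Price's procedure, as summarized above, can be sketched for n = 2 as follows; the quadratic objective on the box [-2,2]^2 and all parameter values are illustrative assumptions made for the sake of a runnable example.

```python
import random

random.seed(3)
n, N, ITERS = 2, 25, 3000

def f(p):                        # assumed objective, maximum 0 at (1, -0.5)
    return -(p[0] - 1.0) ** 2 - (p[1] + 0.5) ** 2

def in_X(p):                     # feasibility box X = [-2, 2]^2
    return all(-2.0 <= c <= 2.0 for c in p)

store = [tuple(random.uniform(-2.0, 2.0) for _ in range(n)) for _ in range(N)]
for _ in range(ITERS):
    best = max(store, key=f)                       # z_1: best stored point
    others = random.sample([p for p in store if p is not best], n)
    pole = others[-1]                              # z_{n+1}, the reflected vertex
    base = [best] + others[:-1]                    # remaining n simplex vertices
    centroid = [sum(p[i] for p in base) / n for i in range(n)]
    trial = tuple(2.0 * centroid[i] - pole[i] for i in range(n))
    if not in_X(trial):
        continue                                   # repeat the operation
    worst = min(store, key=f)
    if f(trial) > f(worst):                        # otherwise Z_{k+1} = Z_k
        store[store.index(worst)] = trial

print(max(f(p) for p in store))                    # close to the maximum value 0
```

The stored population contracts around the maximizer, just as the N stored points of Algorithm 5.4.3 do; only the rule generating the trial point differs.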

5.4.3 Narrowing the search area

As demonstrated by numerical calculations, one can somewhat improve the convergence of Algorithm 5.2.4 by gradually narrowing the search area, i.e. by letting Q_k(z,dx) tend to ε_z(dx) for k → ∞. For sequential algorithms a similar problem is still to be analysed. It is evident only that the narrowing operation should be performed slowly enough, to avoid missing the global maximizer. For such a narrowing, one can prove an assertion similar to that of Theorem 5.2.1 on the convergence to ε*(dx) of the sequence of unconditional distributions generated by the algorithm. Since a rigorous proof would require too much space, only the key points are given below.

Let the transition probabilities Q(z,dx) in Algorithm 5.4.1 depend on the iteration count, Q = Q_k, and for k → ∞ let the transition probabilities Q_k(z,dx) converge in variation, uniformly with respect to z, to some transition probability Q(z,dx). Then an analogue of Theorem 5.3.1 is valid; only an additional term, tending to zero, will appear in the estimate of var(P(k+m,·) − P(k,·)). Analysing the equation for the limit measure, which has the form (5.3.5) with K(z,dx) = f(x)ε_z(dx), one obtains that the limit distribution is concentrated on the set {x ∈ X: f(x) = c}, where c is some constant. Finally, in order to show that c = max f, one needs to impose on the transition probabilities Q_k(z,dx) conditions, similar to those of Section 3.2, that ensure

lim inf_{k→∞} ‖x_k − x*‖ = 0

with probability one.


CHAPTER 6. RANDOM SEARCH ALGORITHMS FOR SOLVING
SPECIFIC PROBLEMS

One may encounter various difficulties when applying the technique developed above to specific problem classes. In this chapter we shall discuss ways of overcoming some difficulties arising in constrained, infinite-dimensional, discrete and multicriterial optimization problems.

6.1 Distribution sampling in random search algorithms for solving constrained optimization problems

The application of random search methods to constrained optimization with equality-type constraints inevitably involves sampling of probability distributions on the surface defined by the binding constraints. Below, algorithms are constructed for distribution sampling on general and on some particular manifolds in ℝ^k, relying upon well-known facts of mathematical analysis, multi-dimensional geometry and probability theory.

6.1.1 Basic concepts

Let X be a Borel subset of ℝ^n, n ≥ 1 (X is the parameter space), let ℬ be the σ-algebra of Borel subsets of X with μ_n(X) > 0, let Φ be a continuously differentiable mapping of X into ℝ^k (k ≥ n), ℬ_Y the σ-algebra of Borel subsets of the set Y = Φ(X), x = (x_1,…,x_n), y = (y_1,…,y_k), Φ = (φ_1,…,φ_k). With this notation, y = Φ(x) means that

y_1 = φ_1(x_1,…,x_n), …, y_k = φ_k(x_1,…,x_n).

For any x ∈ X, set

d_ij(x) = Σ_{l=1}^k (∂φ_l(x)/∂x_i)(∂φ_l(x)/∂x_j)  (i,j = 1,…,n),

D(x) = (det ‖d_ij(x)‖)^{1/2},

where the determinant under the root sign is always non-negative, in virtue of the non-negative definiteness of the matrix ‖d_ij(x)‖. The following relation

∫_B f(s) ds = ∫_{Φ⁻¹(B)} f(Φ(x)) D(x) μ_n(dx)

is known to be valid (see Schwartz (1967), §10 of Ch. 4) for any ℬ_Y-measurable function f defined on Y and any set B from the collection {B: B = Φ(A), A ∈ ℬ}, where ds is the surface measure on Y = Φ(X). Hence, for any measurable non-negative function f defined on Y and satisfying the condition

∫ f(Φ(x)) D(x) μ_n(dx) = 1,

the probability measure

P(dx) = f(Φ(x)) D(x) μ_n(dx)

induces the distribution f(s)ds on the manifold Y = Φ(X). In the important particular case of

c = ∫ D(x) μ_n(dx) < ∞,

the distribution

P(dx) = c⁻¹ D(x) μ_n(dx)

induces the uniform distribution c⁻¹ ds on the manifold.


It follows from the above that distribution sampling on Y is reduced to sampling in the parameter space X: indeed, in order to obtain a realization ζ of a random vector in ℝ^k with the distribution f(s)ds on (Y,ℬ_Y), it suffices to obtain a realization ξ of a random vector in ℝ^n with the distribution f(Φ(x))D(x)μ_n(dx) on (X,ℬ) and to take ζ = Φ(ξ).

6.1.2 Properties of D(x)


Below, some properties of D(x) are listed that will be used in the construction of distribution sampling algorithms on particular surfaces.

In the sequel, Φ_i (i = 1,2) will be understood as a mapping

Φ_i = (φ_1^(i), …, φ_k^(i))

of X_i ⊂ ℝ^n onto Y_i ⊂ ℝ^k with components of class C¹, and

d_lj^(i)(x) = Σ_{t=1}^k (∂φ_t^(i)(x)/∂x_l)(∂φ_t^(i)(x)/∂x_j),  D_i(x) = (det ‖d_lj^(i)(x)‖)^{1/2}.

Lemma 6.1.1.
1) Let Y_1 = Y_2 = Y, let H be a C¹-diffeomorphism of X_1 onto X_2 such that Φ_1 = Φ_2∘H, and let f be a ℬ_Y-measurable function, f ≥ 0, with

∫_Y f(s) ds = 1.

Then the distributions f(Φ_i(x)) D_i(x) μ_n(dx) on X_i (i = 1,2) induce the same distribution f(s)ds on Y.
2) Let X_1 = X_2 = X, Φ_1 = cΦ_2 + b, where c is a constant and b is a constant vector. Then D_1(x) = |c|^n D_2(x) for all x ∈ X.
3) If Φ is linear with respect to each coordinate, then D(x) = const.
4) Let X_1 = X_2 = X, Φ_1(x) = BΦ_2(x), where B is an orthogonal (k×k)-matrix (i.e. BB′ = I_k). Then D_1(x) = D_2(x) for all x ∈ X.
5) If n = k, then D(x) = |∂Φ/∂x| is the absolute value of the Jacobian of the transformation Φ.
6) If k = n+1 and φ_j(x) = x_j (j = 1,2,…,n), then

D(x) = [1 + Σ_{i=1}^n (∂φ_{n+1}(x)/∂x_i)²]^{1/2}.

Proof. The first assertion readily follows from Theorem 106 of Schwartz (1967). The second, third and sixth statements are verified by direct calculation of the determinant. The fifth follows from the fact that for n = k one has det(d_{lj}(x)) = [det(∂φ_l(x)/∂x_j)]².
Let us now prove the fourth assertion; we have

φ_i⁽¹⁾ = Σ_{j=1}^{k} b_{ij} φ_j⁽²⁾   (i = 1, ..., k),

where
222 Chapter 6

the elements b_{ij} of the matrix B satisfy

Σ_{m=1}^{k} b_{mj} b_{mt} = δ_{jt} = { 1 if j = t;  0 if j ≠ t }.

Let us show that d_{ij}⁽¹⁾(x) = d_{ij}⁽²⁾(x) for all x ∈ X and i, j = 1, ..., n. Indeed,

d_{ij}⁽¹⁾(x) = Σ_{m=1}^{k} [Σ_{l=1}^{k} b_{ml} ∂φ_l⁽²⁾(x)/∂x_i] [Σ_{t=1}^{k} b_{mt} ∂φ_t⁽²⁾(x)/∂x_j]
= Σ_{l,t=1}^{k} (∂φ_l⁽²⁾(x)/∂x_i)(∂φ_t⁽²⁾(x)/∂x_j) Σ_{m=1}^{k} b_{ml} b_{mt}
= Σ_{l=1}^{k} (∂φ_l⁽²⁾(x)/∂x_i)(∂φ_l⁽²⁾(x)/∂x_j) = d_{ij}⁽²⁾(x).

The lemma is proved.

The above results enable us in some cases to simplify distribution sampling on manifolds. For instance, it follows from the fourth assertion that a uniform distribution on an n-surface may be defined to within an orthogonal transformation. Let us remark, finally, that the validity of statements 2-6 of Lemma 6.1.1 follows from the diffeomorphism theorem, see Schwartz (1967).

6.1.3 General remarks on sampling

The relation between a distribution on the set X having a non-zero Lebesgue measure and a distribution on the manifold Y = Φ(X) enables the reduction of sampling on Y to sampling on X, which is usually much simpler and can be performed by standard methods (such as the inversion formula, the acceptance-rejection method or other procedures described in Ermakov (1975) or in Devroye (1986)). Some methods of sampling complicated distributions can also be used directly on the manifold Y without any changes (which, of course, corresponds to applying the method on Y). For convenience of reference, let us briefly describe the acceptance-rejection method on Y.
Let two distributions

P_i(ds) = φ_i(s) ds   (i = 1, 2)

be defined on Y, and let c ≥ 1 be a constant such that the function g(s) = φ₂(s)/(cφ₁(s)) satisfies g(s) ≤ 1 on Y. The acceptance-rejection method consists in the sequential sampling of pairs {ξ_j, α_j} of independent realizations of a random vector with distribution P₁(ds) and of a random variable with the uniform distribution on [0, 1], until the inequality α_j ≤ g(ξ_j) is observed. The last ξ_j is then a realization of a random vector with distribution P₂(ds). The mean number of pairs produced by the method for generating one realization that follows P₂ equals c.
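A compact sketch of the method (assuming NumPy; the toy target density on [0, 1] is an assumption chosen only to make the check easy):

```python
import numpy as np

rng = np.random.default_rng(1)

def accept_reject(sample_p1, g, size):
    # draw xi ~ P1 and alpha ~ U[0,1] until alpha <= g(xi), where
    # g(s) = phi2(s)/(c*phi1(s)) <= 1; the accepted xi follows P2
    out = []
    while len(out) < size:
        xi, alpha = sample_p1(), rng.uniform()
        if alpha <= g(xi):
            out.append(xi)
    return np.array(out)

# toy check: P1 uniform on [0,1] (phi1 = 1), P2 with density phi2(s) = 2s, c = 2
samples = accept_reject(lambda: rng.uniform(), lambda s: s, 20_000)
print(round(float(samples.mean()), 2))   # the density 2s on [0,1] has mean 2/3
```

On a surface Y the same loop applies verbatim once a sampler for P₁(ds) (e.g. the uniform distribution on Y) is available.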
Below we describe how this general approach can be applied to distribution sampling on various surfaces of first and second order. In doing so, we shall use the notation

Y⁺ = {y = (y₁, ..., y_k) ∈ Y: y_k ≥ 0}.

Obviously, the construction of a sampling algorithm for the part of a distribution concentrated on Y⁺, together with a similar algorithm for Y \ Y⁺, is equivalent to the construction of a sampling algorithm for the distribution itself.

6.1.4 Manifold defined by linear constraints

Let Y be a non-empty set of the form

Y = {y ∈ R^k: G₁y ≤ E₁, G₂y = E₂},

where G_i is an m_i×k-matrix, E_i is an m_i-vector, and the simultaneous equations G₂y = E₂ define an n-dimensional plane. By virtue of the fourth assertion of Lemma 6.1.1, one can regard the left part of G₂ as an identity matrix. Setting x_j = y_j (j = 1, 2, ..., n), one obtains from the equalities G₂y = E₂ that y = G₃x, where G₃ is a k×n-matrix whose upper part is I_n. Assume that

X = {x ∈ R^n: G₃x ∈ Y}  and  Φ: Φ(x) = G₃x.

The mapping Φ transforms the uniform distribution on X into the uniform distribution on Y because D(x) = const., by virtue of the third assertion of Lemma 6.1.1. To sample a uniform distribution on Y it suffices, therefore, to sample the same on X. A similar (but distribution-specific) assertion holds for non-uniform distributions.
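For illustration (a sketch assuming NumPy; the particular plane y₁ + y₂ + y₃ = 1, the inequality constraints y ≥ 0, and the translation vector g₀ that absorbs the constant term E₂ are assumptions of this example):

```python
import numpy as np

rng = np.random.default_rng(2)

# Plane G2 y = E2 with G2 = [1 1 1], E2 = 1, plus the constraints y >= 0;
# parametrize y = G3 x + g0 with x = (y1, y2), the offset g0 absorbing E2.
G3 = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
g0 = np.array([0.0, 0.0, 1.0])

# X = {x: G3 x + g0 >= 0} = {x >= 0, x1 + x2 <= 1}; uniform points by rejection
n = 10_000
cand = rng.uniform(0.0, 1.0, size=(3 * n, 2))
x = cand[cand.sum(axis=1) <= 1.0][:n]
y = x @ G3.T + g0

# D(x) = const for an affine map, so uniformity on X carries over to the patch
print(bool(np.allclose(y.sum(axis=1), 1.0)), bool(y.min() >= -1e-12))
```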

6.1.5 Uniform distribution on an ellipsoid


As is well known, any ellipsoid can be reduced by means of an orthogonal transformation and a translation to the form

Y = {y ∈ R^k: Σ_{i=1}^{k} a_i² y_i² = 1},  where a_i > 0  (i = 1, ..., k).

By virtue of the second and fourth assertions of Lemma 6.1.1, the uniform distribution on Y corresponds to the uniform distribution on the original ellipsoid. Therefore, it suffices to sample only the first of these distributions.

Set n = k − 1, Φ₁: X → Y⁺,

X = {x ∈ R^n: Σ_{i=1}^{n} a_i² x_i² ≤ 1},

y = Φ₁(x) = (x₁, ..., x_{k−1}, (1 − Σ_{i=1}^{n} a_i² x_i²)^{1/2}/a_k).   (6.1.1)

The following distribution

P(dx) = q D₁(x) μ_n(dx)

corresponds to the uniform distribution on Y⁺, where 2/q is the volume of the ellipsoid Y and D₁(x) is computed for the mapping (6.1.1). The representation Φ₁ = Φ₄ ∘ Φ₃ ∘ Φ₂ is valid, where Φ₂ is the mapping of X into the unit ball B(n) = {x ∈ R^n: Σ_{i=1}^{n} x_i² ≤ 1} defined by Φ₂(x) = (a₁x₁, ..., a_nx_n), Φ₃ maps B(n) onto the half-sphere S(k)⁺ = {s ∈ S(k): s_k ≥ 0}, and

Φ₄: S(k)⁺ → Y⁺,  Φ₄(s) = (s₁/a₁, ..., s_k/a_k).
Under the mapping Φ₂, the distribution P(dx) on X becomes a distribution on B(n) which induces on S(k)⁺ the distribution

F(ds) = c₃ (Σ_{i=1}^{k} a_i² s_i²)^{1/2} ds,

where c₃ is a normalizing constant.

Sampling the above four distributions (on Y⁺, S(k)⁺, X and B(n)) is equivalent. Most naturally, F(ds) on S(k)⁺ is to be sampled by means of the acceptance-rejection method, where P₂(ds) = F(ds), P₁(ds) = q ds, and q = 2π^{−k/2}Γ(k/2 + 1) with 2/q standing for the volume of the unit sphere S(k). Sampling P₁(ds) is well known, see e.g. Zhigljavsky (1985), Devroye (1986).
Algorithms for sampling non-uniform distributions on an ellipsoid may be constructed in a similar manner.
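The whole scheme can be sketched as follows (assuming NumPy; k = 3 and the coefficients a_i are illustrative, and the full sphere is used instead of the half-sphere S(k)⁺, which affects only a sign symmetry):

```python
import numpy as np

rng = np.random.default_rng(3)

a = np.array([1.0, 2.0, 4.0])   # ellipsoid sum_i a_i^2 y_i^2 = 1 in R^3

def uniform_on_ellipsoid(size):
    out = []
    while len(out) < size:
        s = rng.normal(size=3)
        s /= np.linalg.norm(s)              # uniform point on the unit sphere
        # accept with probability proportional to the density F(ds),
        # i.e. (sum_i a_i^2 s_i^2)^(1/2); its maximum over the sphere is max(a)
        if rng.uniform() <= np.sqrt(np.sum(a**2 * s**2)) / a.max():
            out.append(s / a)               # Phi4 maps the sphere onto the ellipsoid
    return np.array(out)

y = uniform_on_ellipsoid(5_000)
print(bool(np.allclose(np.sum(a**2 * y**2, axis=1), 1.0)))
```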

6.1.6 Sampling on a hyperboloid

Let the distribution F(ds) = f(s)ds (f ≥ 0, ∫_{Y⁺} f(s)ds = 1) be defined on a hyperboloid, which will be written without loss of generality as (k ≥ 3, 1 ≤ m ≤ k − 1)

Y = {y ∈ R^k: Σ_{i=1}^{k} b_i y_i² = 1, b_i < 0 (i = 1, ..., m), b_j > 0 (j = m + 1, ..., k)}.

Set n = k − 1,

X = {x ∈ R^n: Σ_{i=1}^{n} b_i x_i² ≤ 1},  Φ: X → Y⁺,

Φ(x) = (x₁, ..., x_n, ((1 − Σ_{i=1}^{n} b_i x_i²)/b_k)^{1/2}).

The distribution P₁(dx) = f₁(x) μ_n(dx) on X corresponds to F(ds) on Y⁺, where f₁(x) = f(Φ(x)) D(x).

Assume further that

X₁ = {z ∈ R^n: Σ_{j=m+1}^{n} z_j² ≤ 1}

and define the mapping Φ₁: X → X₁ as follows:

z_i = |b_i|^{1/2} x_i   (for i = 1, ..., m),

z_j = b_j^{1/2} x_j (1 + Σ_{i=1}^{m} |b_i| x_i²)^{−1/2}   (for j = m + 1, ..., n).

The Jacobian of the transformation Φ₂, inverse to Φ₁, is

D₂(z) = (Π_{j=1}^{n} |b_j|^{−1/2}) (1 + Σ_{i=1}^{m} z_i²)^{(n−m)/2}

and the transformation Φ₂ = Φ₁⁻¹ itself is defined as

x_i = |b_i|^{−1/2} z_i   (for i = 1, ..., m),

x_j = b_j^{−1/2} z_j (1 + Σ_{i=1}^{m} z_i²)^{1/2}   (for j = m + 1, ..., n).

Under the mapping Φ₁, the distribution

P₂(dz) = f₁(Φ₂(z)) D₂(z) μ_n(dz)

on X₁ corresponds to P₁(dx) on X.
Thus, in order to obtain a realization s of a random vector with distribution F(ds) on Y⁺, one has to obtain a realization z of a random vector with distribution P₂(dz) on X₁ and to take s = Φ(Φ₂(z)). Sampling distributions on the cylinder X₁ does not encounter any serious difficulty.

6.1.7 Sampling on a paraboloid

Let a distribution F(ds) = f(s)ds be defined on a paraboloid

Y = {y ∈ R^k: y_k = Σ_{i=1}^{k−1} b_i y_i²},  where b_i ≠ 0 for i = 1, ..., k − 1.

Set n = k − 1, X = R^n, Φ: X → Y, and

Φ(x) = (x₁, ..., x_n, Σ_{i=1}^{n} b_i x_i²).

In this case

D(x) = (1 + 4 Σ_{i=1}^{n} b_i² x_i²)^{1/2}

and the distribution f(Φ(x)) D(x) μ_n(dx) on R^n corresponds to F(ds) on Y.
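For instance, the surface-uniform distribution on a bounded patch of the paraboloid (f = const on the patch) has parameter-space density proportional to D(x) and can be sampled by rejection (a sketch assuming NumPy; the patch |x_i| ≤ 1 and the coefficients b_i are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

b = np.array([1.0, 0.5])   # paraboloid y3 = b1*y1^2 + b2*y2^2, patch |x_i| <= 1

def D(x):
    # D(x) = (1 + 4 * sum_i b_i^2 x_i^2)^(1/2)
    return np.sqrt(1.0 + 4.0 * np.sum(b**2 * x**2, axis=-1))

Dmax = D(np.array([1.0, 1.0]))   # sup of D over the patch (attained at a corner)
out = []
while len(out) < 5_000:
    x = rng.uniform(-1.0, 1.0, size=2)
    if rng.uniform() <= D(x) / Dmax:       # rejection step with density ~ D(x)
        out.append([x[0], x[1], b[0] * x[0]**2 + b[1] * x[1]**2])
y = np.array(out)
print(bool(np.allclose(y[:, 2], b[0] * y[:, 0]**2 + b[1] * y[:, 1]**2)))
```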

6.1.8 Sampling on a cone

Let a distribution F(ds) = f(s)ds be defined on a cone

Y = {y ∈ R^k: Σ_{i=1}^{k} b_i y_i² = 0},

where b_i > 0 (for i = 1, ..., m), b_j < 0 (for j = m + 1, ..., k − 1), b_k = −1, k/2 ≤ m < k.
Set n = k − 1,

X = {x ∈ R^n: Σ_{i=1}^{n} b_i x_i² ≥ 0},  Φ: X → Y⁺,

Φ(x) = (x₁, ..., x_n, (Σ_{i=1}^{n} b_i x_i²)^{1/2}).

In this case

D(x) = [1 + (Σ_{i=1}^{n} b_i² x_i²)/(Σ_{i=1}^{n} b_i x_i²)]^{1/2}

and the distribution

P(dx) = f(Φ(x)) D(x) μ_n(dx)

on X corresponds to F(ds) on Y⁺. If m = k − 1, then X = R^n and sampling P(dx) presents no basic difficulties. Now let m < k − 1. Assume that

is an unbounded cylinder, <1>1: X~Xl> z=<1>(x): Zi=bi l!2xi for i=I, ... ,m,

z·=lb.1
1/2
x.
(mLb.x~ )-112 for j = m + 1, .•• ,n.
J J J i=1 1 1

The Jacobian of the mapping <1>2= <1>1- 1 equals

112)( m )cn - m)12


D2(z) = ~nib ·1-n
=1 J
.LZ~
J=1 J

and the mapping <1>2: Xl ~X is defined as follows:

-1/2
x = <1> 2(z): x. =b. z. for i= 1, ... ,m,
1 1 1

x .=
J
Ib J·1-l/2z.J(mi=l
L z?
)112
1
for j = m + 1, ... , n.

For the mapping <1>°<1>2, the distribution

on Xl corresponds to F(ds) on Y+.


Thus, sampling random vectors on non-degenerate second-order surfaces has been
reduced by means of the results obtained at the beginning of this section to sampling on
sufficiently simple sets. In the case of degenerate second-order surfaces, the above
discussion needs only a slight modification.

6.2 Random search algorithm construction for optimization in functional spaces, in discrete and in multicriterial problems
Many of the above global optimization methods may be applied without essential modifications to problems in which the feasible set X is discrete or is a subset of a functional space. Nevertheless, these optimization problems have some distinctive features whose careful consideration could contribute to the efficiency of the methods used. For instance, in discrete extremal problems the usual difficulty lies in selecting a metric on X, while in optimization over functional spaces the basic difficulty is in selecting a method that reduces the problem to a finite-dimensional one.
The aim of this section is to consider some aspects of the above-mentioned problem features, as well as possible ways of using random search algorithms in multicriterial optimization problems.

6.2.1 Optimization in functional spaces

Optimization in functional spaces means that the set X belongs to a functional space and F is a set of functionals f: X → R. For example, numerous problems of mechanics and control are reducible to such optimization. As usual, the consideration of problem-specific features enables one to develop specific and fairly efficient solution methods: e.g. the carefully elaborated calculus of variations is usually employed for the optimization of integral functionals dependent on an unknown function and its derivatives.
Attempts to apply general numerical methods to optimization in functional spaces usually do not meet with basic difficulties, cf. e.g. Vasil'ev (1981). Formally, many of the random (including global) search methods can also be used for functional optimization, although this gives rise to some specific problems related to distribution sampling in functional spaces.
Two major problems occur in distribution sampling, which here corresponds to the sampling of stochastic processes or fields: the uniform random choice of functions from X, and the choice of a function close to a given one. The organization of the uniform choice should depend entirely on X: if X is a subset of a space of the C[a,b] type, the Wiener measure may be chosen as the uniform measure on X; if there is some prior information about the smoothness of the functions in X, one of the Gaussian measures whose trajectories have the desired smoothness should be chosen instead of the Wiener measure. Gaussian measures are preferable because they lend themselves easily to theoretical study and there exist quite a few algorithms for their sampling.
Sampling a random function close to a given one is equivalent to sampling a random function close to zero: for solving this problem one can also employ the methods of sampling Gaussian measures in a special manner, but the following two methods provide a more convenient way of parametrizing the problem, under the assumption that X is a subset of one-dimensional functions. The first method is based on the fact that any Gaussian process is representable as

x(t, ω) = Σ_{i=1}^{∞} λ_i ξ_i(ω) φ_i(t)   (6.2.1)

where λ_i and φ_i(t) (i = 1, 2, ...) are the eigenvalues and corresponding orthonormalized eigenfunctions of the correlation operator of the process, and ξ₁, ξ₂, ... are independent normally distributed random variables with zero mean and unit variance. Sampling is defined by a finite number determining the number of terms in the decomposition (6.2.1) (or a distribution on the set of these numbers), by fixing a basis set {φ_i(t)} (or several sets among which one is randomly chosen each time), and by fixing small values of λ₁, λ₂, ... (or defining a corresponding distribution on the set of these numbers). The desired realizations of a close-to-zero Gaussian process are then obtained by sampling the random variables ξ₁, ξ₂, ... and all the specified distributions and substituting them into (6.2.1).
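A concrete sketch of this first method (assuming NumPy; the eigenpairs used are those of the Wiener process on [0, 1], and the coefficients follow the standard Karhunen-Loeve convention with λ_i^{1/2} in place of λ_i in (6.2.1)):

```python
import numpy as np

rng = np.random.default_rng(5)

def sample_path(t, n_terms=50):
    # truncated expansion (6.2.1) for the Wiener process on [0,1]:
    # lambda_i = ((i - 1/2) pi)^(-2), phi_i(t) = sqrt(2) sin((i - 1/2) pi t)
    i = np.arange(1, n_terms + 1)
    lam = ((i - 0.5) * np.pi) ** (-2.0)
    xi = rng.normal(size=n_terms)            # independent N(0,1) coefficients
    phi = np.sqrt(2.0) * np.sin(np.outer(t, (i - 0.5) * np.pi))
    return phi @ (np.sqrt(lam) * xi)

t = np.linspace(0.0, 1.0, 101)
paths = np.stack([sample_path(t) for _ in range(2_000)])
# the Wiener process has Var W(t) = t; check at t = 1
print(round(float(paths[:, -1].var()), 2))
```

Shrinking the values λ_i (or truncating the sum earlier) produces the close-to-zero realizations described in the text.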
The second method of sampling a close-to-zero, parametrically defined random function consists in the preliminary reduction of a given class of random functions z(t) to a class of parametrically defined functions z(t, θ), θ ∈ Θ, with subsequent sampling of the parameters θ ∈ Θ. As for the parametrization, it is natural to take it nonlinear, since this provides a great variety of forms and profiles of the curves z(t, θ) with a small number of unknown parameters.
For defining a distribution to be sampled on the parameter set, the only point to mention is that the quasi-uniform distributions (see Kolesnikova et al. (1987)), defined by the condition of equal probability that the cross-section z(t₀, θ) of z(t, θ) passes through any point of a given interval [z₁, z₂], seem to be promising. Whereas the quasi-uniform distributions enable one to define uniformity on the set of values of the chosen random functions, the uniform distribution on the parameter set could result in a situation where mostly unreasonable functions z(t, θ) are chosen.
When optimizing in functional spaces, one usually passes to finite-dimensional optimization. The simplest and most widely used way consists in substituting the functions x(t) ∈ X, t ∈ T, by their values on a point set {t₁, ..., t_m}, t_i ∈ T, with subsequent (usually piecewise-linear) approximation of x(t). Here the values x(t_i), i = 1, ..., m, serve as optimization parameters. This optimization is hardly successful because (i) many points t_i should be used in order to support the desired accuracy of the approximation, so the finite-dimensional optimization becomes multiparametric and, as a rule, multiextremal; (ii) it is very difficult to take into consideration a priori information on the smoothness of the functions from X (this may be done only by defining correlations in the random choice of the unknown parameters x(t_i), i = 1, ..., m); and (iii) if the optimization parameters are regarded as independent (which is usually the case, meaning that prior information on the smoothness and the degree of multiextremality of the optimal function x* ∈ X is completely ignored), the result may be meaningless.
Reduction of the set X to the set

X ∩ {x(t, θ) = Σ_{i=1}^{m} θ_i φ_i(t)}

of functions, linear with respect to the unknown parameters θ = (θ₁, ..., θ_m) to be optimized, is a more general and, heuristically, more attractive approach than the preceding one. This technique was used, for instance, by Chen and Rubin (1986). Its drawbacks lie in the uncertainty of choosing {φ_i(t)} and in the presence of all the disadvantages of the first approach (although to a smaller degree).
The transition to functions that depend nonlinearly on the parameters provides a third way of passing from infinite-dimensional optimization to finite dimensionality. The case in which T = [0, ∞) and the optimal function x*(t) ∈ X is known to be unimodal, smooth, positive for t > 0, with x*(0) = 0 and x*(+∞) = 0, indicates a successful transition to such functions. In this case it is natural to narrow X to the set of parametrically defined functions

X ∩ {x(t) = x(t, θ)},

where x(t, θ) are, for instance, functions of the form

(6.2.2)

with unknown parameters θ_i (i = 1, 2, 3, 4). Functions of the form (6.2.2) are well known to approximate with high accuracy any function with the above properties. Other methods of parametrization exist as well.

6.2.2 Random search in multicriterial optimization problems

In complicated practical problems one often comes across a situation where the points of X (called admissible solutions) are to be compared by multiple criteria rather than by a single one, i.e. the performance of the decision variants is evaluated by a vector function F = (f₁, ..., f_m)′, f_i: X → R (i = 1, ..., m). A vector y from Y = F(X) ⊂ R^m will be called an estimate, and an estimate y* from Y such that there is no y ∈ Y for which y ≠ y* and y ≤ y* (i.e. the inequality ≤ holds for each component) will be referred to as an admissible estimate. The set of admissible estimates is called the Pareto set, and the corresponding set of feasible solutions is called the effective (Pareto-optimal) solution set.
Although a large portion of the literature dealing with multicriterial optimization is devoted to the analysis of the Pareto set P and similar subsets of Y, for practical purposes the description of the effective solution set Ɛ ⊂ X for given criteria F is of primary importance. In practice, one can obtain this description only by forming a fairly representative sample of Ɛ and then approximating it (e.g. by piecewise-linear or piecewise-quadratic approximation): the considerations below are devoted to the generation of such a sample.

Some points of the set Ɛ may be obtained by solving minimization problems for the scalar trade-off criteria

f_λ(x) = λ′F(x),  λ ∈ S_m = {λ = (λ₁, ..., λ_m): λ_i ≥ 0 (i = 1, ..., m), Σ_{i=1}^{m} λ_i = 1}

(consideration may also be given to other sets of trade-off criteria); note that the minimizers so obtained constitute the entire set Ɛ if all the scalar criteria f_i (i = 1, ..., m) are convex. Although the individual minimizers of f_λ(x) are not sufficient for describing the whole set Ɛ, they are usually sufficient for obtaining a representative sample from Ɛ.
Theoretically, the minimization of f_λ(x) under various parameters λ is the simplest way of determining points from Ɛ. This way, however, may be inefficient because of the difficulties related to the solution of the single-criterion problems: indeed, for non-convex (but possibly uniextremal) criteria f_i (i = 1, ..., m), the trade-off criteria f_λ(x) are (already) multiextremal. Moreover, small variations of λ can result in abrupt changes of the location of the global minimizer. We indicate a number of approaches that are based on the ideas of random search, readily yield themselves to algorithmization and programming, and might prove useful in solving the problem of describing Ɛ.
First, it is natural to take values of λ that are independent realizations of a random vector uniformly distributed on S_m.
Second, the search for the minima of f_λ(x) should be carried out for all these functions in a simultaneous manner, all the points obtained being tested for admissibility and rejected if found inadmissible.
Third, if Ɛ is a priori known to be connected, local random search is possible in various versions of the algorithm below. At the first iteration, one or more points of Ɛ are determined as the minimizers of criteria f_λ(x); having several points of Ɛ, construct at the k-th iteration the linear hull A_k of these points, then determine in a random fashion several new points in X near to and far from A_k, compute F at these points, test the points' membership in Ɛ, and pass to the next iteration.
Fourth, the following natural approach (see Sobol and Statnikov (1981)) may be used for the numerical construction of Ɛ: choose in X a grid Ξ_N with good uniformity characteristics (e.g. a Π_τ-grid), compute F at the grid points, then construct (in a finite number of comparisons) the set of effective points of Ξ_N, which is an approximation of Ɛ for large N. The substitution of the set X by a finite number of points selected in a special manner is the essence of this approach.
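The non-dominated filtering step of this fourth approach can be sketched as follows (assuming NumPy; the two convex criteria and the plain random grid, used here in place of a Π_τ-grid, are assumptions of the illustration):

```python
import numpy as np

rng = np.random.default_rng(6)

def f1(x): return x[:, 0] ** 2 + x[:, 1] ** 2
def f2(x): return (x[:, 0] - 1.0) ** 2 + (x[:, 1] - 1.0) ** 2

N = 2_000
X = rng.uniform(0.0, 1.0, size=(N, 2))     # random grid standing in for Xi_N
F = np.column_stack([f1(X), f2(X)])

def nondominated(F):
    keep = np.ones(len(F), dtype=bool)
    for i in range(len(F)):
        if keep[i]:
            # points dominated by point i (componentwise >=, strictly > somewhere)
            dom = np.all(F >= F[i], axis=1) & np.any(F > F[i], axis=1)
            keep[dom] = False
    return keep

eff = X[nondominated(F)]
# for these convex criteria the effective set is the diagonal x1 = x2,
# so the retained points should cluster around it
print(len(eff), float(np.abs(eff[:, 0] - eff[:, 1]).max()))
```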
Fifth, using the results of Chapters 4 and 7, one can formulate random search algorithms in which the prospectiveness of uniformly chosen points and of corresponding subsets of X is determined in a probabilistic, rather than deterministic, manner, after the determination of confidence intervals for the minima of f_λ(x) under randomly chosen λ ∈ S_m.
The following algorithm is simple, but it reflects the principal features discussed.
Begin with determining N independent realizations x₁, ..., x_N of a random vector uniformly distributed in X; next compute the vector function F at these points and take ℓ independent realizations λ₍₁₎, ..., λ₍ℓ₎ of a random vector with the uniform distribution on S_m. Perform for each i = 1, ..., ℓ the following procedure: choose the point x_i* among those obtained at which f_{λ(i)}(x) is minimal; construct a ball B(x_i*, ε) with centre x_i* and radius ε selected so that only a small portion of the points x₁, ..., x_N belongs to the ball; using (4.2.25), construct a confidence interval of fixed level 1 − γ for

min_{x ∈ X\B(x_i*, ε)} f_{λ(i)}(x)

using the points x_j ∈ X\B(x_i*, ε) (j = 1, ..., N); if f_{λ(i)}(x_i*) falls into the confidence interval, then construct in a similar manner, in the set X\B(x_i*, ε), a ball of the same radius around the best remaining point, and so on, until the corresponding minimal value falls into the last confidence interval. The union of all the balls constructed is regarded as an approximation to the set of efficient points. This union can be considered as a truncated search domain on which the same operation can be performed, but with a smaller ε. For N, ℓ → ∞ and natural regularity (say, convexity) conditions on f_i (i = 1, ..., m), one can directly prove that the probability of missing the global minimizer of a randomly chosen f_λ(x) tends to a value not exceeding γ.

6.2.3 Discrete optimization


Development and application of the general global search philosophy to discrete optimization requires a more thorough consideration of the specific features of the feasible set than in the continuous case X ⊂ R^n. In particular, the choice of a suitable metric (or pseudo-metric) ρ on a discrete set X is usually not evident and is not defined a priori. Metrics may then be chosen in the course of the search, guided by the following heuristic considerations. For optimization of a function f, the pseudo-metric

ρ_f(z, x) = |f(z) − f(x)|,  z, x ∈ X,

generated by f is the most suitable one. The Lipschitz constant of f in ρ_f equals 1; the ball B(x, ε, ρ_f) contains the points z of X where f(z) differs from f(x) by at most ε. The objective of minimizing the Lipschitz constant leads to the following way of choosing a metric, best fitted to a given function f, from a given set Π = {ρ₁, ..., ρ_l} of metrics or pseudo-metrics: fix a number k₀; normalize the metrics ρ_i so that the number of points in all the balls B(x, ε, ρ_i) of a fixed radius ε (say, ε = 1) approximately equals k₀; estimate the Lipschitz constant of f by (2.2.33) for each ρ_i ∈ Π, and use for the optimization that metric for which the Lipschitz constant estimate is minimal.
The set Π might contain metrics that are standard for the sets under consideration and, if possible, pseudo-metrics ρ_g for functions g that are close in some sense to f (e.g. estimates of f).
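A sketch of this selection heuristic (assuming NumPy; the toy set X = {0,1}⁸, the objective f equal to three times the Hamming weight, and the crude pairwise estimate used in place of (2.2.33) are all assumptions of the illustration):

```python
import itertools
import numpy as np

rng = np.random.default_rng(7)

# toy discrete set X = {0,1}^8 and objective f = 3 * (Hamming weight)
X = rng.integers(0, 2, size=(200, 8))
f = 3.0 * X.sum(axis=1)

metrics = {
    "hamming": lambda u, v: float(np.sum(u != v)),
    "rho_f": lambda u, v: 3.0 * abs(float(u.sum()) - float(v.sum())),  # = |f(u)-f(v)|
}

def lipschitz_estimate(rho, n_pts=60):
    # crude sample-based stand-in for the estimator (2.2.33)
    est = 0.0
    for i, j in itertools.combinations(range(n_pts), 2):
        d = rho(X[i], X[j])
        if d > 0:
            est = max(est, abs(f[i] - f[j]) / d)
    return est

estimates = {name: lipschitz_estimate(rho) for name, rho in metrics.items()}
best = min(estimates, key=estimates.get)
print(estimates, best)   # the pseudo-metric generated by f has Lipschitz constant 1
```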
Let us demonstrate how the methods of branch and probability bounds may be used for optimizing discrete functions. The only basic difficulty lies in the fact that the order statistics apparatus cannot be formally applied. Indeed, since there is a positive probability of hitting the bound (i.e. a global optimizer of f), there is no sense in considering conditions like (a) of Section 4.2, as the corresponding limit simply does not exist. Moreover, the order statistics of discrete distributions do not form a Markov chain, see Nagaraja (1982). But for a very great number m of points of X (only this case is of practical interest), one may assume that the probability of hitting the bound exactly is negligible and that the discrete c.d.f. F(t) = Pr{x: f(x) < t} is approximated, to a high accuracy, by a continuous c.d.f., to which one can apply the extreme order statistics theory and, therefore, the apparatus of Sections 4.2–4.4 and of Chapter 7. In this way, for very large m the accuracy of statistical inference made under the continuity assumption diminishes only insignificantly. Such an approach was applied in Dannenbring (1977), Golden and Alt (1979), and Zanakis and Evans (1981) (but, of course, these works do not use the more up-to-date statistical apparatus described in Chapters 4 and 7).
We shall now turn to the study of the relative efficiency of discrete random search, following Ustyuzhaninov (1983) in the formulation of the problem.

6.2.4 Relative efficiency of discrete random search

Below a problem type is considered for which an exact result on the relative efficiency of random search can be obtained.
Let X and Y be finite sets, X consisting of m elements, and let f: X → Y be an algorithmically given function. A non-empty point set M(f) is associated with f; e.g. M(f) is the set of its global minimizers. It is required to find a point x ∈ M(f) through sequential evaluations of f. The function f is known to belong to a finite set F consisting of ℓ functions. Thus, a table with ℓ rows and m columns is given in such a way that a function f ∈ F corresponds to a row and a point x ∈ X to a column: the value f(x) stands at the intersection of row f and column x.
Consider now the scheme of random search algorithms solving the given problem. An algorithm involves two stages. The first stage contains not more than s(a) iterations. At each k-th iteration either the transition to the second stage is decided upon (perhaps at random), or a point at which to evaluate f is chosen according to a (conditional) probability distribution. At the second stage a point x is chosen that is thought to belong to M(f): here a mistake is possible.
An algorithm is called deterministic if all the indicated probability distributions are degenerate (i.e. all decisions are deterministic); otherwise it is called probabilistic.
Define the problem π as a pair π = (F, Ω) consisting of a class of functions F and a family of sets Ω = {M(f), f ∈ F}. A deterministic algorithm d is called applicable to a problem π if for any function f ∈ F it gives an element from M(f) without mistake. Denote the class of all such algorithms by D(π). The maximal number N(d) of evaluations of f needed when using an algorithm d is called the problem hardness with respect to d. The problem complexity is defined as

N_D = min_{d ∈ D(π)} N(d).

Let p(M(f)|f, r) be the probability that the application of an algorithm r to a function f yields a correct solution. We shall say that an algorithm r p-solves a problem π if p(M(f)|f, r) ≥ p for any f ∈ F. Denote by R_p(π) the class of p-solving algorithms for the problem π and set

N_p = inf_{r ∈ R_p(π)} N(r).

Since D(π) ⊂ R_p(π), we have N_p ≤ N_D.



According to Ustyuzhaninov (1983), we shall characterize the relative hardness of random search for a problem π by the ratio γ = N_D/(N_p + 1), which is called the problem index. We shall call a problem π full if, for any unknown function f ∈ F and any collection Ξ = {x₁, ..., x_k} of points of X with the property Ξ ∩ M(f) ≠ ∅, given Ξ, f(x_i) (i = 1, ..., k) and the fact that Ξ ∩ M(f) ≠ ∅, an element x ∈ Ξ ∩ M(f) can be indicated. E.g., a full problem is the problem of searching for a point at which the value of a multiextremal function differs from a known minimal value by not more than ε.
Consider full problems π = π_m in which the classes F consist of m different functions, and let 0 < p < 1, q = 1 − p. Then Ustyuzhaninov (1983) showed that

0 ≤ γ ≤ (log(pm))/log(1/q) + 1/p   (6.2.3)

for the index γ of a full problem π. The inequality (6.2.3), however, is wrong. In order to show this, it suffices to let p tend to zero on the right-hand side of (6.2.3):

lim_{p→0} ( log(pm)/(−log(1 − p)) + 1/p ) = −∞.

This leads to the contradiction 0 ≤ −∞.


In the proof of (6.2.3) the following error was made: the estimate N_D ≤ γ/(1 − γ), trivially following from the equality N_D = γ(N_p + 1) and the inequality N_p ≤ N_D, is valid only for γ < 1, but Ustyuzhaninov (1983) used this estimate for γ ≤ 1/p (thus including cases where γ ≥ 1).
Following the reasoning of the mentioned work, let us prove the correct inequality with respect to γ.
If 0 ≤ γ ≤ 1 − 1/m, then γ ≤ N_D ≤ γ/(1 − γ). The same holds for 0 ≤ γ < 1, since the values of γ cannot fall into the interval (1 − 1/m, 1). Let γ ≥ 1. Taking the valid inequality below from Ustyuzhaninov (1983)

N_D ≤ m q^v/(1 − v/γ),  0 ≤ v ≤ [γ],   (6.2.4)

choose v so as to make its right-hand side minimal. The function

φ(v) = m q^v/(1 − v/γ)

attains its minimal value at v₀ = γ + 1/log q. If q > 1/e ≈ 0.368, then

γ + 1/log q < [γ],


thus the minimum of φ is reached at the point v₀. In the inequality (6.2.4), set v = [γ + 1/log q], express this quantity as v = γ + 1/log q − σ (where 0 ≤ σ < 1), and substitute it into (6.2.4), obtaining N_D ≤ m q^γ γ/χ(σ), where

χ(σ) = q^{σ − 1/log q} (σ − 1/log q).

The function χ decreases on the interval [0, 1), since χ′(0) = 0 and χ′(σ) < 0 for σ > 0. Therefore χ(σ) > χ(1) for 0 ≤ σ < 1, whence N_D ≤ m q^γ γ/χ(1). Using this inequality and the relation γ ≤ N_D, one obtains γ ≤ K(q, m), where

K(q, m) = 1 − (log m)/log q − (1 + log((−log q)/(1 − log q)))/log q.   (6.2.5)

If v₀ ≤ 0, i.e. γ ≤ −1/log q, then the inequality (6.2.4) is equivalent to N_D ≤ m; this way N_D ≤ m − 1 for m ≤ q^{1−1/log q}. If q < 1/e, then the restrictions on m are absent. Summing up the above, the indices γ of full problems π_m satisfy the following conditions:
a) γ ≤ m − 1 for q > 1/e, m ≤ q^{1−1/log q};
b) −1/log q ≤ γ ≤ K(q, m) for q > 1/e, m > q^{1−1/log q};
c) 0 ≤ γ ≤ K(q, m) for q ≤ 1/e,
where K(q, m) is determined by (6.2.5).
As follows from condition b), the maximal acceleration due to using random search is (1 + s)/(1 − s), and it is attained for q = exp(s/(s − 1)), where s is the solution of the equation ms = exp(−s).
PART 3. AUXILIARY RESULTS

CHAPTER 7. STATISTICAL INFERENCE FOR THE BOUNDS OF RANDOM VARIABLES

This chapter describes and studies statistical procedures occupying a significant place in the theory of global random search (these procedures are included in some of the methods of Chapter 4). Most attention is paid to linear statistical procedures that are simple to realize.
Section 7.1 states the problems and considers the case when the value of the tail index of the extreme value distribution is known. This is a typical situation in global random search theory, as follows from the results of Section 4.2.6.
Section 7.2 deals with the case when the value of the tail index is unknown. Finally, Section 7.3 investigates the asymptotic normality and optimality of the best linear estimates.

7.1 Statistical inference when the tail index of the extreme value
distribution is known

7.1.1 Motivation and problem statement


Let an independent sample Y = {y₁, ..., y_N} of values of a random variable y with a continuous c.d.f. F(t) be given, y being concentrated on an interval [L, M], where −∞ ≤ L < M < ∞. More precisely, let the upper bound

M = vrai sup y = inf{a: F(a) = 1}   (7.1.1)

of y be a.s. finite; statistical inference for M will be considered throughout the chapter. Statistical inferences for the lower bound L = vrai inf y under the supposition L > −∞ are constructed similarly, or can be elementarily obtained from the results related to M, and hence will not be considered here.
Various approaches can be used for constructing statistical inferences. In particular, the parametric approach is based on the supposition that an analytical form of the c.d.f. F(t) is accepted (identified), but some of its parameters are unknown and are estimated from the sample. This approach is of moderate interest in the context of global random search theory and is not considered below.
The yearly maximum approach, thoroughly described by Gumbel (1958) and studied in a number of works, the most valuable of which is Cohen (1986), involves the partition of the sample Y of size N = rh into r equal subsamples and the estimation of the extreme value distribution parameters as if the maximal elements of the subsamples had this distribution; it is generally inefficient as well. After this inefficiency was realized, many works were devoted to the problem: among them the work of Robson and Whitlock (1964) was the first, and most of them became known in the 1980s. These works (including those of the present author) construct and study statistical inference about M based on some k + 1 elements of the maximal order statistics

η_{N−k} ≤ η_{N−k+1} ≤ ... ≤ η_N   (where k = k(N) ≥ 1, k/N → 0 for N → ∞)   (7.1.2)

from the set H = {η₁, ..., η_N} of the order statistics derived from the sample Y, rather than on the whole sample Y. The following arguments may be put in favour of this approach: (i) according to heuristic reasoning, the order statistics not belonging to (7.1.2) are far enough from M and so do not carry much valuable information concerning M; (ii) the theoretical considerations presented below, as well as the corresponding numerical results, show that if k is sufficiently large, then a further increase of k (under N → ∞) may lead either to an insignificant improvement or even to a deterioration of the precision of the statistical procedure (due to the inaccuracy of computations); (iii) using (7.1.2), the asymptotic theory of extreme order statistics can be applied to construct and investigate the decision procedures. We shall confine ourselves to this approach, it being a semiparametric one (at present we do not see any satisfactory alternative).
The following classical result from the theory of extreme order statistics (see, for
instance, Galambos (1978)) is essential for the theory presented later.

Theorem 7.1.1. Assume that the condition below is satisfied:

a) the upper bound (7.1.1) is finite and the function V(v) = 1 - F(M - v^{-1}), v > 0, regularly
varies at infinity with some exponent -α, 0 < α < ∞, i.e.

    lim_{v→∞} V(tv)/V(v) = t^{-α}

holds for each t > 0.

Then

    (η_N - M)/(M - θ_N) → Ψ_α  (in distribution, N → ∞),    (7.1.3)

where θ_N is the (1 - 1/N)-quantile of the c.d.f. F, i.e.

    F(θ_N) = 1 - 1/N,    (7.1.4)

and

    Ψ_α(z) = exp{-(-z)^α}  for z < 0,
    Ψ_α(z) = 1             for z ≥ 0.    (7.1.5)

The asymptotic relation (7.1.3) implies that (for N → ∞) the sequence of random
variables (η_N - M)/(M - θ_N) converges in distribution to the random variable with the c.d.f.
(7.1.5), which is called the extreme value c.d.f. and, together with Ψ_∞(z) = exp{-exp(-z)}, is
the only nondegenerate limit of the c.d.f. sequences for (η_N + a_N)/b_N (where {a_N} and
{b_N} are arbitrary numerical sequences).
Bounds of Random Variables 241
The parameter α of the c.d.f. (7.1.5) is called the tail index (or the shape parameter) of
the extreme value distribution. This section deals with the case where condition a) of
Theorem 7.1.1 holds and the value of the parameter α is known; Section 7.2 will treat
the case of unknown α.

7.1.2 Auxiliary statements


In this chapter we shall often use the following well-known results. The first is the so-
called Renyi representation (see e.g. Galambos (1978))

    η_{N-i} = F^{-1}(exp{-(ξ_0/N + ξ_1/(N-1) + ... + ξ_i/(N-i))})    (7.1.6)

where ξ_0, ξ_1, ..., ξ_i are mutually independent exponential random variables with the
density e^{-x}, x ≥ 0. The second is the asymptotic relation

    M - F^{-1}(x) ~ (M - θ_N)(-N log x)^{1/α},  N → ∞,    (7.1.7)

due to Cooke (1979), being a simple consequence of (7.1.3). Here the notation a_N ~ b_N
(N → ∞) means that the limit values

    lim_{N→∞} a_N  and  lim_{N→∞} b_N

exist and are equal. (Here convergence in distribution is considered, if {a_N} and {b_N}
are sequences of random variables.)
The next statement immediately follows from (7.1.6) and (7.1.7).

Lemma 7.1.1. If a) holds then for N → ∞, i/N → 0 the asymptotic equality

    M - η_{N-i} ~ (M - θ_N)(ξ_0 + ... + ξ_i)^{1/α}    (7.1.8)

is valid, where ξ_0, ..., ξ_i are as in (7.1.6).

Note that, according to a classical result, the sum ξ_0 + ... + ξ_i in (7.1.8) follows the gamma
distribution with the density

    p(x) = x^i e^{-x}/Γ(i + 1),  x > 0.    (7.1.9)
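The gamma representation (7.1.8)-(7.1.9) is easy to check numerically. Below is a minimal sketch (not from the book) assuming the particular model c.d.f. F(t) = 1 - (M - t)^α, for which M - θ_N = N^{-1/α} exactly and the order statistic η_{N-i} can be sampled directly through a Beta-distributed order statistic of the underlying uniforms:

```python
# Monte Carlo check of (7.1.8)-(7.1.9): for the assumed model c.d.f.
# F(t) = 1 - (M - t)**alpha we have M - theta_N = N**(-1/alpha), and the
# order statistic satisfies M - eta_{N-i} = U**(1/alpha) with
# U ~ Beta(i+1, N-i); hence E(M - eta_{N-i}) should be close to
# (M - theta_N) * Gamma(i + 1 + 1/alpha)/Gamma(i + 1).
import math
import random

random.seed(0)
alpha, N, i, reps = 2.0, 10**4, 3, 20000

mean_gap = sum(random.betavariate(i + 1, N - i) ** (1 / alpha)
               for _ in range(reps)) / reps

theta_gap = N ** (-1 / alpha)                                # M - theta_N
b_i = math.gamma(i + 1 + 1 / alpha) / math.gamma(i + 1)
print(mean_gap, theta_gap * b_i)   # the two values nearly coincide
```

The printed empirical mean of M - η_{N-i} approaches (M - θ_N)Γ(i + 1 + 1/α)/Γ(i + 1), i.e. the quantity (M - θ_N)b_i of Lemma 7.1.2 below.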

Let us prove now two basic auxiliary statements.



Lemma 7.1.2. Let assumption a) hold, α > 1, N → ∞, i²/N → 0 (for instance, i may be
constant). Then

    M - Eη_{N-i} ~ (M - θ_N) b_i    (7.1.10)

where

    b_i = Γ(i + 1 + 1/α)/Γ(i + 1).

Proof. It is well known (cf. e.g. Galambos (1978)) that the density of the order statistic
η_{N-i} is

    p_{N-i}(x) = N C^i_{N-1} F^{N-i-1}(x) φ(x)(1 - F(x))^i

where φ(x) is the density of the c.d.f. F(x). Then it follows that

    Eη_{N-i} = N C^i_{N-1} ∫_{-∞}^{∞} x F^{N-i-1}(x) φ(x)(1 - F(x))^i dx.

Substituting the variable y = F(x), we obtain

    Eη_{N-i} = N C^i_{N-1} ∫_0^1 F^{-1}(y) y^{N-i-1}(1 - y)^i dy.

Using (7.1.7) we have for N → ∞

    Eη_{N-i} ~ M I_1 - (M - θ_N) I_2,

where

    I_1 = N C^i_{N-1} ∫_0^1 y^{N-i-1}(1 - y)^i dy = 1,

    I_2 = N C^i_{N-1} ∫_0^1 (-N log y)^{1/α} y^{N-i-1}(1 - y)^i dy =

        = (N! N^{1/α}/(i!(N - i - 1)!)) ∫_0^1 (log(1/y))^{1/α} y^{N-i-1}(1 - y)^i dy.

Changing the variable t = log(1/y), we have

    I_2 = (N! N^{1/α}/(i!(N - i - 1)!)) ∫_0^∞ t^{1/α} e^{-Nt}(e^t - 1)^i dt.

Setting x = Nt and using the asymptotic equality

    exp{x/N} - 1 ~ x/N,    (7.1.11)

we obtain

    I_2 = ((N - 1)!/(i!(N - i - 1)!)) ∫_0^∞ x^{1/α} e^{-x}(e^{x/N} - 1)^i dx ~

        ~ ((N - 1)!/(i!(N - i - 1)! N^i)) ∫_0^∞ x^{i+1/α} e^{-x} dx = h_{i,N} b_i,

where

    h_{i,N} = ((N - 1)/N)((N - 2)/N)···((N - i)/N) → 1    (7.1.12)

for N → ∞, i²/N → 0. Substituting the derived expressions for I_1 and I_2 we obtain
(7.1.10): the lemma is proved.
Note that results on the speed of convergence to the extreme value distribution are
contained in §2.10 of Galambos (1978), Smith (1982) and Falk (1983): the results mentioned
show that in the case α ≤ 1 a different asymptotic expression holds instead of (7.1.10).
This case is of minor interest and will not be considered.

Lemma 7.1.3. Let assumption a) hold. Then for α > 1, N → ∞, i²/N → 0, we have

    E(η_{N-i} - M)(η_{N-j} - M) ~ (M - θ_N)² λ_{ij}    (7.1.13)

for i ≥ j, where

    λ_{ij} = Γ(i + 1 + 2/α) Γ(j + 1 + 1/α) / (Γ(i + 1 + 1/α) Γ(j + 1)).    (7.1.14)

Proof. The joint probability density of η_{N-i} and η_{N-j} for i > j is

    p_{N-i,N-j}(x, y) = A F^{N-i-1}(x) φ(x)(F(y) - F(x))^{i-j-1} φ(y)(1 - F(y))^j

where x ≤ y and for brevity we use the notation

    A = N!/(j!(N - i - 1)!(i - j - 1)!).

Changing the variables similarly to those in the proof of Lemma 7.1.2, using the
asymptotic expressions (7.1.7), (7.1.11) and (7.1.12), introducing the notations

    B = (M - θ_N)²/(j!(i - j - 1)!),  D = A(M - θ_N)² N^{2/α},

and integrating at the end of the proof, we obtain the chain of relations
and integrating by parts at the end of the proof we obtain the chain of relations

E(η_{N-i} - M)(η_{N-j} - M) =

    = ∫_{-∞}^{∞} dy ∫_{-∞}^{y} (x - M)(y - M) p_{N-i,N-j}(x, y) dx =

    = A ∫_0^1 dv ∫_0^v (F^{-1}(u) - M)(F^{-1}(v) - M) u^{N-i-1}(v - u)^{i-j-1}(1 - v)^j du ~

    ~ D ∫_0^1 dv ∫_0^v (log(1/u))^{1/α}(log(1/v))^{1/α} u^{N-i-1}(v - u)^{i-j-1}(1 - v)^j du =

    = D ∫_0^∞ dy ∫_y^∞ x^{1/α} y^{1/α} exp{-x(N - i)}(e^{-y} - e^{-x})^{i-j-1}(1 - e^{-y})^j e^{-y} dx ~

    ~ h_{i,N} B ∫_0^∞ dv v^{j+1/α} ∫_v^∞ u^{1/α} e^{-u}(u - v)^{i-j-1} du =

    = B ∫_0^∞ v^{j+1/α} dv ∫_0^∞ (z + v)^{1/α} e^{-(z+v)} z^{i-j-1} dz =

    = B ∫_0^∞ ∫_0^∞ v^{j+2/α} e^{-v}(1 + z/v)^{1/α} e^{-z} z^{i-j-1} dv dz =

    = B ∫_0^∞ ∫_0^∞ (1 + y)^{1/α} y^{i-j-1} v^{2/α+i} exp{-v(y + 1)} dv dy =

    = B Γ(i + 1 + 2/α) ∫_0^∞ y^{i-j-1}(1 + y)^{1/α - 2/α - i - 1} dy = (M - θ_N)² λ_{ij}.

The lemma is proved.

The asymptotic equalities (7.1.10) and (7.1.13) were formulated by Cooke (1979)
without proof, applying the inexact assumption i/N → 0 instead of i²/N → 0 for N → ∞.
The inadequacy of the condition i/N → 0 (N → ∞) follows from the fact that in this case
(7.1.12) does not hold. Indeed, if N → ∞ and i/N → 0, but i²/N → 0 does not necessarily hold,
we have

    h_{i,N} = Π_{j=1}^{i}(1 - j/N) = exp{Σ_{j=1}^{i} log(1 - j/N)} ~

            ~ exp{-Σ_{j=1}^{i} j/N} ~ exp{-i²/(2N)}

instead of (7.1.12). So if i²/N does not approach zero while N → ∞, then

    lim_{N→∞} h_{i,N} ≠ 1

and the asymptotic equalities (7.1.10) and (7.1.13) are not valid. Instead of them the
asymptotic representations of the following corollary hold.

Corollary 7.1.1. Let assumption a) hold, α > 1, N → ∞, i ≥ j, i/N → 0. Then

    M - Eη_{N-i} ~ (M - θ_N) exp{-i²/(2N)} b_i

(and analogously for (7.1.13)). In particular, if the limit value

    l = lim_{N→∞} i²/N

exists and is finite, then

    M - Eη_{N-i} ~ (M - θ_N) exp{-l/2} b_i.

We shall further omit the multiplicator exp{-i²/(2N)} or exp{-l/2}, supposing i²/N → 0 for
N → ∞. Modifications of the statements given below for the more general case, when
i²/N → l < ∞ while N → ∞, are obvious.

7.1.3 Estimation of M

We suppose again that condition a) holds and the value of the tail index α > 1 is known.
Under these suppositions we consider below various estimates of the maximum (essential
supremum) M = vrai sup y of a random variable y. The estimates use the k+1 upper order
statistics (7.1.2) corresponding to an independent sample of y.
The most well-known estimates of M are linear, having the form

    M_{N,k} = Σ_{i=0}^{k} a_i η_{N-i}.    (7.1.15)

Lemma 7.1.2 states that if a) holds, α > 1, N → ∞ and k²/N → 0, then

    E M_{N,k} = Σ_{i=0}^{k} a_i Eη_{N-i} = M Σ_{i=0}^{k} a_i - (M - θ_N) a'b + o(M - θ_N)    (7.1.16)

where a = (a_0, ..., a_k)', b = (b_0, ..., b_k)' and

    b_i = Γ(i + 1 + 1/α)/Γ(i + 1).

Since the c.d.f. F(t) is continuous, θ_N < M and M - θ_N → 0 for N → ∞. Using now
(7.1.16), the finiteness of the variances of η_{N-i} for i = 1, ..., k and the Chebyshev
inequality, we obtain the following statement.

Proposition 7.1.1. Let assumption a) hold, N → ∞, k²/N → 0. Then the estimate
(7.1.15) is consistent if and only if the equality

    a'λ = Σ_{i=0}^{k} a_i = 1    (7.1.17)

holds, where λ = (1, 1, ..., 1)'.

Note that the statement is true for arbitrary α (not only for the case α > 1).
Lemmas 7.1.2 and 7.1.3 immediately imply the next statement.

Proposition 7.1.2. Let assumption a) hold, α > 1, N → ∞, k²/N → 0. Then for
consistent linear estimates M_{N,k} of the form (7.1.15), the asymptotic expressions

    M - E M_{N,k} ~ (M - θ_N) a'b,    (7.1.18)

    E(M_{N,k} - M)² ~ (M - θ_N)² a'Λa    (7.1.19)

are valid, where Λ is the symmetric matrix of order (k+1)×(k+1) with elements λ_{ij}
defined for i ≥ j by (7.1.14).
We refer to (7.1.17) as the consistency condition and to

    a'b = 0    (7.1.20)

as the unbiasedness requirement. Certainly, if (7.1.20) holds, then the estimate (7.1.15)
still remains biased, but for α > 1 its bias has the order O(N^{-1}) as N → ∞.
Choose the right-hand side of (7.1.19) as the optimality criterion for consistent linear
estimates (7.1.15) in the case α > 1. The optimal consistent estimate M_{N,k}* and the
optimal consistent unbiased estimate M_{N,k}+ (the word consistent will be dropped) are
determined by the vectors

    a* = arg min {a'Λa : a'λ = 1}

and

    a+ = arg min {a'Λa : a'λ = 1, a'b = 0},

correspondingly. The explicit forms of a* and a+ are

    a* = Λ^{-1}λ/(λ'Λ^{-1}λ),    (7.1.21)

(7.1.22)

From (7.1.21) there follows also

(7.1.23)

These expressions are easily derived viz. introducing Lagrange multipliers. (7.1.21) and
(7.1.22) are due to Cook (1980) and Hall (1982), respectively.
If k is not small enough, then the vectors (7.1.21) and (7.1.22) are hard to calculate,
since the determinant of the matrix A=Ak is almost zero. Namely, the following statement
holds.

Proposition 7.1.3. For α > 1, the weak equivalence

    (7.1.24)

holds for k → ∞.
The proposition above will be proved in Section 7.3.2. Fortunately, simple
expressions for the components of the vectors (7.1.21) and (7.1.22) can be derived (see
Section 7.3.1). Using these, the components of a* = (a_0*, ..., a_k*)' and a+ = (a_0+, ..., a_k+)'
can be easily calculated for any α > 0 and k = 1, 2, .... The following tables present them for
α = 2, 5, 10 and k = 2, 4, 6.

Table 3. Components of a*.

 k   α    a_0*    a_1*    a_2*    a_3*    a_4*    a_5*    a_6*   (a*)'Λa*

 2   2   1.636   0.273  -0.909                                    0.545
 4   2   1.314   0.219   0.146   0.109  -0.788                    0.438
 6   2   1.157   0.193   0.129   0.096   0.077   0.064  -0.716    0.386
 2   5   2.598   1.237  -2.835                                    0.384
 4   5   1.811   0.863   0.719   0.634  -3.027                    0.268
 6   5   1.439   0.685   0.571   0.504   0.458   0.424  -3.082    0.213
 2  10   4.246   2.895  -6.140                                    0.354
 4  10   2.766   1.886   1.714   1.607  -6.972                    0.231
 6  10   2.092   1.427   1.297   1.216   1.158   1.113  -7.303    0.175

Table 4. Components of a+.

 k   α    a_0+    a_1+    a_2+    a_3+    a_4+    a_5+    a_6+   (a+)'Λa+

 2   2   2.000   0.333  -1.333                                    0.667
 4   2   1.440   0.240   0.160   0.120  -0.960                    0.480
 6   2   1.224   0.204   0.136   0.102   0.082   0.068  -0.816    0.408
 2   5   3.500   1.667  -4.167                                    0.518
 4   5   2.117   1.008   0.840   0.741  -3.706                    0.313
 6   5   1.598   0.761   0.634   0.560   0.509   0.471  -3.533    0.236
 2  10   6.000   4.091  -9.091                                    0.501
 4  10   3.332   2.272   2.065   1.936  -8.606                    0.278
 6  10   2.377   1.621   1.473   1.381   1.315   1.265  -8.432    0.198
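The entries of Tables 3 and 4 can be reproduced from the closed-form components u_i, A and B given in Theorem 7.3.2 (Section 7.3.1). A short sketch (the function name is ours):

```python
# Reproducing Tables 3 and 4 from the closed-form components of
# Theorem 7.3.2: u_i, A = u_0 + ... + u_k and B.
from math import gamma

def optimal_weights(alpha, k):
    u = [(alpha + 1) / gamma(1 + 2 / alpha)]
    u += [(alpha - 1) * gamma(i + 1) / gamma(i + 1 + 2 / alpha)
          for i in range(1, k)]
    u.append(-(alpha * k + 1) * gamma(k + 1) / gamma(k + 1 + 2 / alpha))
    A = sum(u)
    B = gamma(k + 1) / gamma(k + 1 + 2 / alpha)
    a_star = [ui / A for ui in u]               # optimal consistent estimate
    a_plus = [ui / (A - B) for ui in u[:-1]]    # optimal unbiased estimate
    a_plus.append((u[-1] - B) / (A - B))
    return a_star, 1 / A, a_plus, 1 / (A - B)   # weights and criteria

a_star, c1, a_plus, c2 = optimal_weights(2, 2)
print([round(x, 3) for x in a_star], round(c1, 3))
print([round(x, 3) for x in a_plus], round(c2, 3))
# -> [1.636, 0.273, -0.909] 0.545  and  [2.0, 0.333, -1.333] 0.667
```

These values match the first rows of Tables 3 and 4 above.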

It is also proved in Section 7.3 that the optimal linear estimates are asymptotically
Gaussian and efficient. In particular, the following result holds (being included in
Theorem 7.3.1).
Let the (somewhat stricter than a)) condition

    1 - F(t) = c_0(M - t)^α (1 + o(1)),  t → M,    (7.1.25)

hold, where 2 ≤ α < ∞ and c_0 is a positive number, N → ∞, k → ∞, k²/N → 0. Then the
asymptotic normality relation

    (M - M*_{N,k})/σ_{N,k} → N(0, 1)    (7.1.26)

is valid, where

    σ²_{N,k} = (M - θ_N)²(1 - 2/α)k^{-(1-2/α)}   if α > 2,
    σ²_{N,k} = (M - θ_N)²/log k                  if α = 2,

is the asymptotic mean square error of M*_{N,k} (i.e. E(M*_{N,k} - M)² ~ σ²_{N,k} for N → ∞,
k → ∞, k²/N → 0).
An analogous result is valid for the estimates M_{N,k}+ and for related ones. Thus
Csorgo and Mason (1989) show it for linear estimates determined by the vectors a with
components

    a_i = v_i                                   for α > 2, i = 0, ..., k-1,
    a_k = v_k + 2 - α                           for α > 2, i = k,
    a_0 = 2/log(k + 1)                          for α = 2, i = 0,    (7.1.27)
    a_i = log(1 + 1/i)/log(k + 1)               for α = 2, 1 ≤ i ≤ k-1,
    a_k = (log(1 + 1/k) - 2)/log(k + 1)         for α = 2, i = k,

where

    v_j = (α - 1)(k + 1)^{2/α - 1}((j + 1)^{1-2/α} - j^{1-2/α}).

Hall (1982) does the same for the maximum likelihood estimates that are determined by
(4.2.22) and (4.2.23).
For practical use, the very simple estimate

    M^0_{N,k} = η_N + c_k(η_N - η_{N-k})

may also be recommended, where c_k = b_0/(b_k - b_0) is found from the unbiasedness
condition (7.1.20). For large values of α (this case is of great practical significance)

    b_0 = Γ(1 + 1/α) ≈ Γ(1) + α^{-1}Γ'(1),  b_k ≈ 1 + α^{-1}ψ(k + 1),

and so

    c_k = (Γ(1) + α^{-1}Γ'(1))/(1 + α^{-1}ψ(k + 1) - Γ(1) - α^{-1}Γ'(1)) ≈ (α + ψ(1))/(ψ(k + 1) - ψ(1)),

where ψ(·) = Γ'(·)/Γ(·) is the psi-function and -ψ(1) ≈ 0.5772 is the Euler constant.
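As an illustration of the simple estimate M^0_{N,k} = η_N + c_k(η_N - η_{N-k}) with c_k = b_0/(b_k - b_0), the following sketch applies it to samples from the model c.d.f. F(t) = 1 - (1 - t)^α on [0,1] (an assumption made here; then M = 1):

```python
# The simple estimate M0 = eta_N + c_k (eta_N - eta_{N-k}) with
# c_k = b_0/(b_k - b_0), tried on samples from the assumed model c.d.f.
# F(t) = 1 - (1 - t)**alpha on [0, 1], whose upper bound is M = 1.
import heapq
import math
import random

random.seed(1)
alpha, N, k, reps = 2.0, 20000, 10, 100

b = lambda i: math.gamma(i + 1 + 1 / alpha) / math.gamma(i + 1)
c_k = b(0) / (b(k) - b(0))              # unbiasedness condition (7.1.20)

def estimate():
    sample = (1 - random.random() ** (1 / alpha) for _ in range(N))
    top = heapq.nlargest(k + 1, sample)  # eta_N >= ... >= eta_{N-k}
    return top[0] + c_k * (top[0] - top[k])

est = sum(estimate() for _ in range(reps)) / reps
print(est)   # averages close to the true upper bound M = 1
```

Under this particular model the estimate is in fact exactly unbiased, since the ratio b_k/b_0 of the expected gaps is the same at every N.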

7.1.4 Confidence intervals for M


As indicated in Section 4.2.3, to construct confidence intervals for M one can use
the asymptotic normality of the mentioned estimates, or the Chebyshev inequality, which
together with (7.1.19) and (7.1.25) gives (4.2.24). The only point to be investigated here is
the estimate (4.2.20) of c_0.

Proposition 7.1.4. Let (7.1.25) hold, N → ∞, k → ∞, k²/N → 0. Then the estimate

    ĉ_0 = k/(N(η_N - η_{N-k})^α)    (7.1.28)

for the constant c_0 is consistent, asymptotically unbiased, and

    E(ĉ_0 - c_0)² ~ c_0²/k,  k → ∞.    (7.1.29)

Proof. Using (7.1.25) and Lemma 7.1.1 we have

    M - η_{N-i} ~ ((ξ_0 + ... + ξ_i)/(c_0 N))^{1/α}

for N → ∞ and each i ≤ k, where ξ_0, ξ_1, ... are independent and exponentially distributed
with the density e^{-x}, x > 0. Therefore

    E ĉ_0 ~ k c_0 E(ξ_0 + ... + ξ_k)^{-1} = k c_0/k = c_0,  k → ∞.

Analogously

    E ĉ_0² ~ k² c_0² E(ξ_0 + ... + ξ_k)^{-2} = k² c_0²/(k(k - 1)) ~ c_0²,  k → ∞.

The consistency of (7.1.28) immediately follows from (7.1.29) and the Chebyshev
inequality. The proposition is proved.
An alternative way of estimating c_0 is due to Hall (1982); it consists in setting

    ĉ_0 = k/(N(M̂ - η_{N-k})^α)    (7.1.30)

where M̂ is an estimate for M. If M̂ is the maximum likelihood estimate (MLE) for M,
then (7.1.30) is the MLE for c_0.
The above approach to constructing confidence intervals for M can be used only if N is
so large that k can also be chosen large enough. For moderate values of N, this approach
is not suitable and another one, due to Cooke (1979), Watt (1980) and Weissman (1981),
can be recommended. It is based on the following statement, which we prove since the
above references do not contain the proof.

Lemma 7.1.4. Let assumption a) hold, N → ∞, and let k be fixed. Then the sequence of
random variables

    D_{N,k} = (M - η_N)/(η_N - η_{N-k})    (7.1.31)

converges in distribution to the random variable x_k with c.d.f.

    F_k(u) = 1 - (1 - (u/(1 + u))^α)^k,  u ≥ 0.    (7.1.32)

Proof. Set w = (1 + 1/u)^α - 1 and note that w ≥ 0.
Using (7.1.8) and (7.1.9) we obtain

    Pr{D_{N,k} < u} → Pr{ξ_1 + ... + ξ_k > wξ_0} =

    = (1/Γ(k)) ∫_0^∞ dx ∫_{wx}^∞ exp{-x - t} t^{k-1} dt =

    = k ∫_w^∞ y^{k-1}(y + 1)^{-k-1} dy = 1 - (w/(w + 1))^k.

The lemma is proved.


The δ-quantile r_{k,δ} of the c.d.f. (7.1.32), as determined from F_k(r_{k,δ}) = δ, is easily
derived:

    r_{k,δ} = t/(1 - t),  t = (1 - (1 - δ)^{1/k})^{1/α}.

This way the confidence level of the interval

    [η_N + r_{k,δ_1}(η_N - η_{N-k}),  η_N + r_{k,δ_2}(η_N - η_{N-k})]    (7.1.33)

for M asymptotically equals δ_2 - δ_1, where 0 ≤ δ_1 < δ_2 ≤ 1 (for N → ∞, k/N → 0). In many
applications (including global random search theory), the one-sided confidence intervals

    [η_N, η_N + r_{k,1-γ}(η_N - η_{N-k})]    (7.1.34)

for M (which can be obtained from (7.1.33) by setting δ_1 = 0, δ_2 = 1-γ) are most naturally
used. Let us investigate their average length in order to draw conclusions on the necessary
number of the order statistics η_{N-i}.
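The asymptotic confidence level of (7.1.34) can be checked empirically; a sketch under the model c.d.f. F(t) = 1 - (1 - t)^α (an assumption; M = 1), with the δ-quantile r_{k,δ} of (7.1.32) solved in closed form inside the code:

```python
# Empirical confidence level of the one-sided interval (7.1.34), using
# the delta-quantile of the limit c.d.f. (7.1.32) in closed form.
# Model c.d.f. (assumed): F(t) = 1 - (1 - t)**alpha, so M = 1.
import heapq
import random

random.seed(2)
alpha, N, k, gamma_, reps = 2.0, 3000, 5, 0.1, 1000

def r_quantile(k, delta, alpha):
    t = (1 - (1 - delta) ** (1 / k)) ** (1 / alpha)
    return t / (1 - t)

r_up = r_quantile(k, 1 - gamma_, alpha)
hits = 0
for _ in range(reps):
    top = heapq.nlargest(
        k + 1, (1 - random.random() ** (1 / alpha) for _ in range(N)))
    if top[0] + r_up * (top[0] - top[k]) >= 1.0:   # interval covers M
        hits += 1
print(hits / reps)   # close to the nominal level 1 - gamma = 0.9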

Proposition 7.1.5. Let condition a) hold, N → ∞, k²/N → 0, and let γ ∈ (0,1) be a fixed
number. Then the average length of the confidence interval (7.1.34) for M asymptotically
equals (M - θ_N)φ(k,γ), where

    φ(k,γ) = r_{k,1-γ}[Γ(k + 1 + 1/α)/Γ(k + 1) - Γ(1 + 1/α)] ~ (-log γ)^{1/α}    (7.1.35)

for k → ∞.

Proof. The average interval length equals

    E r_{k,1-γ}(η_N - η_{N-k}) = r_{k,1-γ}(Eη_N - Eη_{N-k}) ~ (M - θ_N) r_{k,1-γ}(b_k - b_0),  k → ∞.

Using the Stirling representation, we have

    Γ(k + 1 + 1/α)/Γ(k + 1) ~ k^{1/α},  k → ∞.

Consequently, for k → ∞ one obtains

    φ(k,γ) ~ k^{1/α}(1 - γ^{1/k})^{1/α} = (k(1 - γ^{1/k}))^{1/α} ~ (-log γ)^{1/α}.

The proposition is proved.


Numerical results seem to indicate that φ(k,γ) differs from its limit value (-log γ)^{1/α}
insignificantly even for rather moderate values of k (e.g. for k ∈ [5,7]), and for k ≈ 10 it
almost reaches it. This way, even for very large N, the choice k = 10 is already in good
match with the asymptotic requirements k → ∞, k/N → 0 (N → ∞). The numerical results
also demonstrate that the convergence rate in (7.1.35) increases if γ decreases.
Note that analogous conclusions about the selection of k, via numerical analysis of the
two-sided confidence intervals (7.1.33), were drawn by Weissman (1981), who did not
deduce asymptotic expressions similar to (7.1.35).

7.1.5 Testing statistical hypotheses about M


Consider the problem of testing the statistical hypothesis H_0: M ≥ K versus the alternative
H_1: M < K, where K is a fixed value (K > η_N).
To accomplish a test, one can construct a one-sided confidence interval for M of a
fixed confidence level 1-γ, following Section 7.1.4, and reject H_0 if K does not fall into
the interval. We shall investigate below the test determined by the rejection region

    W = {Y: (K - η_N)/(η_N - η_{N-k}) ≥ r_{k,1-γ}}    (7.1.36)

which corresponds to (7.1.34). According to this test, H_0 is rejected if
(K - η_N)/(η_N - η_{N-k}) ≥ r_{k,1-γ}, and accepted otherwise. This test was proposed and
investigated by Cooke (1979) for the particular case k = 1; note, however, that Cooke's
results need some correction. Below we approximate the power function
β_N(M,γ) = Pr{Y ∈ W} of the test. Set

    T(u,v) = Γ(u + 1, v)/Γ(u + 1)

where Γ(·) is the gamma function,

    Γ(u + 1, v) = ∫_v^∞ t^u e^{-t} dt,

and introduce the abbreviations

    κ = (K - M)/(M - θ_N),  z = r_{k,1-γ},

    δ = αγ(γ^{-1/k} - 1),  (a)_+ = max{0, a} for any a ∈ R.

Theorem 7.1.2. Let condition a) hold, α > 0, N → ∞, k/N → 0. Then

1) β_N(M,γ) ~ a_N(M,γ) for N → ∞, where a_N(M,γ) is defined in the proof below;

2) β_N(K,γ) → γ for N → ∞;

3) a_N(M,γ) is a decreasing function of M;

4) the asymptotic equality

    β_N(M,γ) ~ 1 - (1 - γ)T(k,Λ) + Λ^{1/α} δ Γ(k + 1 - 1/α, Λ)/Γ(k),  Λ = (κ/z)^α,

is valid for N → ∞ and M < K;

5) for N → ∞ and M > K an analogous expansion, in terms of T(k, (-κ)^α), holds.
Proof. Introduce the notations

Using (7.1.8) one obtains for N → ∞ the chain of (approximate) relations

    β_N(M,γ) ~ (1/Γ(k)) ∫∫_{D_2} (y_1 - y_0)^{k-1} exp{-y_1} dy_0 dy_1 = a_N(M,γ).

This is the first statement of the theorem.


The second assertion follows from the first one and the third is evident. Let us
consider now the fourth statement.
Note that in the case Λ → 0 for N → ∞, κ > 0 for M < K and

    δ = α(z/(1 + z))^α (1 - (z/(1 + z))^α)^{k-1}.

Thus for N → ∞ we obtain

    a_N(M,γ) = 1 - (1/k!) Σ_{j=1}^{k} C_k^j (-1)^{j-1} (z/(1 + z))^{αj} ∫_Λ^∞ y^k e^{-y}(1 - (Λ/y)^{1/α})^{αj} dy ~

    ~ 1 - (1/k!) Σ_{j=1}^{k} C_k^j (-1)^{j-1} (z/(1 + z))^{αj} ∫_Λ^∞ y^k e^{-y}(1 - αj Λ^{1/α} y^{-1/α}) dy =

    = 1 - (1 - γ)T(k,Λ) + Λ^{1/α} δ Γ(k + 1 - 1/α, Λ)/Γ(k).

Applying the first assertion, we arrive at the desired result.


To prove the fifth statement, note that κ < 0 for M > K, and then proceed analogously: one
obtains an expansion whose leading term is T(k, (-κ)^α).

The theorem is proved.


The second and third assertions of the theorem show that the probability of the first
type error of the test asymptotically does not exceed γ.
Let us investigate the asymptotic behaviour of the second type error probability for
k → ∞, which equals 1 - β_N(M,γ) where M < K.

Lemma 7.1.5. If k → ∞, then

    r_{k,1-γ} ~ ((-log γ)/k)^{1/α}    (7.1.37)

for each γ ∈ (0,1).

The proof is a straightforward application of the l'Hospital rule.

Lemma 7.1.6. If c > 1, then

    lim_{k→∞} T(k, ck) = 0.

Proof. Represent T(k,ck) as

    T(k,ck) = (1/k!) ∫_{ck}^∞ y^k e^{-y} dy = ((ck)^{k+1}/k!) I,

where

    I = ∫_1^∞ exp{k(log t - ct)} dt.

We shall apply the saddle-point approximation to the integral I. The function log t - ct attains
its maximal value (-c) on the interval [1,∞) at t = 1. This way,

    I ~ exp{-kc}/(k(c - 1))  for k → ∞.

Applying the Stirling approximation, one obtains for k → ∞

    T(k,ck) ~ c^{k+1} exp{-k(c - 1)}/(√(2πk)(c - 1)) → 0.

The lemma is proved.
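For integer k the function T(k,x) admits the closed Poisson form e^{-x} Σ_{i≤k} x^i/i!, which makes both the decay of T(k,ck) for c > 1 and the quality of the saddle-point approximation easy to inspect (a small sketch, not from the book):

```python
# For integer k, T(k, x) = Gamma(k+1, x)/Gamma(k+1) = exp(-x) * sum_{i<=k} x**i/i!.
# Lemma 7.1.6 says T(k, ck) -> 0 for c > 1, at the Stirling/saddle-point rate.
import math

def T(k, x):                      # exact value for integer k
    term, s = 1.0, 1.0
    for i in range(1, k + 1):
        term *= x / i
        s += term
    return math.exp(-x + math.log(s))

def T_approx(k, c):               # c**(k+1) exp(-k(c-1)) / (sqrt(2 pi k)(c-1))
    return (c ** (k + 1) * math.exp(-k * (c - 1))
            / (math.sqrt(2 * math.pi * k) * (c - 1)))

c = 1.5
for k in (10, 30, 90):
    print(k, T(k, c * k), T_approx(k, c))   # both columns decrease to 0
```

The two columns agree in order of magnitude already for moderate k, and the relative error of the approximation shrinks as k grows.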

The fourth statement of Theorem 7.1.2 yields the asymptotic inequality for the
probability of the second type error

    1 - β_N(M,γ) < (1 - γ)T(k,Λ),  N → ∞.    (7.1.38)

(7.1.37) gives

    T(k,Λ) ~ T(k, k(κ^α/(-log γ)))  for k → ∞.    (7.1.39)

Since κ = (K - M)/(M - θ_N) → ∞ for N → ∞, for sufficiently large N we have
c = κ^α/(-log γ) > 1. Lemma 7.1.6 together with (7.1.38) and (7.1.39) shows that the second
kind error probability approaches zero if k → ∞ while N → ∞. At the same time, numerical
results demonstrate that the choice k = 10 is already suitable in most practical cases.
Let us note finally that the power function of the test with the rejection region

    W' = {Y: (K - η_N)/(η_N - η_{N-k}) < r_{k,γ}},

applied for testing the hypothesis H_0: M < K versus H_1: M ≥ K, is representable as
1 - β_N(M, 1-γ), and so can be approximated using the above formulas.

7.2 Statistical inference when the tail index is unknown


We shall suppose in the sequel that condition a) of Section 7.1 holds, but the value of the
tail index α is unknown. This case is typical for many classes of optimization problems
(e.g. when a discrete problem is approximated by a continuous one) as well as in some
other applications. As above, we confine ourselves to statistical inference that uses only the
first k+1 elements of the extreme order statistics (7.1.2), corresponding to the independent
sample Y. Unlike in Section 7.1, a satisfactory precision of the statistical inferences can be
guaranteed only if k is large enough, i.e. k = k(N) → ∞ for N → ∞ (while maintaining
k/N → 0).
Subsection 7.2.1 presents statistical inference procedures for estimating M; the main
point in their construction, the estimation of α, is considered in Subsection 7.2.2.
Subsection 7.2.3 deals with the construction of confidence intervals and statistical
hypothesis testing for α, which can be useful in some global optimization problems
(e.g. investigating whether the objective function attains its maximal value inside a subset of
the feasible region; recall Sections 4.2 and 4.3).

7.2.1 Statistical inference for M

An ordinary way of drawing statistical inference concerning M, when the tail index α is
unknown, consists of substituting some estimator α̂ of α for α in the formulas
determining the statistical inference for the case of known α. Obviously, the accuracy of
such statistical inference is the main problem arising here. The most advanced results in
this field were obtained for the case when α and M are estimated by the maximum
likelihood technique; below we shall state some of them.
First let us follow Hall (1982) to construct maximum likelihood estimators for M and
formulate their properties.
Suppose that instead of the asymptotic equality (7.1.25), the relation

    F(t) = 1 - c_0(M - t)^α

takes place for each t in some interval [M-ε, M], where c_0 > 0, α ≥ 2 and M are unknown
parameters, and the order statistics η_N, ..., η_{N-k} fall into the interval [M-ε, M]. Under
these suppositions the likelihood function is

Maximizing this function with respect to M, c_0 and α, one obtains the maximum
likelihood estimators M̂, ĉ_0 and α̂, expressed as follows: M̂ is the minimal solution of the
equation

    (k + 1)[1/Σ_{j=0}^{k} log(1 + β_j(M̂)) - 1/Σ_{j=0}^{k} β_j(M̂)] = 1,

provided M̂ ≥ η_N (if the solution does not exist, then η_N is taken as M̂),

    α̂ = (k + 1)/Σ_{j=0}^{k-1} log(1 + β_j(M̂)),    (7.2.1)

and ĉ_0 is then expressed through M̂ and α̂; in the above formulas the abbreviation

    β_j(M) = (η_N - η_{N-j})/(M - η_N)

is used.
is used.
Hall (1982) proved the asymptotic normality of the obtained estimate M̂, with mean
M and variance (α - 1)²σ²_{N,k}, where

    σ²_{N,k} = (c_0 N)^{-2/α}(1 - 2/α)k^{-(1-2/α)}   for α > 2,
    σ²_{N,k} = (c_0 N)^{-1}/log k                    for α = 2,    (7.2.2)

and the asymptotic normality of α̂ with mean α and variance

    (7.2.3)

provided that the stronger assumption on F

    (7.2.4)

holds for t → M, where b ≠ 0 and β > 0, and

    k = k(N) ~ N^ρ → ∞  for N → ∞,    (7.2.5)

ρ = γ/(γ + 1/2), γ = min{1, 1/α}.
Smith (1987) used a different approach to construct the maximum likelihood
estimators. To describe it, let us introduce first the so-called generalized Pareto
distribution by

    G(t; σ, v) = 1 - (1 - vt/σ)^{1/v}   for v ≠ 0,
    G(t; σ, v) = 1 - exp{-t/σ}          for v = 0,    (7.2.6)

where σ > 0, 0 < t < ∞ for v ≤ 0 and 0 < t < σ/v for v > 0. Now let y be a random variable with
c.d.f. F(t), upper bound M ≤ ∞, and let h < M. Then

    F_h(t) = (F(h + t) - F(h))/(1 - F(h)),  0 < t < M - h,    (7.2.7)

is the conditional c.d.f. of y - h given y > h. Pickands (1975) showed that (7.2.6) is a good
approximation of (7.2.7), in the sense of the relation

    lim_{h→M} sup_{0<t<M-h} |F_h(t) - G(t; σ(h), v)| = 0    (7.2.8)

for some fixed v and function σ(h), if and only if F is in the domain of attraction of one of
the three limit probability laws (namely, Ψ_α(z) = exp{-(-z)^α} for z < 0, Φ_α(z) = exp{-z^{-α}}
for z > 0, and Λ(z) = exp(-e^{-z})). In the case of the c.d.f. Ψ_α, the constant v in (7.2.6)
equals 1/α.
Now the approach of Smith (1987) is as follows. Let N be sufficiently large,
y_1, ..., y_N be independent realizations of the random variable y with c.d.f. F(t), h = h(N) be
a high threshold value, k be the number of exceedances of h, and x_1, ..., x_k denote the
corresponding excesses. That is, x_i = y_j - h where j = j(i) is the index of the i-th exceedance.
Under fixed N, the excesses x_1, ..., x_k are independent and have the c.d.f. (7.2.7).
Relying upon (7.2.8), the generalized Pareto c.d.f. G(t; σ, v) is substituted for (7.2.7) in
the construction of the likelihood function. This way, its maximization yields the maximum
likelihood estimators σ̂_N and v̂_N for σ and v, respectively. The corresponding estimator
for M is h + σ̂_N/v̂_N.
Smith (1987) extensively studied the asymptotic properties (including the asymptotic
normality and efficiency) of these estimators (for M and v = 1/α) under fairly general

conditions on F. Smith's results cover the case 0< a <2, together with the cases of a,2:.2
and the other two limit laws for the extremes.
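Smith's actual estimators are maximum likelihood; as a lightweight stand-in, the sketch below fits the generalized Pareto law (7.2.6) to the excesses by the method of moments (our substitution, not the book's) and then estimates M as h + σ̂/v̂:

```python
# Threshold-excess estimate of M via the generalized Pareto law (7.2.6).
# The book uses maximum likelihood; a method-of-moments fit is used here
# instead (a simplification). Model c.d.f. (assumed): F(t) = 1 - (1-t)**2,
# for which the excesses over h are exactly GPD with v = 1/2 and M = 1.
import random

random.seed(4)
N, h, reps = 100000, 0.9, 20

def estimate_M():
    y = (1 - random.random() ** 0.5 for _ in range(N))
    x = [yi - h for yi in y if yi > h]      # excesses over the threshold h
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    v = (mean * mean / var - 1) / 2         # moment estimates of v and sigma
    sigma = mean * (1 + v)
    return h + sigma / v                    # GPD upper endpoint h + sigma/v

est = sum(estimate_M() for _ in range(reps)) / reps
print(est)   # close to the true upper bound M = 1
```

The moment inversion uses the GPD identities mean = σ/(1+v) and mean²/var = 1 + 2v, valid for this parametrization when v > 0.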
To construct confidence intervals for M and to test statistical hypotheses about M in the
case of unknown α, one can use the above-mentioned results of Hall and Smith
concerning the asymptotic normality of the maximum likelihood estimates of M. Recall
again that, generally, this approach is applicable in the case when N is very large. The
alternative techniques of de Haan (1981) and Weissman (1982) seem more suitable if k
can not be chosen very large (this holds e.g. for moderate values of N, say
N ≈ 100÷200). De Haan proved that for N → ∞, k → ∞, k/N → 0, the test statistics (4.2.34)
converge in distribution to the standard exponential random variable with density e^{-t}, t > 0.
(Similar test statistics were considered by Weissman.)

7.2.2 Estimation of α

The estimation of the tail index α is a major task in drawing statistical inference about M.
It is important also in some other tail estimation problems and is often stated not only in
connection with the extreme value distribution Ψ_α but for all three extreme value
distributions (for references, see Smith (1987)).
A number of estimators of α are known, cf. Csorgo et al. (1985), Smith (1987), the
above-mentioned maximum likelihood estimators, as well as the formulas

    α̂ = log 2 / log[(η_{N-2m} - η_{N-4m})/(η_{N-m} - η_{N-2m})]    (7.2.9)

and

    α̂ = log(k/m) / log[(η_N - η_{N-k})/(η_N - η_{N-m})]    (7.2.10)

where N → ∞, m → ∞, k → ∞, m < k, k/N → 0. The estimator (7.2.9) was proposed by
Pickands (1975) and thoroughly investigated e.g. by Dekkers and de Haan (1987).
(7.2.10) was proposed by Weiss (1971), who also formulated some of its asymptotic
properties. Below we derive some more general results concerning (7.2.10) and modify
it to reduce its bias.

Theorem 7.2.1. Let condition a) of Section 7.1 hold, α > 1, N → ∞, k → ∞, k/N → 0,
m/k → τ where 0 < τ < 1. Then the estimator (7.2.10) for α is consistent, asymptotically
unbiased, and there holds the relation

    E(α̂ - α)² ~ (α²/k)(v_τ + u_τ/k + o(1/k)),  k → ∞,    (7.2.11)

where

    v_τ = (1 - τ)/(τ log² τ),

    u_τ = (9 + 20(1 - τ²)log τ + (2 + τ + 2τ²)log² τ)/(τ² log⁴ τ).

Proof. According to (7.1.8), M - η_{N-i} ~ (M - θ_N)μ_i^{1/α} for i ≤ k, N → ∞, where the random
variable μ_i has the density x^i e^{-x}/Γ(i + 1). Using this approximation we have, for N → ∞,
k → ∞, k²/N → 0, m/k → τ,

    E α̂ = (-log τ) E [log ((μ_k/k)^{1/α} - (μ_0/k)^{1/α})/((μ_m/k)^{1/α} - (μ_0/k)^{1/α})]^{-1} ~

    ~ α(-log τ) ∫_0^∞ ∫_0^∞ [log(1 + s/t)]^{-1} exp{-t - s} t^m s^{k-m-1}/(m!(k - m - 1)!) dt ds =

    = α(-log τ) k! S/(m!(k - m - 1)!)

where, for m/k → τ,

    S = ∫_0^∞ t^{-1} e^{-kt}(e^t - 1)^{k-m-1} dt ~ S_1 = ∫_0^∞ g(t) exp{k s(t)} dt,

    g(t) = t^{-1},  s(t) = (1 - τ)log(e^t - 1) - t.

By the Stirling approximation, for k → ∞, m → ∞,

    k!/(m!(k - m - 1)!) ~ (k/(2π))^{1/2} τ^{-m-1/2}(1 - τ)^{-k+m+1/2}.

By the saddle-point approximation

    S_1 ~ g(t_0) exp{k s(t_0)}(2π/(k|s''(t_0)|))^{1/2},

where t_0 = log(1/τ) is the maximizer of s,

    g(t_0) = -1/log τ,  s(t_0) = (1 - τ)log(1 - τ) + τ log τ,  s''(t_0) = -τ/(1 - τ).

Consequently, we have

    E α̂ ~ α;

this gives the asymptotic unbiasedness of (7.2.10). Analogously,

    E α̂² ~ α²(k - m) C^m_k S_2,

where

    S_2 = ∫_0^∞ h(t) exp{k s(t)} dt,  h(t) = (1 + t^{-1} log τ)².

The saddle-point approximation gives for S_2

    S_2 ~ k^{-1/2} exp{k s(t_0)} Σ_{j≥0} a_j k^{-j},

where

    a_j = (Γ(j + 1/2)/Γ(2j + 1)) (d/dt)^{2j} [h(t)((s(t_0) - s(t))/(t_0 - t)²)^{-j-1/2}]_{t=t_0}.

Since h(t_0) = 0, a_0 = 0. The expressions for a_1 and a_2 can be computed explicitly; in
particular,

    a_1 = √(2π)(1/τ - 1)^{3/2}/log² τ.

These expressions lead to (7.2.11), which in its turn implies the consistency of (7.2.10).
The theorem is proved.

Note that the function v_τ = (1 - τ)/(τ log² τ) attains its minimal value (≈ 1.544) at
τ_0 ≈ 0.2032; therefore, to approach the minimal asymptotic variance of the estimator
(7.2.10), one has to choose m ~ k/5 for k → ∞.
Comparing (7.2.11) for τ = 0.2 with (7.2.3), one can deduce that for α ≥ 2.25 the
estimator (7.2.10) of the tail index α has smaller asymptotic variance than the maximum
likelihood estimator (7.2.1).
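A sketch of the estimator (7.2.10) with the nearly optimal ratio m ≈ k/5, again under the model c.d.f. F(t) = 1 - (1 - t)^α (an assumption; the true tail index is α):

```python
# The estimator (7.2.10) with m ~ k/5, tried on the assumed model c.d.f.
# F(t) = 1 - (1 - t)**alpha (true tail index alpha = 2, upper bound M = 1).
import heapq
import math
import random

random.seed(3)
alpha, N, k, reps = 2.0, 30000, 600, 50
m = k // 5

def alpha_hat():
    top = heapq.nlargest(
        k + 1, (1 - random.random() ** (1 / alpha) for _ in range(N)))
    return math.log(k / m) / math.log((top[0] - top[k]) / (top[0] - top[m]))

est = sum(alpha_hat() for _ in range(reps)) / reps
print(est)   # near alpha = 2 (slightly below it at finite k)
```

The downward finite-sample bias visible here comes from using η_N in place of the unknown M, which is exactly what the modified estimator (7.2.13) below aims to reduce.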
Naturally, the estimator (7.2.10) would be better if the exact optimum value M were
substituted for η_N. Since M is unknown, η_N replaces it in (7.2.10); this introduces a bias
into the estimate for any fixed k and m. To reduce the bias, let us use the estimate

    M^0_{N,r} = η_N + a_r(η_N - η_{N-r})

presented at the end of Section 7.1.3, where r is either k or m and

    a_r = (α̂ + ψ(1))/(ψ(r + 1) - ψ(1)).    (7.2.12)

Substituting M^0_{N,k} and M^0_{N,m} for η_N into the corresponding arguments of (7.2.10),
we obtain the modified estimator

    (7.2.13)

where α̂ is the Weiss estimator (7.2.10). Numerical investigations indicate that for
moderate values of k, the estimator (7.2.13) is more accurate than (7.2.10), and the
accuracy difference increases if α increases.

7.2.3 Construction of confidence intervals and statistical hypothesis tests for α

For suitably large N and k, in constructing confidence intervals and testing statistical
hypotheses for α one can apply the results of Section 7.2.1 on the asymptotic normality of
maximum likelihood estimators. We shall consider another approach, based on the
asymptotic properties of the estimator (7.2.10), which seems to be applicable also for
moderately large values of k.

Proposition 7.2.1. Let the conditions of Theorem 7.2.1 be fulfilled. Then the sequence
of random variables

    X_{N,k} = exp{(α log τ)/α̂},    (7.2.14)

where α̂ is defined by (7.2.10), converges in distribution to the random variable with the
c.d.f.

    F_k(t) = 0                                            for t ≤ 0,
    F_k(t) = t^{m+1} Σ_{i=0}^{k-m-1} C^i_{m+i}(1 - t)^i   for 0 < t < 1,    (7.2.15)
    F_k(t) = 1                                            for t ≥ 1.

Proof. We obtain from (7.2.10):

    X_{N,k} ~ V_{N,k} = [(η_N - η_{N-m})/(η_N - η_{N-k})]^α  for N → ∞, k → ∞.

Using this and (7.1.8) we get

    Pr{X_{N,k} < t} ~ Pr{V_{N,k} < t} ~ Pr{[(μ_m^{1/α} - μ_0^{1/α})/(μ_k^{1/α} - μ_0^{1/α})]^α < t} ~

    ~ Pr{μ_m/μ_k < t} = (1/(m!(k - m - 1)!)) ∫∫_{z≥0, y≥0, 1+y/z≥1/t} exp{-z - y} z^m y^{k-m-1} dz dy =

    = (k!/(m!(k - m - 1)!)) ∫_{1/t}^∞ x^{-k-1}(x - 1)^{k-m-1} dx,

where, as earlier, μ_i = ξ_0 + ... + ξ_i. Multiple integration by parts gives (7.2.15): the
proposition is proved.
The statement implies that the asymptotic level of the confidence interval (0 ≤ γ ≤ δ ≤ 1)

    [α̂ log t_{k,δ}/log τ,  α̂ log t_{k,γ}/log τ]    (7.2.16)

for α equals δ - γ, where t_{k,δ} denotes the δ-quantile of the c.d.f. (7.2.15); for illustration,
some quantile values are given in Table 5.

Table 5. Some quantiles t_{k,δ} of F_k.

   δ        k = 5       k = 10      k = 20      k = 50

 0.025    0.052745    0.066739    0.086572    0.115266
 0.05     0.076440    0.087264    0.104081    0.128558
 0.1      0.112235    0.115825    0.126925    0.144980
 0.9      0.583890    0.449603    0.360662    0.291297
 0.95     0.657409    0.506902    0.401029    0.315597
 0.975    0.716418    0.556096    0.436614    0.337182
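The quantiles in Table 5 can be reproduced by bisection on the c.d.f. (7.2.15); the sketch below assumes that the table was computed with m = k/5 (the nearly optimal ratio τ = 0.2), which matches the tabulated values:

```python
# Reproducing Table 5: the delta-quantile t_{k,delta} of the c.d.f. (7.2.15),
# found by bisection; m = k/5 is assumed (the nearly optimal ratio tau = 0.2).
from math import comb

def F(t, k, m):
    if t <= 0.0:
        return 0.0
    if t >= 1.0:
        return 1.0
    return t ** (m + 1) * sum(comb(m + i, i) * (1 - t) ** i
                              for i in range(k - m))

def quantile(delta, k, m):
    lo, hi = 0.0, 1.0
    for _ in range(60):              # bisection on the monotone c.d.f. F
        mid = (lo + hi) / 2
        if F(mid, k, m) < delta:
            lo = mid
        else:
            hi = mid
    return lo

print(round(quantile(0.9, 5, 1), 6))     # Table 5 gives 0.583890
print(round(quantile(0.025, 10, 2), 6))  # Table 5 gives 0.066739
```

The computed values agree with the table to roughly the printed precision.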

In a standard manner the confidence interval (7.2.16) may serve as the basis for
constructing statistical hypothesis tests concerning α. The rejection region of the test
for the hypothesis H_0: α ≥ α_0 against the alternative H_1: α < α_0 is

    W = {Y: α̂ < (α_0 log τ)/log t_{k,1-γ}}.

The power function for this procedure can be approximated in closed form for k → ∞.
Analogously, the rejection region of the test for the hypothesis H_0: α = α_0 versus
H_1: α = α_1 > α_0 can be

    W = {Y: α̂ > (α_0 log τ)/log t_{k,γ}}.

The power function of this test admits a similar closed-form approximation for k → ∞.

7.3 Asymptotic properties of optimal linear estimates


We suppose again that condition a) of Section 7.1 holds and that the value of the tail
index α is known. We shall prove the asymptotic normality and efficiency of the linear
estimates, and derive expressions for the components of the vectors (7.1.21) and
(7.1.22).

7.3.1 Results and consequences


The main purpose of this section is to state two important theorems.

Theorem 7.3.1. Let condition a) of Section 7.1 hold, N → ∞, k → ∞, k²/N → 0, α > 1,
and let α be known. Then for the optimal linear estimates M_{N,k}, determined by the vectors
(7.1.21) and (7.1.22), the asymptotic equality

E(M_{N,k} − M)² ∼ σ²_{N,k}   (7.3.1)

holds, where

σ²_{N,k} = (M − θ_N)² (1 − α/2)/Γ(1 + 2/α)   for 1 < α < 2,

σ²_{N,k} = (M − θ_N)² (1 − 2/α) k^{−(1−2/α)}   for α > 2,   (7.3.2)

σ²_{N,k} = (M − θ_N)² / log k   for α = 2,

and for α ≥ 2 the convergence in distribution

(M − M_{N,k})/σ_{N,k} → N(0,1)   (7.3.3)

holds, i.e. the sequences (M − M_{N,k})/σ_{N,k} are asymptotically Gaussian with zero mean and unit
variance.
The theorem will be proved in Section 7.3.3. (The proof was done in collaboration
with M. V. Kondratovich.)

Theorem 7.3.2. The components a_i* and a_i+ of the vectors (7.1.21) and (7.1.22) are
representable as

a_i* = u_i/A   for i = 0,1,...,k,

a_i+ = u_i/(A − B)   for i = 0,1,...,k−1,

a_k+ = (u_k − B)/(A − B),

where for α > 0

u_0 = (α + 1)/Γ(1 + 2/α),

u_i = (α − 1)Γ(i + 1)/Γ(i + 1 + 2/α)   for i = 1,...,k−1,

u_k = −(αk + 1)Γ(k + 1)/Γ(k + 1 + 2/α),

A = (αΓ(k + 2)/Γ(k + 1 + 2/α) − 2/Γ(1 + 2/α))/(α − 2)   for α > 0, α ≠ 2,

A = Σ_{i=0}^{k} 1/(i + 1)   for α = 2,

and also A = u_0 + ⋯ + u_k,

B = Γ(k + 1)/Γ(k + 1 + 2/α).

The theorem will be proved at the end of Section 7.3.2.
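The quantities in Theorem 7.3.2 are straightforward to evaluate numerically. The following sketch (ours, with an arbitrary choice of α and k; log-gamma is used to avoid overflow) checks the identity A = u_0 + ⋯ + u_k and illustrates that a_k* = u_k/A approaches 2 − α for α > 2 as k grows.

```python
from math import lgamma, exp

def gamma_ratio(a, b):
    """Gamma(a) / Gamma(b) via log-gamma, to avoid overflow for large arguments."""
    return exp(lgamma(a) - lgamma(b))

def u_i(alpha, k, i):
    """Component u_i of Theorem 7.3.2."""
    if i == 0:
        return (alpha + 1) * gamma_ratio(1.0, 1 + 2 / alpha)
    if i == k:
        return -(alpha * k + 1) * gamma_ratio(k + 1, k + 1 + 2 / alpha)
    return (alpha - 1) * gamma_ratio(i + 1, i + 1 + 2 / alpha)

def A_closed(alpha, k):
    """Closed form of A (alpha > 0, alpha != 2)."""
    return (alpha * gamma_ratio(k + 2, k + 1 + 2 / alpha)
            - 2 * gamma_ratio(1.0, 1 + 2 / alpha)) / (alpha - 2)

alpha = 3.0
print(A_closed(alpha, 400), sum(u_i(alpha, 400, i) for i in range(401)))  # equal
print(u_i(alpha, 10**6, 10**6) / A_closed(alpha, 10**6))  # a_k*, near 2 - alpha = -1
```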


When condition (7.1.25), which is somewhat stricter than a), is fulfilled, the expression
(7.2.2) resulting from (7.3.2) and, further, (4.2.38) hold instead of (7.3.2). The expression
(7.2.2) also occurs in the works of Hall (1982), Smith (1987) and of Csorgo and Mason
(1989), in connection with the maximum likelihood estimators and the linear estimators
determined by the vector with coefficients (7.1.27). Thus, under suitable conditions, the
asymptotic mean square errors of the above estimates coincide. Combining this with the
result of Hall (1982) on the asymptotic efficiency of the maximum likelihood estimators in
the class of asymptotically unbiased and asymptotically Gaussian estimators under a fixed
increase rate of k = k(N), we can conclude that the optimal linear estimators M_{N,k}* and
M_{N,k}+ are asymptotically efficient under Hall's conditions. Note that these conditions,
namely (7.2.4) and (7.2.5), are generally stricter than ours, and that similar results on the
asymptotic normality of the maximum likelihood estimates for M are obtained by Hall
(1982) and Smith (1987) for the case when α is unknown. In essence, the results for
unknown α ≥ 2 coincide with the above results for the case when α ≥ 2 is known,
after the substitution of (α − 1)²σ²_{N,k} for σ²_{N,k}.

7.3.2 Auxiliary statements and proofs of Theorem 7.3.2 and Proposition 7.1.3

In this section all matrices have the order (k+1)×(k+1), all vectors belong to R^{k+1},
λ = (1,...,1)', |·| denotes the determinant, and the abbreviation

r_{i,j} = Γ(i + 1 + j/α)   (7.3.4)

is used.

Lemma 7.3.1. Let z, d_0,...,d_k be vectors in R^{k+1} and D = ||d_0,...,d_k|| be a
nondegenerate matrix. Then

λ'D^{−1}z = |z, d_1 − d_0, ..., d_k − d_{k−1}| / |D|.   (7.3.5)

Proof. Set x = D^{−1}z. Then Dx = z and Cramer's representation gives

x_i = |D_i|/|D|,   i = 0,1,...,k,

where D_i is the matrix obtained from D by replacing its i-th column with z. This way, we have

λ'D^{−1}z = λ'x = Σ_{i=0}^{k} |D_i|/|D|.

Define the matrix

D_z = ||d_0 + z, ..., d_k + z||

and transform its determinant (terms with z in two or more columns vanish):

|D_z| = |d_0, d_1 + z, ..., d_k + z| + |z, d_1 + z, ..., d_k + z| = ... = |D| + Σ_{i=0}^{k} |D_i|.

Hence

λ'D^{−1}z = Σ_{i=0}^{k} |D_i|/|D| = |D_z|/|D| − 1.   (7.3.6)

Transform the determinant |D_z| in another way, subtracting the previous column from
each one, beginning with the last column:

|D_z| = |d_0 + z, d_1 − d_0, ..., d_k − d_{k−1}| =

= |d_0, d_1 − d_0, ..., d_k − d_{k−1}| + |z, d_1 − d_0, ..., d_k − d_{k−1}| =

= |D| + |z, d_1 − d_0, ..., d_k − d_{k−1}|.

This, together with (7.3.6), gives the desired relation: the lemma is proved.

Lemma 7.3.2. Let the vectors x = (x_0,...,x_k)' and y = (y_0,...,y_k)' consist of positive
numbers and the elements d_ij of the symmetric matrix D_k be defined by d_ij = x_i y_j for i ≥ j. Then

|D_k| = μ_k |D_{k−1}|,

where

μ_k = x_k (x_{k−1} y_k − x_k y_{k−1}) / x_{k−1}

and D_{k−1} = ||d_ij||_{i,j=0}^{k−1}.

Proof. Multiplying the last but one row by x_k/x_{k−1} and subtracting it from the last one, we
obtain a matrix whose last row is (0, ..., 0, μ_k): indeed, for j < k

d_kj − (x_k/x_{k−1}) d_{k−1,j} = x_k y_j − (x_k/x_{k−1}) x_{k−1} y_j = 0,

while for j = k

d_kk − (x_k/x_{k−1}) d_{k−1,k} = x_k y_k − (x_k/x_{k−1}) x_k y_{k−1} = μ_k.

Expanding the resulting determinant along the last row gives |D_k| = μ_k |D_{k−1}|.

The lemma is proved.
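Lemma 7.3.2 can be illustrated numerically (a sketch of ours, with arbitrary positive x and y): applying the recursion down to |D_0| = x_0 y_0 gives |D_k| = x_0 y_0 μ_1 ⋯ μ_k.

```python
def det(M):
    """Determinant by Gaussian elimination with partial pivoting."""
    M = [row[:] for row in M]
    n, d = len(M), 1.0
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        if piv != c:
            M[c], M[piv] = M[piv], M[c]
            d = -d
        d *= M[c][c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n):
                M[r][j] -= f * M[c][j]
    return d

x = [1.3, 2.1, 3.7, 5.2]          # arbitrary positive numbers
y = [0.4, 1.1, 1.9, 2.8]
n = len(x)
D = [[x[max(i, j)] * y[min(i, j)] for j in range(n)] for i in range(n)]  # d_ij = x_i y_j, i >= j
mu = [x[i] * (x[i - 1] * y[i] - x[i] * y[i - 1]) / x[i - 1] for i in range(1, n)]
prod = x[0] * y[0]
for m in mu:
    prod *= m
print(det(D), prod)   # the two values coincide
```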



Consider now the determinant

Δ_k = |z, d_1 − d_0, ..., d_k − d_{k−1}|,

where z = (z_0,...,z_k)' ∈ R^{k+1} and d_0,...,d_k are the columns of the matrix D_k defined in
Lemma 7.3.2.

Lemma 7.3.3. There holds the relation

(7.3.7)

where

μ_i = x_i (x_{i−1} y_i − x_i y_{i−1}) / x_{i−1},

φ_i = z_i − z_{i−1} x_i / x_{i−1},

ν_i = (x_{i−1} y_{i−2} − x_{i−2} y_{i−1})(x_i − x_{i−1}) / (x_{i−1} − x_{i−2}),

δ = x_0 (x_1 y_0 − x_0 y_0) / (x_1 (x_0 y_1 − x_1 y_0)).

Proof. Multiplying the last but one row by x_k/x_{k−1} and subtracting it from the last one,
and then multiplying the last but one column of the resulting determinant Δ_k' by
(x_k − x_{k−1})/(x_{k−1} − x_{k−2}) and subtracting it from the last one, we reduce Δ_k' step
by step. Since Δ_1' = μ_1 δ, thus Δ_k' = δ μ_1 ν_2 ⋯ ν_k.

The equality (7.3.7) will be proved inductively. Its validity for k = 1 can be checked
immediately. Suppose now that it is valid for the determinant Δ_{k−1}; the same reduction
then shows it for Δ_k. The lemma is proved.

Lemmas 7.3.1 – 7.3.3 will be used for investigating the case when k ≥ 2, α > 0, and the
vectors x and y consist of the numbers

x_i = r_{i,2}/r_{i,1},   y_j = r_{j,1}/r_{j,0},   (7.3.8)

where i, j = 0,1,...,k and the symbol r_{i,m} is determined by (7.3.4). In this case (7.1.14)
for i ≥ j defines the elements of the matrix D_k, which will be denoted by A or A_k. (The
matrix A is the same as in Proposition 7.1.2.)

Lemma 7.3.4. There holds for α > 0

λ'A^{−1}λ = (αΓ(k + 2)/r_{k,2} − 2/r_{0,2})/(α − 2)   for α ≠ 2,

λ'A^{−1}λ = Σ_{i=0}^{k} 1/(i + 1) = ψ(k + 2) − ψ(1)   for α = 2,   (7.3.9)

where ψ is the psi-function.

Proof. According to Lemma 7.3.1, we have λ'A^{−1}λ = Δ_k/|A|, where the first column of
the determinant Δ_k is z = λ = (1,...,1)'. By Lemma 7.3.2, we know that |A_k| = μ_k|A_{k−1}|.

Taking into account (7.3.8) and that z = λ, simplify the expressions for μ_i, φ_i, ν_i (i = 0,...,k)
and δ of Lemma 7.3.3:

μ_i = r_{i,2}/(α²(i + 1/α)² Γ(i + 1)),

φ_i = −1/(α(i + 1/α)),

ν_i = −Γ(i + 2/α)/(α²(i + 1/α)(i − 1 + 1/α)Γ(i)),

δ = α(1 + 1/α)/(1 + 2/α).

This way,

ν_i/μ_i = −i(i + 1/α)/((i + 2/α)(i − 1 + 1/α)),

(−1)^j φ_j Π_{i=2}^{j} (ν_i/μ_i) = r_{j,0} r_{1,2}/((α + 1) r_{j,2}).

Hence, (7.3.7) yields

λ'A^{−1}λ = Σ_{i=0}^{k} r_{i,0}/r_{i,2}.

Now, (7.3.9) can be inductively deduced from this relation: the lemma is proved.

Lemma 7.3.5. For b = (b_0,...,b_k)', where b_i = r_{i,1}/r_{i,0}, we have

λ'A^{−1}b = r_{k,1}/r_{k,2}.   (7.3.10)

Proof. By Lemma 7.3.1, λ'A^{−1}b = Δ_k/|A|, where the first column of Δ_k equals z = b. The
expressions for μ_i, ν_i, δ are as in the proof of Lemma 7.3.4, and φ_i is computed with z = b.

Hence (7.3.7) gives

λ'A^{−1}b = r_{0,1}/r_{0,2} − α^{−1} Σ_{i=1}^{k} Γ(i + 1/α)/r_{i,2}.

By induction, we obtain (7.3.10): the lemma is proved.
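Lemmas 7.3.4 and 7.3.5 admit a direct numerical check (an illustrative sketch; the solver and the parameter values are ours): build A from (7.3.8), solve the linear systems, and compare with the closed forms in (7.3.9) and (7.3.10).

```python
from math import gamma

def solve(M, v):
    """Solve M w = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    M = [row[:] + [v[i]] for i, row in enumerate(M)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (M[r][n] - sum(M[r][j] * w[j] for j in range(r + 1, n))) / M[r][r]
    return w

alpha, k = 3.0, 6
r = lambda i, j: gamma(i + 1 + j / alpha)          # abbreviation (7.3.4)
x = [r(i, 2) / r(i, 1) for i in range(k + 1)]
y = [r(j, 1) / r(j, 0) for j in range(k + 1)]
A = [[x[max(i, j)] * y[min(i, j)] for j in range(k + 1)] for i in range(k + 1)]
b = [r(i, 1) / r(i, 0) for i in range(k + 1)]

w = solve(A, [1.0] * (k + 1))                      # w = A^{-1} lambda
print(sum(w), (alpha * gamma(k + 2) / r(k, 2) - 2 / r(0, 2)) / (alpha - 2))  # (7.3.9)
wb = solve(A, b)                                   # A^{-1} b
print(sum(wb), r(k, 1) / r(k, 2))                  # (7.3.10)
```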


Lemma 7.3.6. There holds

(7.3.11)

Proof. Let us represent the vector b as b = Bλ, where B is the diagonal matrix with the
diagonal elements b_0,...,b_k. We have

b'A^{−1}b = λ'BA^{−1}Bλ = λ'(B^{−1}AB^{−1})^{−1}λ.

The matrix D_k = B^{−1}AB^{−1} is symmetric and its elements equal d_ij = x_i' y_j' for i ≥ j, where

x_i' = x_i/b_i = r_{i,2} r_{i,0}/r_{i,1}²,   y_j' = y_j/b_j = 1.

Analogously with the proof of Lemma 7.3.4, we obtain (7.3.11): the lemma is proved.


Let us now deduce Proposition 7.1.3 and Theorem 7.3.2 from the above lemmas.

Proof of Proposition 7.1.3. Lemma 7.3.2 gives |A_k| = μ_k |A_{k−1}|, where

μ_k = r_{k,2}/(α²(k + 1/α)² r_{k,0})   and   A_{k−1} = ||A_ij||_{i,j=0}^{k−1}.

An application of the Stirling approximation for k → ∞ then gives (7.1.24): the proposition is proved.

Proof of Theorem 7.3.2. Set

u_i = λ'A^{−1}e_i,

where all components of the i-th coordinate vector e_i are zero except the i-th, which equals 1.
We have a_i* = u_i/A for i = 0,1,...,k, where A = λ'A^{−1}λ is calculated by (7.3.9).
Applying Lemmas 7.3.1 – 7.3.3 with D_k = A and z = e_i, one obtains the expressions for
a_i*.

Turn now to the vector a+ and represent a_i+ in the form

(7.3.12)

where β_i are the components of the vector β = A^{−1}b. Since the last column of A is
proportional to b, all the components β_i equal zero, except β_k:

β_i = 0   for i = 0,...,k − 1.   (7.3.13)

The expressions for a_i+ follow from the expressions derived for a_i* and from (7.3.9) –
(7.3.12). The theorem is proved.

7.3.3 Proof of Theorem 7.3.1

Due to (7.1.19), (7.1.23) and (7.3.9), we have

E(M_{N,k}* − M)² ∼ σ̃²_{N,k}   for N → ∞, k → ∞, k²/N → 0,   (7.3.14)

where σ̃²_{N,k} is defined separately for α > 1, α ≠ 2 and for α = 2.

By the Stirling approximation we obtain

Γ(k + 2)/r_{k,2} ∼ k^{1−2/α}   for k → ∞.

Hence, for k → ∞,

σ̃²_{N,k} ∼ σ²_{N,k} = (M − θ_N)² (1 − α/2)/r_{0,2}   for 1 < α < 2,

σ̃²_{N,k} ∼ σ²_{N,k} = (M − θ_N)² (1 − 2/α) k^{−1+2/α}   for α > 2,

σ̃²_{N,k} ∼ σ²_{N,k} = (M − θ_N)² / log k   for α = 2,

that is, (7.3.1) and (7.3.2) hold for the estimate M_{N,k}* determined by the vector (7.1.21).
Turn now to the estimate M_{N,k}+ determined in (7.1.22).
By (7.1.22), a+ = (A^{−1}λ − p_k β)/q_k, where

q_k = λ'A^{−1}λ − (b'A^{−1}λ)²/(b'A^{−1}b),

p_k = b'A^{−1}λ/(b'A^{−1}b),   β = A^{−1}b.

Using (7.3.9), (7.3.10), (7.3.11) and the Stirling approximation, we obtain for k → ∞

λ'A^{−1}b = r_{k,1}/r_{k,2} ∼ k^{−1/α},   b'A^{−1}b ∼ 1 − r_{k,1}²/(r_{k,2}Γ(k + 2)) → 1.

With the help of (7.3.13), we get a* ∼ a+ for k → ∞: this yields (7.3.1) and (7.3.2) for the
estimate M_{N,k}+.
Let us now prove the asymptotic normality of the optimal linear estimates.
Set

D_{N,k} = (M − M_{N,k})/(M − θ_N).

Then, by (7.1.8), D_{N,k} is representable asymptotically as

D_{N,k} ∼ Σ_{i=0}^{k} a_i (ξ_0 + ⋯ + ξ_i)^{1/α},

where either a_i = a_i* or a_i = a_i+ are the coefficients of the vectors (7.1.21) and (7.1.22),
respectively; further, ξ_0, ξ_1, ... are independent and have the density e^{−x}, x > 0.
Set

ν_j = ξ_j − 1,   u_i = Σ_{j=0}^{i} ν_j   for i = 0, 1, ....

Then

Σ_{j=0}^{i} ξ_j = u_i + (i + 1),   (Σ_{j=0}^{i} ξ_j)^{1/α} = (i + 1)^{1/α} (1 + u_i/(i + 1))^{1/α}.

For each i, the random variable u_i/(i + 1) has the density

(i + 1)^{i+1} (x + 1)^{i} exp{−(i + 1)(x + 1)}/Γ(i + 1),   x ≥ −1,

with the maximal value attained near zero for i → ∞. Consequently,

lim_{i→∞} u_i/(i + 1) = 0   a.s.

and

(1 + u_i/(i + 1))^{1/α} = 1 + u_i/(α(i + 1)) + O((u_i/(i + 1))²)   for i → ∞ a.s.

This, together with a_i → 0 for any fixed i and k → ∞, yields

D_{N,k} ∼ Σ_{i=0}^{k} a_i (i + 1)^{1/α} + Σ_{i=0}^{k} a_i u_i/(α(i + 1)^{1−1/α})

for N → ∞, k → ∞, k/N → 0.


The optimal coefficients a_i satisfy the unbiasedness condition (7.1.20) either exactly
or asymptotically exactly, i.e.

Σ_{i=0}^{k} a_i r_{i,1}/r_{i,0} → 0   for k → ∞.

By the Stirling approximation, r_{i,1}/r_{i,0} ∼ i^{1/α} for i → ∞. Therefore, using once more the
relation a_i → 0 for k → ∞, we obtain

Σ_{i=0}^{k} a_i (i + 1)^{1/α} = Σ_{i=0}^{k} a_i r_{i,1}/r_{i,0} + Σ_{i=0}^{k} a_i ((i + 1)^{1/α} − r_{i,1}/r_{i,0}) → 0

for k → ∞. This gives, for k → ∞,

D_{N,k} ∼ α^{−1} Σ_{i=0}^{k} a_i u_i (i + 1)^{−1+1/α} = Σ_{i=0}^{k} ν_i s_i,

where

s_i = α^{−1} Σ_{j=i}^{k} a_j (j + 1)^{−1+1/α}.

Using the expression for a_i = a_i* (for a_i = a_i+ the expressions are asymptotically equivalent)
obtained in Theorem 7.3.2, we can derive the asymptotic forms of s_i (i = 0,1,...,k).
If α > 2, then analogous asymptotic expressions hold for s_i, i = 1, 2, ..., as k → ∞.

If α = 2 and k → ∞, then

s_0 ∼ 1/log k,   s_i ∼ (i^{−1/2} − k^{−1/2})/log k   for i = 1, 2, ....

Let us show that Lyapunov's condition

L_{k,2} → 0   (k → ∞)

for δ = 2 holds, where

L_{k,2} = B_k^{−4} Σ_{i=0}^{k} E|ν_i s_i|^4,   B_k² = Σ_{i=0}^{k} s_i².
We have

E ν_i^4 = ∫_{−1}^{∞} x^4 exp{−(x + 1)} dx = 9,

Σ_{i=0}^{k} E|ν_i s_i|^4 = Σ_{i=0}^{k} s_i^4 E ν_i^4 = 9 Σ_{i=0}^{k} s_i^4.

For α > 2 and k → ∞, we have

B_k² ∼ (1 − 2/α) k^{−1+2/α}.

For α = 2 and k → ∞ there holds

B_k² ∼ 1/log k.

Turn now to the asymptotic representation of

Σ_{i=0}^{k} s_i^4.

For α > 2, i > 0 and k → ∞, the asymptotic behaviour of s_i^4 can be derived from the above
expressions; summation then gives the following. For α = 3, k → ∞,

Σ_{i=0}^{k} s_i^4 ∼ (6 + 4/Γ(1 + 2/3)) k^{−4/3}.

For α = 4, k → ∞,

Σ_{i=0}^{k} s_i^4 ∼ (3/2) k^{−2} log k.

For α = 2, k → ∞,

Σ_{i=0}^{k} s_i^4 ∼ 8/log³ k.

Consequently, for k → ∞,

L_{k,2} ∼ 1/log k   for α = 2,

L_{k,2} ∼ k^{−2/3}   for α = 3,

L_{k,2} ∼ (log k)/k   for α = 4,

L_{k,2} ∼ k^{−1}   for α > 2, α ≠ 3, α ≠ 4.

Thus, in all cases L_{k,2} → 0 for k → ∞.

This yields the desired result: the theorem is proved.

Let us remark that there is another method of proving the asymptotic normality
(7.3.3), i.e. the second part of Theorem 7.3.1. This method consists in referring to the
asymptotic normality result of Csorgo and Mason (1989) and in noting that the optimal
linear estimators M_{N,k}* and M_{N,k}+ asymptotically, as k → ∞, coincide with the Csorgo–
Mason estimators, which are determined by the coefficient vectors a with components
(7.1.27).
Indeed, the asymptotic coincidence of the estimators M_{N,k}* and M_{N,k}+ was
established in the proof of Theorem 7.3.1. Further, applying (7.1.27) and Theorem
7.3.2, which contain explicit expressions for the components a_i and a_i* of the Csorgo–
Mason and optimal linear estimators, respectively, we obtain for k → ∞, α > 2:

a_0* ∼ (α + 1)(α − 2) k^{2/α−1}/(αΓ(1 + 2/α)),

a_k* → 2 − α,
a_i ∼ a_i* ∼ (α − 1)(1 − 2/α) i^{−2/α} k^{2/α−1}   for i < k, i → ∞,

a_i/a_i* ∼ 1 + (2/α)[1 − i log(1 + 1/i) + ψ(i + 1) − log(i + 1)]   for i ≥ 0, i = const, k → ∞,

where ψ is the psi-function and, by convention, 0·log(1 + 1/0) = 1. Analogously, for α = 2, k → ∞, we have

a_0* ∼ 3/log k,

a_k − a_k* ∼ −2/log k,

a_i − a_i* ∼ 1/(i log k)   for i < k, i → ∞,

a_i/a_i* ∼ (i + 1) log(1 + 1/i)   for i > 0, i = const.


For 1 < α < 2 the Csorgo–Mason estimator is not defined, and the asymptotic (as k → ∞)
expressions for a_i* are as follows:

a_0* ∼ (α + 1)(1 − α/2),   a_k* ∼ −α k^{1−2/α} (1 − α/2) Γ(1 + 2/α),

a_i* ∼ (α − 1)(1 − α/2) i^{−2/α} Γ(1 + 2/α)   for i < k, i → ∞.

Bearing in mind the asymptotic expressions presented above, let us introduce a new, very
simple, linear estimator, determining its coefficients by the formulas: for 1 < α < 2

a_i = (α − 1)(1 − α/2) i^{−2/α} Γ(1 + 2/α),   0 < i < k;

for α > 2

a_i = (α − 1)(1 − 2/α) i^{−2/α} k^{2/α−1},   0 < i < k,

a_k = 2 − α.

For α = 2 the expressions for a_i* are already so simple that further simplification is hardly
possible. Thus, we set a_i = a_i* for α = 2, 0 ≤ i ≤ k.
The discussion concerning the Csorgo–Mason and optimal linear estimators leads to
the conjecture that the linear estimator defined above has the asymptotic properties (7.3.1)
and (7.3.3), i.e. it is also asymptotically normal and efficient.
CHAPTER 8 SEVERAL PROBLEMS CONNECTED WITH GLOBAL RANDOM SEARCH

As was pointed out in Sections 2.3 and 4.1, the theory of global random search is
connected with many important branches of mathematical statistics and computational
mathematics. Two of them were already studied in Section 5.3 and Chapter 7; three others
will be treated in this chapter (note that their connection with global random search was
highlighted in Section 4.1).

8.1 Optimal design in extremal experiments


We shall consider a particular class of local optimization problems related to a regression
function as a class of extremal experiment problems: its features will be discussed in
Subsection 8.1.1.
The standard rule in extremal experiment algorithms involves the least square
estimation of the gradient of an objective function at the current point and moving in the
direction of that estimate. The main purpose of this section is to show that one can
construct much simpler algorithms by replacing the customary optimality property with
another one.

8.1.1 Extremal experiment design


The search for the local extremum of a regression function is often treated as an extremal
experiment problem; naturally the search for the global extremum belongs to the field of
global optimization.
The cost (or time-consumption) of the objective function evaluation greatly influences
the selection of the extremal experiment algorithm. If the cost is high and the number N of
function evaluations can not be large, then a passive grid algorithm may be expedient. If
the cost is low and N can be rather high, then adaptive algorithms of the stochastic
approximation type are usually applied. All of them are representable by the following
general recurrent relation

x_{k+1} = x_k − γ_k s_k,   (8.1.1)

where k = 1,2,... is the iteration number, x_1 is a given initial point, x_1, x_2, ... is the
sequence of points in X ⊂ R^n generated to approach a local minimizer of f, γ_1, γ_2, ... are
nonnegative numbers called the step lengths, and s_1, s_2, ... are random vectors in R^n called the
search directions. To construct each subsequent point in (8.1.1), one uses the results of
evaluations of the regression function f (i.e. the random values y(x) = f(x) + ξ(x), where
Eξ(x) = 0) at the preceding points of the sequence and possibly also at some auxiliary
points. As usual, we shall suppose that different evaluations of f produce independent
random values.
The majority of works devoted to the local optimization of a regression function
studies the asymptotic characteristics of algorithms of the type (8.1.1).

Problems with Global Random Search 285

The extremal experiment design deals with the one-step characteristics of the algorithms, rather than
with their asymptotic properties.
The extremal experiment algorithms (the simplex method of Nelder and Mead (1965)
and steepest descent are probably the most popular of them) have been developed and
applied for optimization of real objects (even in the absence of computers) and thus have a
number of properties that distinguish them from the stochastic approximation type
adaptive algorithms. The peculiarities of most extremal experiment algorithms are due to
the inclusion of the following elements into each of their steps: (i) statistical inference
(usually linear regression analysis for constructing and investigating a local first or
second degree polynomial model of f), (ii) specific experimental design for selecting
auxiliary points at which to evaluate f (the design criteria are chosen among the following:
symmetricity, orthogonality, saturation, rotatability, simplicity of construction, optimality
in some appropriate sense, etc.), (iii) selection of the search direction in accordance with
the regression models constructed (the least square estimate of the gradient of f at x_k is
customarily used as the search direction), (iv) selection of the step length at random via
evaluating f at several auxiliary points along the chosen direction.
As for the step length selection rules, they are thoroughly discussed in many works,
see for instance Ermakov et al. (1983). The procedure of Myers and Khuri (1979) and its
modification studied in Zhigljavsky (1985) seem to be the most promising and
recommendable. Below we shall deal only with the problem of constructing the search direction.

8.1.2 Optimal selection of the search direction

Let f be a regression function, y(x) = f(x) + ξ(x) be the (partially) random result of
evaluating f at a point x ∈ X ⊂ R^n, Eξ(x) = 0 for each x ∈ X, Eξ²(x) = σ² = const, let the results
corresponding to various evaluations of f be mutually independent, x_k be a fixed point
obtained at the k-th iteration, and z_1,...,z_N be the points at which f is evaluated at the k-th
iteration, in order to determine the search direction s = s_k. Without loss of generality,
assume that x_k = 0. We shall introduce the following two suppositions that are standard in
the construction of extremal experiment algorithms. First, the points z_1,...,z_N are selected
in a sufficiently small neighbourhood U of x_k = 0, in which f is approximately linear, i.e.

f(z_j) = θ_0 + Σ_{i=1}^{n} θ_i z_ij   (8.1.2)

for j = 1,...,N, where z_1j,...,z_nj are the coordinates of the point z_j. (Note that interesting
results concerning the problem of choosing x_{k+1} and applying the second order polynomial
model of f are derived by Mandal (1981, 1989).) Second, the k-th step design, i.e. the
point selection {z_1,...,z_N}, is symmetrical, that is, the equality

Σ_{j=1}^{N} z_ij = 0   (8.1.3)

holds for i = 1,...,n.


The (approximate) gradient of f at x_k = 0 is the vector

θ = (θ_1,...,θ_n)'

of unknown parameters. If N > n and rank Z = n, then θ can be estimated by the least square
estimator

θ̂ = (ZZ')^{−1} Z Y.   (8.1.4)

Here

Z = ||z_1, ..., z_N||

is the n×N design matrix and Y = (y_1,...,y_N)' is the vector of evaluation results, in accordance
with (8.1.2) consisting of the elements

y_j = θ_0 + Σ_{i=1}^{n} θ_i z_ij + ξ_j,   j = 1,...,N,   (8.1.5)

where ξ_1,...,ξ_N are mutually independent, E ξ_j = 0, E ξ_j² = σ².
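A minimal numerical illustration of (8.1.2) – (8.1.5) (a sketch; the objective, the design size and the noise level are invented for the example): with a symmetric design around x_k = 0, the least square estimator (8.1.4) recovers the gradient, the even quadratic terms cancelling out.

```python
import random

rng = random.Random(0)

def f(z):
    """Hypothetical objective; its gradient at 0 is (2.0, -1.0)."""
    return 5.0 + 2.0 * z[0] - 1.0 * z[1] + 0.7 * z[0] ** 2 + 0.3 * z[0] * z[1]

h = 0.05
Z = [(h, 0.0), (-h, 0.0), (0.0, h), (0.0, -h)]   # symmetric design: (8.1.3) holds
Y = [f(z) + rng.gauss(0.0, 1e-3) for z in Z]     # noisy evaluations y_j

# (8.1.4) with Z Z' = 2 h^2 I for this design: theta_hat = (Z Z')^{-1} Z Y
theta_hat = [sum(z[i] * yj for z, yj in zip(Z, Y)) / (2 * h * h) for i in range(2)]
print(theta_hat)   # close to (2.0, -1.0)
```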


The selection of (8.1.4) as the search direction is typical in extremal experiment
algorithms (e.g. for steepest descent). The popularity of the choice of an unbiased estimate
of θ = ∇f(x_k) as the search direction s = s_k is motivated by the fact that the average
decrease rate of f in the direction −s is locally maximal. But this is not the unique sensible
optimality criterion for choosing s: let us consider another one below.
Every search direction s is constructed using the values of some random variables;
thus the function f may locally increase along the selected direction rather than decrease.
Since increasing function values in the direction s are undesirable, the probability
of decrease of f is a characteristic of obvious importance and can be selected as the
optimality criterion in choosing s.
The fact that a function f decreases in the direction −s can be written in the form

θ's > 0.

Thus, the probability of decrease of f in the direction −s equals

Pr{θ's > 0}.   (8.1.6)

We shall maximize (8.1.6) with respect to s in the class of linear statistics of the form

s = s(A) = AY,   A ∈ 𝒜,   (8.1.7)

where 𝒜 is the set of matrices of the order n×N and Y consists of the evaluation results
(8.1.5), thus satisfying the conditions

EY = Z'θ,   cov Y = σ² I_N.   (8.1.8)

Set

t(A) = t(A,θ) = θ's(A),   η(A) = η(A,θ) = (t(A) − E t(A))(var t(A))^{−1/2}.

With these notations, the probability (8.1.6) can be rewritten as

Pr{t(A) > 0} = Pr{η(A) > −(var t(A))^{−1/2} E t(A)}.   (8.1.9)

Under fixed A ∈ 𝒜 and θ ∈ R^n, the random variable η(A,θ) is a linear combination of the
random variables ξ_1,...,ξ_N, having zero mean and unit variance. If the ξ_j are Gaussian, then
η(A,θ) follows a Gaussian distribution as well, and the probability (8.1.9) is completely
determined by the magnitude

κ(A) = κ(A,θ) = (var t(A))^{−1/2} E t(A),   (8.1.10)

which is to be maximized with respect to A ∈ 𝒜. If the ξ_j are not necessarily Gaussian but N
is large, then by virtue of the central limit theorem η(A,θ) is approximately Gaussian, i.e.
the probability (8.1.9) can again be characterized by κ(A). In the general case the
distribution of the random variable η(A,θ) depends on A and θ, and thus the quantity κ(A)
determines the probability (8.1.9) within some accuracy, rather than exactly. As it is usual
to say in similar situations, the value

A* = arg max_{A∈𝒜} κ(A)   (8.1.11)

gives a quasi-optimal solution of the initial problem of maximizing (8.1.6).


It is usual when solving extremal problems with an objective function depending on
parameters ( S in the present case) that the optimizer also depends on these parameters.
The important specialty of the investigated extremal problem is that an optimal matrix A *
exists, the same for all 8:;tO.
Theorem 8.1.1. For each θ ≠ 0, the equality

max_{A∈𝒜} κ(A) = σ^{−1} (θ'ZZ'θ)^{1/2}   (8.1.12)

holds, and the maximum is attained at the matrix A* = Z.

Proof. We have

E t(A) = θ'AZ'θ,

var t(A) = var(θ'AY) = θ'A (cov Y) A'θ = σ² θ'AA'θ.

Denote by τ(A) the functional

τ(A) = σ² κ²(A) = (θ'AZ'θ)²/(θ'AA'θ)

and by E an arbitrary matrix in 𝒜. Bearing in mind the identity

θ'EA'θ = θ'AE'θ,

let us compute the derivative of τ at the point (i.e. matrix) A in the direction E:

∂τ(A)/∂E = lim_{a→0} [τ(A + aE) − τ(A)]/a =

= 2 θ'AZ'θ · θ'E (Z'θ · θ'AA'θ − A'θ · θ'AZ'θ) / (θ'AA'θ)².

Let us now characterize each matrix A for which the derivative equals zero, for all E ∈ 𝒜.
The equality ∂τ(A)/∂E = 0 for all E ∈ 𝒜 is equivalent to the statement that either

θ'AZ'θ = 0   (8.1.13)

or

Z'θ · θ'AA'θ = A'θ · θ'AZ'θ   (8.1.14)

takes place. If (8.1.13) holds, then τ(A) = 0 and thus τ is minimized: therefore we
shall assume that (8.1.13) does not hold, i.e. θ'AZ'θ ≠ 0. The equality (8.1.14) is
equivalent to A'θ = cZ'θ, where

c = θ'AA'θ / θ'AZ'θ

is a positive number. Substitute cZ'θ for A'θ in the expression for τ(A): this way, one
obtains that if (8.1.14) is valid for some A ∈ 𝒜 and θ ≠ 0, then τ(A) = θ'ZZ'θ, the maximal value
of τ under a fixed θ. This value is obviously attained for A = Z. The theorem is proved.

The theorem implies that the optimal search direction, in the above sense, is given by

s* = ZY.   (8.1.15)

The advantages of s* compared with s = θ̂, i.e. the least square estimate (8.1.4) of θ, are
two-fold: it is much simpler to compute and can be used also for N ≤ n; moreover, the
search direction (8.1.15) is only slightly sensitive to violations of the validity of the linear
regression model (8.1.5). In fact, (8.1.15) is nothing but a cubature formula for
estimating the vector

∫_U z f(z) dz,

which, suitably normalized, converges to θ = ∇f(x_k) when the size of U asymptotically decreases.
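The probability (8.1.6) for the direction s* = ZY can be examined by simulation (a sketch; the gradient, the design and the noise level below are our own choices): even with substantial noise, θ's* > 0 in a large fraction of trials, while s* requires no matrix inversion and remains defined for N ≤ n.

```python
import random

rng = random.Random(1)
n, N, sigma = 5, 8, 3.0
theta = [1.0, -2.0, 0.5, 0.0, 1.5]    # hypothetical gradient at x_k = 0

# random points in a neighbourhood of 0, symmetrized so that (8.1.3) holds
half = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(N // 2)]
Z = half + [[-c for c in z] for z in half]

def descent_prob(trials=2000):
    """Fraction of simulations with theta' s* > 0, where s* = Z Y."""
    hits = 0
    for _ in range(trials):
        Y = [5.0 + sum(t * c for t, c in zip(theta, z)) + rng.gauss(0.0, sigma)
             for z in Z]
        s = [sum(z[i] * yj for z, yj in zip(Z, Y)) for i in range(n)]
        hits += sum(t * si for t, si in zip(theta, s)) > 0
    return hits / trials

print(descent_prob())   # noticeably above 1/2
```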

8.1.3 Experimental design applying the search direction (8.1.15)

Theorem 8.1.1 yields that if the search direction s = s_k is chosen at each step of the
algorithm (8.1.1) by (8.1.15), then the experimental design (i.e. the selection rule for the
points z_1,...,z_N in U) is to be selected so that the values θ'ZZ'θ are as large as possible,
for various θ. Let us formulate the design problem in its standard form for regression
design theory (see e.g. Ermakov and Zhigljavsky (1987)).
Let ε be an arbitrary approximate design, i.e. a probability measure on U, and let

M(ε) = ∫_U zz' ε(dz)   (8.1.16)

be the information matrix of the design ε. For the discrete design ε_N assigning the weight
1/N to each of the points z_1,...,z_N corresponding to evaluations of f, the matrix (8.1.16) equals

M(ε_N) = N^{−1} ZZ' = N^{−1} Σ_{j=1}^{N} z_j z_j'.

Considering the design problem in the set of all approximate designs, we have the class of
criteria

Φ_θ(ε) = θ'M(ε)θ

depending on the unknown parameters θ. Since the true value of θ is unknown, the true
optimality criterion Φ_θ is unknown, too. In a typical way, let us define the Bayesian
criterion

Φ_B(ε) = ∫_Ω θ'M(ε)θ ν(dθ) = tr M(ε) ∫_Ω θθ' ν(dθ)   (8.1.17)

and the maximum criterion

Φ_M(ε) = min_{θ∈Ω} θ'M(ε)θ,   (8.1.18)

where Ω ⊂ R^n is the set containing the unknown parameters θ with probability one and
ν(dθ) is a probability measure on Ω reflecting our prior knowledge about θ.
Since the norm ||θ|| of the gradient θ = ∇f(x_k) is of no interest, we may assume that
||θ|| = 1, i.e. that θ belongs to the unit sphere S in R^n.

If prior information about θ is essentially not available, then it is natural to take S as Ω and
the uniform probability measure on Ω as ν(dθ). Now, if Ω = S, then by virtue of the
extremal properties of matrix eigenvalues, (8.1.18) is nothing but the minimal eigenvalue of
M(ε): this way, Φ_M is the well-known E-optimality criterion in regression design
theory. Thus, the optimal design problem for the maximum criterion (8.1.18) is reduced
to that of classical regression design theory.
Let us turn now to the Bayesian criterion (8.1.17). Denoting

L = ∫_Ω θθ' ν(dθ),

one can easily derive (see Zhigljavsky (1985)) from the equivalence theorems of
regression design theory that the set of designs optimal with respect to the Bayesian
criterion (8.1.17), i.e.

{ε* = arg max_ε tr L M(ε)},

coincides with the set of all probability measures concentrated on the set

{arg max_{z∈U} z'Lz}.

Thus, the indicated design problem either is very easy or can be reduced to a standard
problem.
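The conclusion about the Bayesian criterion can be seen in a toy computation (ours; the matrix L and the candidate points are invented): since tr L M(ε) = ∫ z'Lz ε(dz), concentrating the design on the maximizers of z'Lz maximizes the criterion.

```python
# L = E[theta theta'] for a hypothetical prior; candidate design points in U
L = [[2.0, 0.3], [0.3, 1.0]]
U = [(1.0, 0.0), (0.0, 1.0), (0.8, 0.6), (-1.0, 0.2)]

def quad(z):
    """z' L z."""
    return sum(L[i][j] * z[i] * z[j] for i in range(2) for j in range(2))

def trLM(design):
    """tr L M(eps) for a discrete design {(z, weight)}: equals sum of w * z'Lz."""
    return sum(w * quad(z) for z, w in design)

best = max(U, key=quad)
uniform = [(z, 1.0 / len(U)) for z in U]
optimal = [(best, 1.0)]
print(trLM(optimal) >= trLM(uniform))   # True: mass on argmax z'Lz is optimal
```

Any probability mixture over the maximizing points does equally well, which is exactly the structure stated above.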

8.2 Optimal simultaneous estimation of several integrals by the Monte Carlo method
A number of Monte Carlo and experimental design problems can be reduced to the
optimization of a convex functional of a matrix having the probability density in the
denominator of each of its elements. The existence and uniqueness of optimal densities for a large
class of functionals are studied. Necessary and sufficient conditions of optimality are
given. Algorithms for constructing optimal densities are suggested, and the structure of these
densities is investigated. The exposition of the section follows Zhigljavsky (1988) and,
partly, Mikhailov and Zhigljavsky (1988).

8.2.1 Problem statement


A large number of Monte Carlo and experimental design problems consist in choosing a
probability distribution P, given on an arbitrary measurable space (X, B), with the density
p = dP/dν with respect to a σ-finite measure ν on (X, B), sampling this distribution, and
estimating some linear functionals (integrals). The efficiency of such algorithms depends
to a great extent on the choice of the density p and is defined as a functional of the
estimator covariance matrix, with p in the denominator of its elements. These algorithms
are considered in Zhigljavsky (1988) and some of them are presented below.
The optimality problem arising in the mentioned algorithms has the general form

p* = arg min_{p∈P} Φ(D(p)).   (8.2.1)

Here the matrix

D(p) = ∫ g(x)g'(x)/p(x) ν(dx) − A   (8.2.2)

has order m×m and in most cases is proportional to the covariance matrix of the estimator,
g(x) = (g_1(x),...,g_m(x))' is a vector of linearly independent piecewise continuous functions
from L²(X,ν), P is the set of densities p such that ||D(p)|| < ∞, A is a matrix in the set N
of positive semi-definite matrices of order m×m such that D(p) ∈ N for all p ∈ P (in
some cases A is the zero matrix), and Φ: N → R is a continuous convex functional.
The most well-known problem leading to (8.2.1) is the Monte Carlo simultaneous
estimation of several integrals and is formulated as follows.
Let m integrals

I_k = ∫_X g_k(x) ν(dx),   k = 1,...,m,   (8.2.3)

be estimated. The ordinary Monte Carlo estimates of the integrals (8.2.3) are constructed
in the following way. First, a probability distribution P(dx) = p(x)ν(dx) on the measurable
space (X,B) is chosen; here the density p = dP/dν is positive modulo ν on the set X*
defined below.

Then N independent elements x_1,...,x_N are generated from the distribution P. Finally, the
integrals (8.2.3) are estimated by the formulas

Î_k = N^{−1} Σ_{j=1}^{N} g_k(x_j)/p(x_j),   k = 1,...,m.   (8.2.4)

Set

X* = {x ∈ X: g_1²(x) + ⋯ + g_m²(x) > 0}

and prove the following auxiliary assertion.

Proposition 8.2.1. If the density p = dP/dν is positive modulo ν on the set X*, then the
estimators (8.2.4) are unbiased

(i.e. EÎ = I)

and their covariance matrix equals

cov Î = N^{−1} D(p),

where the matrix D(p) is defined by formula (8.2.2) with A = II'.

Proof. The unbiasedness of the estimators (8.2.4) is an evident result, well known in
Monte Carlo theory. For the variances and covariances we have the relations

cov(Î_k, Î_l) = N^{−1} (∫ g_k(x)g_l(x)/p(x) ν(dx) − I_k I_l),   k, l = 1,...,m.

The formulas above imply the desired form of cov Î:
the proof is complete.
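Proposition 8.2.1 is easy to watch in simulation (a sketch; the functions g_k, the density p and the sample size are our own choices): one sample serves all m estimators, and a density proportional to one of the g_k makes the corresponding estimator exact.

```python
import random

rng = random.Random(2)

# Estimate I_1 = int_0^1 x dx = 1/2 and I_2 = int_0^1 x^2 dx = 1/3 simultaneously
# by (8.2.4), using the density p(x) = 2x on [0, 1] (chosen for illustration).
def p(x):
    return 2.0 * x

def sample():
    return rng.random() ** 0.5      # inverse transform for p(x) = 2x

def estimate(N):
    xs = [sample() for _ in range(N)]
    e1 = sum(x / p(x) for x in xs) / N        # g_1(x) = x
    e2 = sum(x * x / p(x) for x in xs) / N    # g_2(x) = x^2
    return e1, e2

I1, I2 = estimate(100_000)
print(I1, I2)
```

Here p is proportional to g_1, so Î_1 has zero variance, while Î_2 fluctuates with variance of order N^{−1}: the optimality problem (8.2.1) is precisely about balancing such effects across the m integrals.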
It follows from the proposition that the matrix (8.2.2) with A = II' represents the
quality of the Monte Carlo estimators (8.2.4) as a function of the density p. The problem of
optimal density selection was earlier investigated in the above framework for two
optimality criteria depending only on the diagonal elements of the matrix D(p). Evans (1963)
solved the problem for the criterion

Φ(B) = Σ_{i=1}^{m} a_i b_ii,   (8.2.5)

where b_ii are the diagonal elements of a matrix B ∈ N and a_1,...,a_m are fixed nonnegative
numbers: this problem is rather simple. Mikhailov (1984, 1987) solved the extremal
problem (8.2.1) for the MV-optimality criterion

Φ(B) = max_{1≤i≤m} b_ii.   (8.2.6)

Section 8.2.6 describes Mikhailov's results.


Zhigljavsky (1985) studied the problem for general optimality criteria, but he has not
exhaustively investigated the existence of optimal densities. Besides, here we study the
minimax type optimality criteria more thoroughly than earlier.
Another important problem related to global optimization consists in estimating the Fourier coefficients of a regression function from its observations at random points; it is described as follows.

Let f be a regression function on X with uncorrelated observations

    y(x_j) = f(x_j) + ξ(x_j),    E ξ(x) = 0,

at N points x_j, j = 1,...,N, and let the Fourier coefficients

    θ_k = ∫ f(x) f_k(x) ν(dx),    k = 1,...,m,

of the function f with respect to a set of functions {f_1,...,f_m} be estimated.

Suppose now that the points x_1,...,x_N are randomly and independently chosen, having the same distribution P(dx) = p(x)ν(dx), where the density p = dP/dν is positive on the set X*.
If one uses the Monte Carlo estimators

    θ̂_k = N⁻¹ Σ_{j=1}^N y(x_j) f_k(x_j)/p(x_j),    k = 1,...,m,    (8.2.7)
Problems with Global Random Search 295
for θ_k, then one can prove (analogously to Proposition 8.2.1) that these estimators are unbiased and, further, that their covariance matrix equals

    cov θ̂ = N⁻¹ D(p),

where D(p) is determined by (8.2.2) with A = θθ′ and

    g_k(x) = (f²(x) + σ²(x))^{1/2} f_k(x),    k = 1,...,m.
Therefore the optimal density choice problem is a particular case of the problem (8.2.1).
Let r(x) = f(x) − (θ_1 f_1(x) + ... + θ_m f_m(x)). If the least square estimates are used instead of (8.2.7), then their covariance matrix is represented as

    cov θ̂ = N⁻¹ D(p) + o(N⁻¹),

where A = 0 and

    g_k(x) = (r²(x) + σ²(x))^{1/2} f_k(x),    k = 1,...,m

(see Theorem 8.3.4 below). In this way, we arrive at the problem (8.2.1) again.
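As a numerical illustration of the estimators (8.2.7), the following sketch (the regression function, the orthonormal cosine basis, the noise level and the uniform design density p ≡ 1 are all our own illustrative assumptions) checks their unbiasedness empirically in the presence of observation noise.

```python
import numpy as np

rng = np.random.default_rng(1)

# Noisy regression f on [0, 1], observed at random points x_j ~ p; Monte Carlo
# estimators (8.2.7) of the Fourier coefficients theta_k = integral f f_k dnu
# for the orthonormal system {1, sqrt(2) cos(2 pi x)}.
f_true = lambda x: 1.0 + 0.5 * np.sqrt(2) * np.cos(2 * np.pi * x)
basis = [lambda x: np.ones_like(x),
         lambda x: np.sqrt(2) * np.cos(2 * np.pi * x)]
theta = np.array([1.0, 0.5])
sigma = 0.3

N, reps = 500, 1000
est = np.empty((reps, 2))
for r in range(reps):
    x = rng.random(N)                       # design density p = 1 (uniform)
    y = f_true(x) + sigma * rng.standard_normal(N)
    est[r] = [np.mean(y * fk(x)) for fk in basis]   # division by p omitted since p = 1

print(np.round(est.mean(axis=0), 3))   # close to theta: the estimators are unbiased
```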
The density p is interpreted as the experimental design; the problem of its optimization
is similar to the approximate optimal design problem in classical regression design theory.
The main difference between these problems lies in the following: in the classical theory the experimental design stands in the numerators of the matrix elements and a convex functional of this matrix is minimized, while the design in the extremal problem (8.2.1) stands in the denominators of the elements. From the theoretical point of view, the problem (8.2.1) is a little more complicated than the regression optimal design problem; the main additional complexity concerns the existence of the optimal design.
The task of selecting the optimality criterion Φ is analogous to the corresponding one in classical regression design theory. The main difference between them lies in the imposition of stronger conditions on the optimality criterion in the problem (8.2.1) than the convexity and monotonicity required in regression design theory. Subsection 8.2.2 will describe these conditions.
Subsection 8.2.3 covers the existence and uniqueness problems: their solution is
based on general convex analysis.
Subsection 8.2.4 presents the necessary and sufficient conditions for densities to be optimal. The basis for the results of this subsection is the equivalence theory for optimal regression design developed by J. Kiefer, J. Wolfowitz, V. Fedorov and others; its statements are also analogous to the equivalence theorems.

Subsection 8.2.5 describes algorithms for constructing optimal densities and the structure of optimal densities. Nondifferentiable MV- and E-optimality criteria are treated in Subsection 8.2.6.

Subsection 8.2.7 highlights the differences and similarities between classical regression design theory and the results presented here.

8.2.2 Assumptions

Below we shall suppose that the functional Φ: N → ℝ is nonnegative, continuous, convex and increasing. The increase of Φ is defined as follows: if B, C ∈ N and B > C (or B ≥ C), then Φ(B) > Φ(C) or, respectively, Φ(B) ≥ Φ(C).

Let N̄ be the closure of the set N − A = {C − A: C ∈ N}, containing some matrices with infinite elements. Extend the functional Φ from N onto N̄, preserving its continuity and convexity. Suppose that this extension Φ: N̄ → ℝ ∪ {+∞} has the property

a) Φ(B) < ∞ for B ∈ N̄ if and only if all elements of B are finite.

We also suppose that the following simple condition is satisfied:

b) there exists a density p in 𝒫 for which the matrix D(p) consists of finite elements only.

The above suppositions are required to hold everywhere below. Many widely used criteria satisfy them, for instance, the linear criterion

    Φ(B) = tr LB,    L ∈ N,    (8.2.8)

its special case (8.2.5), the MV-criterion (8.2.6), the E-criterion

    Φ(B) = λ_max(B),    (8.2.9)

where λ_max(B) is the maximal eigenvalue of the matrix B, and the so-called Φ_r-criterion

    Φ(B) = (m⁻¹ tr Bʳ)^{1/r}    (8.2.10)

for 1 ≤ r < ∞. If r = 1, then (8.2.10) coincides, up to the constant multiplier m⁻¹, with (8.2.8) for L = I_m. If r → ∞, then (8.2.10) converges to (8.2.9). On the contrary, if −∞ < r < 1, then according to Pukelsheim (1987) the criterion (8.2.10) is increasing and concave, which makes this case unsuitable here. The same is true for the well-known D-criterion

    Φ(B) = (det B)^{1/m},    (8.2.11)

which can be regarded as the limit of (8.2.10) as r → 0.


Sometimes we shall need the condition of strict convexity of the functional Φ: N → ℝ and also its differentiability. The latter condition means the existence of the matrix

    Φ°(B) = ∂Φ(B)/∂B

consisting of the partial derivatives of Φ(B) with respect to the elements b_ij of the matrix B ∈ N.
All the above suppositions concern the functional Φ. Besides them, we need two further assumptions concerning the functions g_1,...,g_m. It will be supposed throughout that they are piecewise continuous, linearly independent functions from L_2(X,ν). Sometimes we shall also use the following condition of their ν-regularity:

c) the functions g_1,...,g_m are linearly independent on any measurable subset Z of the set X with ν(Z) > 0.

In a slightly different form the ν-regularity condition c) was used by Ermakov (1975) when investigating random quadrature formulas.

Note also that we do not require anything more of the set X than its measurability.

8.2.3 Existence and uniqueness of optimal densities

Set p₊(x) = max{0, p(x)} for all p from L_1(X,ν), and let Q be the set of those p ∈ L_1(X,ν) for which all elements of the matrix D(p₊) are finite. It is evident that the set Q is convex. Define the functional

    φ: L_1(X,ν) → ℝ ∪ {+∞}

by φ(p) = Φ(D(p₊)); we shall first prove that φ is convex.

Proposition 8.2.2. For all p, q from Q and any 0 < t < 1 we have

    t D(p₊) + (1 − t) D(q₊) ≥ D((t p + (1 − t) q)₊) + D(t,p,q),    (8.2.12)

where the matrix D(t,p,q) is defined by the formula

    D(t,p,q) = ∫ u(x) g(x) g′(x) ν(dx),    u(x) = t(1−t)(p₊(x) − q₊(x))² / [p₊(x) q₊(x)(t p₊(x) + (1−t) q₊(x))],    (8.2.13)

and the matrix D(t,p,q) is positive semi-definite.

The proof is based on elementary algebraic transformations; see Lemma 8.1.2 in Zhigljavsky (1985).

Proposition 8.2.3. If condition c) holds, then the matrix (8.2.13) is positive definite for any t from (0,1) and all p, q from Q with p ≠ q (modulo ν).

Proof. By virtue of Proposition 8.2.2, the matrix (8.2.13) can be written as

    D(t,p,q) = ∫_Z u(x) g(x) g′(x) ν(dx),    (8.2.14)

where Z = {x ∈ X: u(x) > 0}. Since p ≠ q (modulo ν) and t ∈ (0,1), we have ν(Z) > 0. Using the supposition c), we obtain that the functions g_1,...,g_m are linearly independent on Z. Evidently, the functions

    √u(x) g_i(x),    i = 1,...,m,

have the same property. The positive definiteness of the matrix (8.2.14) now follows directly from the definition of positive definiteness: the proof is completed.

Proposition 8.2.4. If the functional Φ is convex, increasing and the conditions a), b) are fulfilled, then the functional φ is convex on L_1(X,ν). Moreover, if either the functional Φ is strictly convex on N or the condition c) holds, then φ is a strictly convex functional on 𝒫.

Proof. Let p, q ∈ L_1(X,ν), p ≠ q (modulo ν), and t ∈ (0,1). The convexity of φ means that the inequality

    t φ(p) + (1 − t) φ(q) ≥ φ(t p + (1 − t) q)    (8.2.15)

holds. If either p ∉ Q or q ∉ Q, then the inequality (8.2.15) is certainly fulfilled; therefore we may suppose that p and q belong to Q.

By virtue of Proposition 8.2.2, we have

    t D(p₊) + (1 − t) D(q₊) ≥ D((t p + (1 − t) q)₊).    (8.2.16)

This and the increase of Φ yield the inequality

    Φ(t D(p₊) + (1 − t) D(q₊)) ≥ φ(t p + (1 − t) q).    (8.2.17)

The convexity of Φ implies that

    t φ(p) + (1 − t) φ(q) ≥ Φ(t D(p₊) + (1 − t) D(q₊)).    (8.2.18)

Coupling the inequalities (8.2.17) and (8.2.18), we obtain (8.2.15), i.e. the convexity of φ.

The strict convexity of φ on 𝒫 is equivalent to the strict inequality in (8.2.15) for p and q from 𝒫. This inequality follows from the strict inequality in either (8.2.17) or (8.2.18), or both. The strict inequality in (8.2.18) follows from the strict convexity of Φ. On the other hand, if the condition c) holds then, by virtue of Proposition 8.2.3, the strict inequality in (8.2.16) takes place; hence, by virtue of the increase of Φ, the strict inequality in (8.2.17) is valid. The proof is completed.
Now we are able to investigate the existence and uniqueness of the optimal density.

Theorem 8.2.1. If the functional Φ is continuous, convex, increasing and the conditions a), b) are fulfilled, then the optimal density p* in 𝒫 exists.

Proof. The set

    S = {p ∈ L_1(X,ν): p ≥ 0, ∫ p(x) ν(dx) ≤ 1}

contains 𝒫, is bounded in norm and (due to the Fatou theorem) closed by measure.

Let us show that the functional φ is lower semi-continuous in measure; this means that the inequality

    lim inf_{i→∞} φ(p_i) ≥ φ(p)    (8.2.19)

holds for any sequence p_1, p_2, ... of elements of L_1(X,ν) which converges in measure ν to an element p of L_1(X,ν).

Using the Fatou theorem, we have for any vector b ∈ ℝ^m:

    b′(lim inf_{i→∞} D((p_i)₊)) b = lim inf_{i→∞} ∫ [b′g(x)g′(x)b / (p_i)₊(x)] ν(dx) − b′Ab ≥ ∫ [b′g(x)g′(x)b / p₊(x)] ν(dx) − b′Ab = b′D(p₊)b.

In this way, the inequality

    lim inf_{i→∞} D((p_i)₊) ≥ D(p₊)

holds. From this inequality and the continuity and monotonicity of the functional Φ, we obtain (8.2.19).

Theorem 6 of §5, Chapter 10 of Kantorovich and Akilov (1977) states a generalized Weierstrass-type theorem, viz. that a convex functional on L_1(X,ν) which is lower semicontinuous in measure attains its minimum value on any subset of L_1(X,ν) that is closed by measure and bounded in norm. Using Proposition 8.2.4 and the above results, we get that the functional φ attains its minimum value on S at a certain point p* ∈ S. By a) and b) we have Φ(D(p*)) < ∞ and ||D(p*)|| < ∞. Let us show that ∫ p*(x)ν(dx) = 1; the inclusion p* ∈ 𝒫 and the statement of the theorem follow from this, too. Assuming the contrary, let

    r = ∫ p*(x) ν(dx) < 1

and put q*(x) = p*(x)/r. We have q* ∈ 𝒫 and

    D(q*) = r(D(p*) + A) − A,    D(p*) − D(q*) = (1 − r)(D(p*) + A).

Since ||D(p*)|| < ∞ and the functions g_1,...,g_m are linearly independent on X, the functions

    (p*(x))^{−1/2} g_i(x),    i = 1,...,m,

have the same property. Analogously to Proposition 8.2.3, we can see that the matrix D(p*) + A is positive definite. From the monotonicity of the functional Φ we get Φ(D(p*)) > Φ(D(q*)), but this inequality contradicts the optimality of p*. Consequently, r = 1 and p* ∈ 𝒫: the theorem is proved.
Let us turn now to the uniqueness of the optimal density.

Proposition 8.2.5. Let the conditions of Theorem 8.2.1 be satisfied and assume that either the functional Φ is strictly convex or the condition c) holds. Then the optimal density p* ∈ 𝒫 exists and is unique modulo ν.

This statement is a simple corollary of Theorem 8.2.1 and Proposition 8.2.4.

8.2.4 Necessary and sufficient optimality conditions

The following statement is analogous to the classical equivalence theorem of regression design theory.

Theorem 8.2.2. Let the assumptions of Theorem 8.2.1 be fulfilled and the functional Φ be differentiable. Then a necessary and sufficient optimality condition for a density p* is the fulfilment of the equality

    ψ(x, p*) = c(p*)    (8.2.20)

for ν-almost all x in X. Here

    c(p) = tr Φ°(D(p))(D(p) + A),    ψ(x,p) = g′(x) Φ°(D(p)) g(x) / p²(x).    (8.2.21)

Proof. We shall use the necessary and sufficient condition of optimality for a convex functional, differentiable along all admissible directions, on a convex set (see e.g. Ermakov (1983), p. 55, or Ermakov and Zhigljavsky (1987), p. 105). Let us compute the derivative

    Π(p,q) = (d/dt) φ((1 − t)p + tq) |_{t=0+}    (8.2.22)

and find the density p* in the set 𝒫 such that Π(p*, h) ≥ 0 for all densities h on X. Simple calculations (see Zhigljavsky (1985)) yield

    Π(p,q) = c(p) − ∫ ψ(x,p) q(x) ν(dx).

So Π(p,h) ≥ 0 for all h if and only if the inequality ψ(x,p) ≤ c(p) holds for ν-almost all x in X. The statement of the theorem follows from the equality

    ∫ ψ(x,p) p(x) ν(dx) = c(p),

which is easily verified and is valid for all densities p ∈ 𝒫. The theorem is proved.

Suppose now that the optimality criterion is nondifferentiable and of minimax type, i.e. it is expressed as

    Φ(B) = max_{v∈V} Φ_v(B),    (8.2.23)

where V is a compact set and all functionals Φ_v are convex and differentiable.

Theorem 8.2.3. Let the functional Φ have the form (8.2.23), where all functionals Φ_v (v ∈ V) satisfy the assumptions of Theorem 8.2.2 and the function v ↦ Φ_v(B) is continuous for any fixed matrix B from N. Then the fulfilment of the inequality

    sup_{x∈X} ψ_v(x, p*) ≤ c_v(p*)

for a certain v from the set

    V(p*) = {arg max_{v∈V} Φ_v(D(p*))}

is a necessary and sufficient condition for the optimality of the density p*; here ψ_v and c_v are defined by (8.2.21), with Φ_v substituted for Φ.

The proof is analogous to the proof of Theorem 8.2.2 and uses the formula

    Π(p,q) = max_{v∈V(p)} Π_v(p,q),

where the derivative Π_v is defined by (8.2.22), with Φ_v substituted for Φ, and equals

    Π_v(p,q) = c_v(p) − ∫ ψ_v(x,p) q(x) ν(dx).

Note finally that Theorem 8.2.3 is also analogous to the corresponding statement in regression design theory.

8.2.5 Construction and structure of optimal densities

Consider first the structure of optimal densities for the linear criterion (8.2.8). In this case

    Φ°(B) = d(tr LB)/dB = L,

and the optimality condition (8.2.20) is equivalent to

    p*(x) = (g′(x) L g(x))^{1/2} / ∫ (g′(z) L g(z))^{1/2} ν(dz).    (8.2.24)

It should be noted that the expression (8.2.24) can be obtained by simpler tools, using the Cauchy–Schwarz inequality, and that the linearly optimal density is always unique modulo ν. The following example presents a case when the linearly optimal density does not belong to the set 𝒫.

Example. Let the functional Φ have the form (8.2.5) with m = 2, a_1 = 1, a_2 = 0. The condition a) does not hold, and the optimal density has the form p* = |g_1| / ∫ |g_1| dν. If the function g_1 vanishes on a subset Z of X with measure ν(Z) > 0, but the function g_2 does not vanish on this subset, then p* does not belong to 𝒫. At the same time, if g_1 does not vanish on X and g_2²/|g_1| ∈ L_1(X,ν), then p* ∈ 𝒫.
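The optimality of the density (8.2.24) for a linear criterion can be checked directly on a discretized X; the grid, the functions g_1, g_2 and the weights a in the following sketch are our own illustrative choices.

```python
import numpy as np

# Discretized check of formula (8.2.24) for the linear criterion tr(L D(p))
# with L = diag(a): the density p*(x) proportional to (g'(x) L g(x))^{1/2}
# minimizes the p-dependent part, i.e. the integral of g'(x) L g(x) / p(x).
M = 1000
x = (np.arange(M) + 0.5) / M
w = 1.0 / M                                             # nu-weight of each grid cell
G = np.vstack([np.sin(np.pi * x), np.cos(np.pi * x)])   # g1, g2 (our choice)
a = np.array([1.0, 2.0])

quad = a @ G ** 2                                       # g'(x) L g(x)
p_star = np.sqrt(quad) / (np.sqrt(quad).sum() * w)      # formula (8.2.24)

crit = lambda p: (quad / p).sum() * w                   # integral of g' L g / p
lower = (np.sqrt(quad).sum() * w) ** 2                  # Cauchy-Schwarz lower bound

print(crit(p_star) - lower)       # essentially zero: p* attains the bound
print(crit(np.ones(M)) - lower)   # the uniform density is strictly worse
```

By the Cauchy–Schwarz inequality the criterion value at p* equals the lower bound exactly, which is what the first printed difference confirms.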
Consider now the structure of the optimal density, and algorithms for its construction, for an arbitrary differentiable criterion.

If the functional Φ is differentiable but cannot be represented in the form (8.2.8), then the optimal density again has the form (8.2.24), although now the matrix

    L = Φ°(D(p*))

is unknown and depends upon the optimal density p*. This is a simple corollary of Theorem 8.2.2.

Hence, the problem of optimal density construction may be considered as the problem of constructing the optimal matrix L. It can be solved by general global optimization

techniques. If the values Φ(B) depend only on the diagonal elements of the matrices B, then the matrix L is diagonal and the extremal problem is not very complicated.

If the number m is large and the set X is either discrete or of small dimension, then the above algorithms may be less efficient than those described below. The latter exploit the features of the problem and are analogous to the construction methods of optimal regression design. They are pseudogradient-based algorithms in 𝒫, using the expression for the derivative Π.
The general form of these methods is

    p_{k+1}(x) = (1 − α_k) p_k(x) + α_k h_k(x),    k = 1, 2, ...    (8.2.25)

Here p_1 ∈ 𝒫 is an initial density and α_1, α_2, ... is a numerical sequence whose choice may be the same as in classical regression design. If α_k > 0, then the density h_k has to satisfy the inequality Π(p_k, h_k) ≤ 0; for example, one such density is proportional to the positivity indicator of the function

    ψ(x, p_k) − c(p_k) − ε_k    (8.2.26)

with ε_k ≥ 0. If α_k < 0, then the relations Π(p_k, h_k) ≥ 0 and p_{k+1} ≥ 0 are to be satisfied. The former is satisfied, for instance, for the density h_k proportional to the negativity indicator of the function (8.2.26) with ε_k ≤ 0; the latter relation is equivalent to the nonnegativity of the density p_{k+1}(x).
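The iteration above can be sketched on a discretized X. Everything concrete in the following code (the grid, the choice g′(x)g(x) = e^{4x}, the steps α_k = 1/(k+4) and ε_k = 0) is our own illustrative assumption; the criterion is the linear one Φ(B) = tr B, for which the optimal value is known in closed form from (8.2.24), so convergence can be monitored.

```python
import numpy as np

# Sketch of the pseudogradient iteration (8.2.25) for Phi(B) = tr B on a
# grid of X = [0, 1].  h_k is proportional to the positivity indicator of
# psi(x, p_k) - c(p_k), cf. (8.2.26) with eps_k = 0.
M = 400
x = (np.arange(M) + 0.5) / M
w = 1.0 / M
G2 = np.exp(4.0 * x)                        # g'(x) g(x) on the grid (our choice)
crit = lambda p: (G2 / p).sum() * w         # tr D(p) up to the constant tr A

p = np.ones(M)                              # p_1: uniform initial density
start = crit(p)
for k in range(1, 201):
    psi = G2 / p ** 2                       # psi(x, p_k), cf. (8.2.21)
    c = (psi * p).sum() * w                 # c(p_k) = integral of psi p dnu
    ind = (psi > c).astype(float)
    if ind.sum() == 0:                      # psi constant: optimum reached
        break
    h = ind / (ind.sum() * w)               # h_k with Pi(p_k, h_k) <= 0
    alpha = 1.0 / (k + 4)
    p = (1 - alpha) * p + alpha * h         # step (8.2.25)

opt = (np.sqrt(G2).sum() * w) ** 2          # value at the optimal density (8.2.24)
print(start, crit(p), opt)                  # the criterion moves toward the optimum
```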

8.2.6 Structure of optimal densities for nondifferentiable criteria

Unlike the case of differentiable criteria, the necessary and sufficient optimality condition for nondifferentiable criteria (see Theorem 8.2.3) is nonconstructive: it cannot be used for the construction of optimal densities, but only for verifying their optimality. Nevertheless, for a large class of nondifferentiable criteria we are able to show that the structure of the optimal densities is the same as above (cf. (8.2.24)).

Theorem 8.2.4. Suppose that there exist a functional Φ and a sequence {Φ_i} of functionals on N which are convex, increasing, continuous, and for which the conditions a), b) are valid. Let also the functionals Φ_i be differentiable, let the condition c) hold, and let

    Φ_{i+1}(B) ≤ Φ_i(B),    i = 1, 2, ...,    lim_{i→∞} Φ_i(B) = Φ(B)

for each B ∈ N. Then the Φ-optimal density p* exists, is unique modulo ν, and has the form (8.2.24).

Proof. Denote the Φ_i-optimal density by p_i*. The existence and uniqueness of the densities p* and p_i* follow from Theorem 8.2.1 and Proposition 8.2.5. Theorem 8.2.2 gives

    p_i*(x) = (g′(x) L_i g(x))^{1/2} / ∫ (g′(z) L_i g(z))^{1/2} ν(dz),    (8.2.27)

where L_1, L_2, ... are matrices from N. Without loss of generality, we can assume that ||L_i|| = 1 for all i = 1, 2, .... Let us now choose a subsequence {L_{i_j}} of the sequence {L_i} converging to a certain matrix L; notice that L ∈ N and ||L|| = 1.

Define the density p by formula (8.2.24) and note that p is the pointwise limit of the subsequence {p_{i_j}*}. Analogously to the proof of Lemma 1.11 in Fedorov (1979), one is able to show that if a limit point of the sequence {p_i*} exists, then this point is the Φ-optimal density p*. By virtue of the uniqueness of the limit, we have p* = p. Finally, from the uniqueness of the Φ-optimal density p* we obtain that the limit of the sequence {L_i} exists and equals L. The theorem is proved.

It follows from the theorem that, for many nondifferentiable criteria, the problem of optimal density construction again reduces to the optimal choice of the matrix L in the representation (8.2.24).

Two nondifferentiable criteria of special importance are considered below.

Let Φ have the form (8.2.9), i.e. let it be the E-optimality criterion. Determine the functionals Φ_i by the formula

    Φ_i(B) = (tr Bⁱ)^{1/i},    B ∈ N.

We can apply the classical inequality

    (Σ_{k=1}^m a_k^j)^{1/j} ≤ (Σ_{k=1}^m a_k^i)^{1/i},    1 ≤ i ≤ j ≤ ∞,    (8.2.28)

valid for any nonnegative numbers a_1,...,a_m. Now we obtain the monotonicity of the convergence of Φ_i(B) to Φ(B) from the inequality (8.2.28) and the representation

    tr Bⁱ = Σ_{k=1}^m λ_kⁱ,

where λ_1,...,λ_m are the eigenvalues of the matrix B. Hence, the analogue of Theorem 8.2.4 is applicable and the E-optimal density has the form (8.2.24).
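The monotone convergence just described is easy to check numerically; the matrix B below is our own example with explicit eigenvalues.

```python
import numpy as np

# Phi_i(B) = (tr B^i)^{1/i} decreases monotonically in i and converges to
# lambda_max(B), in line with inequality (8.2.28).
B = np.diag([0.5, 1.0, 2.0, 3.0])            # eigenvalues are explicit here
phis = [np.trace(np.linalg.matrix_power(B, i)) ** (1.0 / i) for i in range(1, 40)]

assert all(a >= b - 1e-12 for a, b in zip(phis, phis[1:]))  # Phi_{i+1} <= Phi_i
print(phis[0], phis[-1])   # from tr B = 6.5 down toward lambda_max = 3
```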
Let us turn now to the MV-optimality criterion (8.2.6). First, let us simplify the statement of Theorem 8.2.3 for the MV-criterion. In the representation (8.2.23) we have

    V = {1,...,m},    Φ_j(B) = tr E_j B,

where all elements of the matrix E_j equal zero, except the (j,j)-th element, which equals one. So

    ψ_j(x,p) = g_j²(x)/p²(x),    c_j(p) = d_jj(p) + a_jj.

We shall show now that Theorem 8.2.4 can be applied to the MV-criterion. Determine the sequence {Φ_i} by

    Φ_i(B) = (Σ_{k=1}^m b_kk^i)^{1/i}.

The convexity of these criteria follows from the Minkowski inequality, and their monotone convergence to Φ follows by (8.2.28). Theorem 8.2.4 says that if the condition c) holds, then the MV-optimal density p* exists, is unique modulo ν, and has the form (8.2.24), where the matrix L is diagonal. This means that if the condition c) holds, then the MV-optimal density exists and has the form

    p_v(x) = (Σ_{k=1}^m v_k g_k²(x))^{1/2} / ∫ (Σ_{k=1}^m v_k g_k²(z))^{1/2} ν(dz),    (8.2.29)

where v = (v_1,...,v_m) is a vector of nonnegative weights with v_1 + ... + v_m = 1.

The following statement expresses a somewhat stronger result.

Theorem 8.2.5. The MV-optimal density exists and can be represented in the form (8.2.29), where

    v = arg max_{v∈U} G(v),    U = {v = (v_1,...,v_m): v_k ≥ 0, Σ_k v_k = 1},

    G(v) = [∫ (Σ_{k=1}^m v_k g_k²(x))^{1/2} ν(dx)]² − Σ_{k=1}^m v_k a_kk

(here a_kk is the k-th diagonal element of the matrix A).

Proof. By virtue of Theorem 8.2.1, the MV-optimal density exists. Consider the value

    I = min_{p∈𝒫} max_{1≤i≤m} d_ii(p),

where d_ii(p) are the diagonal elements of the matrix D(p). We have

    I = min_{p∈S} max_{1≤i≤m} d_ii(p) = min_{p∈S} max_{v∈U} Σ_{i=1}^m v_i d_ii(p).

The proof of Theorem 8.2.1 indicates that the set S is closed by measure and bounded in norm, and that the functional φ is convex and lower semicontinuous in measure. Using now the minimax theorem of Levin (1985), p. 293, we have

    I = max_{v∈U} I(v),    I(v) = min_{p∈S} Σ_{i=1}^m v_i d_ii(p).

By virtue of Section 8.2.5, the minimum I(v) is attained at the density p_v and equals G(v). This fact completes the proof.

Let us remark that under some additional restrictions on the functions g_1,...,g_m, Mikhailov (1984) proved a statement similar to Theorem 8.2.5; the latter is due to Mikhailov and Zhigljavsky (1988).

8.2.7 Connection with the regression design theory

First we shall point out some general differences between the above exposition and classical regression design.

The first difference concerns the existence of an optimal design. The two conditions a) and b) on the optimality criterion are required to ensure the existence of the optimal density. These conditions do not figure in regression design theory; however, they are not restrictive.

The second is that in the present case the discreteness of optimal measures is not desirable, and is not even admissible.

The third is that if the functions g_1,...,g_m are ν-regular (see condition c)), then according to Proposition 8.2.4 a convex functional Φ on N induces the strictly convex functional φ on the set of densities 𝒫. This allows one to guarantee the uniqueness of optimal densities, e.g. for the E- and MV-criteria.

The fourth is that the structure of optimal densities is the same: for a large class of criteria, the optimal densities have the form (8.2.24).

Finally, the fifth difference is that there are no conditions concerning the set X except its measurability, and the functions g_1,...,g_m are not necessarily continuous.

In some cases the matrix A depends on the unknown integrals θ and, hence, the optimal densities depend on them too. In these cases the extremal problem (8.2.1) is analogous to the nonlinear regression design problem, and the optimal density corresponds to a locally optimal design. Sequential, Bayesian or minimax approaches can be used to determine the optimal densities.

The optimality results (Theorems 8.2.2 and 8.2.3) are analogous to the equivalence theorems of classical regression design; apparently, theorems analogous to the duality theorems of Pukelsheim (1980) can also be proved.

8.3 Projection estimation of multivariate regression

This section deals with some theoretical aspects of the asymptotically optimal projection estimation of a multivariate regression. In particular, it is shown that, in a rather general situation, the uniform random choice of the points at which the regression function is evaluated, together with Monte Carlo estimation of its Fourier coefficients, provides an asymptotically optimal design and projection estimation procedure. (For a detailed exposition, reference is made to Zhigljavsky (1985).)

8.3.1 Problem statement

A general scheme of the regression experiment is as follows. Let the evaluation result at a point x ∈ X be a realization of a (partly) random function

    y(x) = f(x) + ξ(x),

where ξ is a zero-mean random variable (the error at x) and f is a regression function given on X ⊂ ℝⁿ, belonging to a functional class ℱ. Obtaining for given x_j ∈ X (j = 1,...,N) the values y_j = y(x_j), one is interested in estimating the regression function f(x) = E y(x). The problem requires a priori information for its correct statement, consisting in the indication of the functional class ℱ and of properties of the distributions of the random errors ξ(x_j) = y(x_j) − f(x_j). We shall suppose that the errors are uncorrelated, that their variance σ²(x) = E ξ²(x) is continuous and bounded but may be unknown, and that ℱ is an infinite-dimensional functional class (in this case the estimation problem is nonparametric).
Many approaches to nonparametric regression function estimation are known, see e.g. Prakasa Rao (1983); we shall confine ourselves to the projection estimation of a multivariate regression function, as this problem has not been studied thoroughly.

To construct a projection estimate of a regression function f belonging to a functional class ℱ, one supposes that an increasing sequence {L_m} of m-dimensional spaces is given such that

    the union ∪_m L_m is dense in ℱ,    (8.3.1)

and that the evaluation number N tends to infinity (or is sufficiently large). The projection estimation problem consists of choosing the dimensions m = m(N) of the spaces L_m, the sequence of passive designs {x_1,...,x_N}, and a parametric regression estimation method under the supposition f ∈ L_m, in such a way that the obtained estimate approximates the true regression function f ∈ ℱ in a well-defined sense. Sometimes the algorithm of selecting the points x_1,...,x_N is assumed to be given and thus is not subject to optimization.

A projection estimate of f is representable as

    f̂(x) = Σ_{i=1}^m θ̂_i f_i(x),    (8.3.2)

where {f_1,...,f_m} is a basis of the space L_m and θ̂_1,...,θ̂_m are estimates of the unknown parameters of the linear regression

    Σ_{i=1}^m θ_i f_i(x),    (8.3.3)

computed using x_j and y_j (j = 1,...,N). We shall assume that the estimation method of the θ_i is linear with respect to y_1,...,y_N.
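A complete instance of the scheme just described can be sketched in a few lines; the regression function, the orthonormal cosine basis, and the values of m, N and σ in the following code are our own illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Projection estimate (8.3.2) on X = [0, 1]: least-squares estimation of the
# coefficients of an orthonormal cosine basis from noisy evaluations of f
# at random points.
f = lambda t: t * (1.0 - t)
m, N, sigma = 6, 2000, 0.1
x = rng.random(N)
y = f(x) + sigma * rng.standard_normal(N)

def design(t):
    # Basis {1, sqrt(2) cos(pi k t), k = 1..m-1} of the space L_m.
    cols = [np.ones_like(t)] + [np.sqrt(2) * np.cos(np.pi * k * t)
                                for k in range(1, m)]
    return np.column_stack(cols)

theta_hat, *_ = np.linalg.lstsq(design(x), y, rcond=None)   # linear in y_1..y_N

grid = (np.arange(2000) + 0.5) / 2000
err2 = np.mean((design(grid) @ theta_hat - f(grid)) ** 2)   # approx of the L2 error
print(theta_hat[0], err2)   # theta_1 is close to the mean value 1/6; err2 is small
```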

8.3.2 Bias and random inaccuracies of nonparametric estimates

Let the number N of regression function evaluations as well as the estimation algorithm be fixed. We have the following decomposition of the inaccuracy f − f̂ corresponding to an estimate f̂ of f:

    f(x) − f̂(x) = (f(x) − E f̂(x)) + (E f̂(x) − f̂(x)).

Consequently, for an arbitrary metric ρ on ℱ we have

    ρ(f, f̂) ≤ ρ(f, E f̂) + ρ(E f̂, f̂).    (8.3.4)

A parametric problem of mathematical statistics is often connected with the second summand on the right-hand side of (8.3.4), and this term does not depend explicitly on the unknown function f. The first summand on the right-hand side of (8.3.4) contains the unknown function f, and with this term the compromise minimax criterion

    sup_{f∈ℱ} ρ(f, E f̂)    (8.3.5)

is usually associated. Analogously, instead of the right-hand side of (8.3.4), the quantity

    sup_{f∈ℱ} ρ(f, E f̂) + E ρ(E f̂, f̂)

is often considered.
If a measure λ can be defined on the set ℱ, reflecting additional information about the unknown function f ∈ ℱ, then, instead of (8.3.5), the first summand on the right-hand side of (8.3.4) is usually replaced by the Bayesian criterion

    ∫ ρ(f, E f̂) λ(df).

As for ρ, the quadratic metric is considered below; for it, as is easily seen, instead of the inequality (8.3.4) the equality

    E ∫ (f(x) − f̂(x))² ν(dx) = ∫ (f(x) − E f̂(x))² ν(dx) + E ∫ (E f̂(x) − f̂(x))² ν(dx)    (8.3.6)

takes place, where ν(dx) is a given probability measure on (X, ℬ). Consequently, the mean square summed inaccuracy

    B² + V² = sup_{f∈ℱ} ∫ (f − E f̂)² dν + E ∫ (E f̂ − f̂)² dν    (8.3.7)

completely characterizes the error of the method. The first term in (8.3.7), i.e. the quantity

    B² = sup_{f∈ℱ} ∫ (f − E f̂)² dν,

is the square of the so-called bias inaccuracy, while the second term

    V² = E ∫ (E f̂ − f̂)² dν

is the mean square of the random inaccuracy. Since for every x ∈ X the variance of the estimate f̂(x) equals E(f̂(x) − E f̂(x))², the square of the random inaccuracy V can be written in the form

    V² = ∫ (var f̂(x)) ν(dx).    (8.3.8)
Consider now some properties of the random inaccuracy of projection estimates.

Let the number N and the space L_m be fixed, let F(x) = (f_1(x),...,f_m(x))′ be the vector of the ν-orthonormal base functions of the space L_m, let the evaluations of the regression function be performed at the points x_1,...,x_N, let

    θ_i = ∫ f(x) f_i(x) ν(dx)

be the Fourier coefficients of f with respect to the functions f_i (i = 1,...,m), θ = (θ_1,...,θ_m)′, and let

    f̂(x) = θ̂′ F(x)

be the regression function estimate, where θ̂ is the vector of linear estimates of θ. Assume that σ²(x) = σ² = const for all x ∈ X.

By virtue of (8.3.2) and (8.3.8), V² is representable as

    V² = tr cov θ̂.

According to the classical Gauss–Markov theorem of regression analysis, the best linear unbiased estimates are the least square estimates, for which, in particular, the quantity tr cov θ̂ is minimal. The main object of this subsection is a lower bound for V; therefore we shall consider only the least square estimates, for which

    cov θ̂ = (σ²/N) M⁻¹(ε_N),    (8.3.9)

where

    M(ε_N) = N⁻¹ Σ_{j=1}^N F(x_j) F′(x_j)

is the normalized information matrix of the experimental design, which can be written as

    ε_N = {x_1, ..., x_N; 1/N, ..., 1/N}.    (8.3.10)

Proposition 8.3.1. Let the functions f_1,...,f_m be ν-orthonormal and uniformly bounded by a constant K, let σ²(x) = σ² < ∞, let the estimate f̂ of the regression function f be the projection (8.3.2), where the θ̂_i are linear statistics, and let the matrix M(ε_N) be nondegenerate. Then the inequality

    V² ≥ σ² m / (N K²)    (8.3.11)

holds.

The proof is based on the following statement.

Lemma 8.3.1. If A is a positive definite matrix of order m×m, then the inequality

    tr A⁻¹ ≥ m² / tr A    (8.3.12)

holds.

Proof. Let B be a positive definite matrix of order m×m and λ_1,...,λ_m be its eigenvalues. By virtue of

    tr B = Σ_{i=1}^m λ_i,    det B = ∏_{i=1}^m λ_i,

the inequality between the geometric and arithmetic means

    (∏_{i=1}^m λ_i)^{1/m} ≤ m⁻¹ Σ_{i=1}^m λ_i

can be rewritten as

    tr B ≥ m (det B)^{1/m}.    (8.3.13)

Applying the latter inequality to the matrices B = A⁻¹ and B = A, we obtain

    tr A⁻¹ ≥ m (det A⁻¹)^{1/m} = m (det A)^{−1/m} ≥ m ((tr A)/m)^{−1} = m²/tr A.

The lemma is proved.
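A quick numerical check of inequality (8.3.12), using randomly generated positive definite matrices (the matrices themselves are our own choice):

```python
import numpy as np

rng = np.random.default_rng(3)

# Lemma 8.3.1: tr(A^{-1}) >= m^2 / tr(A) for positive definite A,
# with equality when A is a multiple of the identity matrix.
m = 5
for _ in range(100):
    R = rng.standard_normal((m, m))
    A = R @ R.T + 0.1 * np.eye(m)        # positive definite by construction
    assert np.trace(np.linalg.inv(A)) >= m ** 2 / np.trace(A) - 1e-9

A_eq = 2.0 * np.eye(m)                    # equality case
print(np.trace(np.linalg.inv(A_eq)), m ** 2 / np.trace(A_eq))   # both equal 2.5
```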

Proof of Proposition 8.3.1. Let ε(dx) be any approximate design, i.e. any probability measure on (X, ℬ), and set

    M(ε) = ∫ F(x) F′(x) ε(dx).

In particular, ε(dx) may be of the form (8.3.10).

For all approximate designs ε such that det M(ε) ≠ 0, the application of inequality (8.3.12) to the matrix A = M(ε) yields

    tr (M(ε))⁻¹ ≥ m² / tr M(ε) = m² / Σ_{i=1}^m ∫ f_i²(x) ε(dx) ≥ m/K².

Considering also (8.3.9), we obtain (8.3.11); the proof is completed.

Inequality (8.3.11) implies that there exists a constant c > 0, independent of m and N, such that V² ≥ c m/N. Therefore the following statement (which will play a substantial role below) holds.

Theorem 8.3.1. Let σ²(x) = σ² = const, N → ∞, m = m(N) → ∞, m/N → 0. Then the convergence order of the mean square of the inaccuracy (8.3.7) cannot be less than that of

    B*² + m/N    (8.3.14)

for any projection estimate of the form (8.3.2), any method of linear parametric estimation, and any sequence of passive designs (x_1,...,x_N), where B* = B*(m) is the minimal value of the bias inaccuracy B.

8.3.3 Examples of asymptotically optimal projection procedures with deterministic designs

Omitting the proofs, which are rather tedious, we formulate here two examples of asymptotically optimal projection procedures involving deterministic designs (i.e. rules for the choice of the points x_1,...,x_N).

Let Z be the set of nonnegative integers, K = Zⁿ the multi-index set, k = (k_1,...,k_n) ∈ K a multi-index, {f_k(x), k ∈ K} the complete ν-orthonormal set of the functions f_k(x) = exp{−2πi(k,x)} on X = [0,1]ⁿ, and L_m the set of linear combinations of the functions f_k(x) corresponding to the subsets K_q = {k ∈ K: ||k|| ≤ q} of K, where

    m = card K_q = Σ_{k∈K_q} 1,

||·|| is a given positive function on K, and (k,x) is the scalar product of k ∈ K and x ∈ X.

The functional set H_n^a, for integer a ≥ 1, consists of the real functions defined on X = [0,1]ⁿ having continuous partial derivatives

    ∂^β f / (∂x_1^{β_1} ⋯ ∂x_n^{β_n}),    0 ≤ β_j ≤ a (j = 1,...,n),    β = β_1 + ... + β_n.

Define first the functional class ℱ = ℱ(H_n^a) as the subset of H_n^a containing all periodic functions of H_n^a with period 1 with respect to each coordinate. For each f ∈ ℱ(H_n^a) one has |θ_k| ≤ L ||k||^{−a}, where L is a constant,

    θ_k = ∫ f(x) exp{−2πi(k,x)} dx,    k ∈ K,

are the Fourier coefficients of f, and

    ||k|| = ∏_{j=1}^n max{1, |k_j|}.    (8.3.15)

Theorem 8.3.2. Set q = N^{1/(2a)}, select the lattice grids Ξ_N(5) of Section 2.2.1 as the experimental design, and estimate the Fourier coefficients θ_k by the least square algorithm. Then the summed mean square inaccuracy (8.3.7) of the projection estimate of a regression function f ∈ ℱ(H_n^a) decreases with the rate

    B² + V² = O(N^{−1+1/(2a)} (log N)^{n−1}),    N → ∞.    (8.3.16)

This is the optimal decrease rate of B² + V² on ℱ(H_n^a) for any linear estimation method, any choice q = q(N), and any sequence of experimental designs; the proof of this statement can be found in Zhigljavsky (1985) and Ermakov and Zhigljavsky (1984).

Turn now to the functional set W_n^a consisting of the functions f on X = [0,1]ⁿ which have the derivatives

    ∂^β f / (∂x_1^{β_1} ⋯ ∂x_n^{β_n}),    β = β_1 + ... + β_n ≤ a,

and define ℱ = ℱ(W_n^a) as the subset of W_n^a containing all periodic functions with period 1 with respect to each coordinate.

Zhigljavsky (1985) proved the analogue of Theorem 8.3.2 for the class ℱ(W_n^a). Its formulation would coincide with that of Theorem 8.3.2 if we substitute ℱ(W_n^a), the cube grids Ξ_N(1), the norm ||k|| = max_{1≤j≤n} max{1, |k_j|}, and the rate N^{−1+n/(2a)} (N → ∞) for ℱ(H_n^a), the lattice grids Ξ_N(5), and the relations (8.3.15) and (8.3.16), respectively.

8.3.4 Projection estimation via evaluations at random points

We shall suppose below that the points x_1, x_2,... at which the regression function f is
evaluated are random. In this case the most popular linear parametric estimation methods

are the least square and ordinary Monte Carlo. The least square method minimizes the
random inaccuracy V, but introduces a bias leading to a slight increase of the bias
inaccuracy B. Usually the increase of B is insignificant from the asymptotic point of view
and thus the least square method may usually be included into an asymptotically optimal
projection estimation procedure. The drawback of the least square estimation method is its
numerical complexity. The standard Monte Carlo method is much simpler than the least
square approach and produces unbiased estimates of the Fourier coefficients. This way,
its use gives the minimal value of the bias inaccuracy B, i.e.

(8.3.17)

On the other hand, the random inaccuracy V is larger than for the least square method. We
shall show that the increase of V is often insignificant from the asymptotic point of view
and thus the Monte Carlo estimation method may also be included into the asymptotically
optimal projection estimation procedure.
Let x_1,...,x_N be independent and identically distributed with a positive (on X) density
p(x) and {f_1,...,f_m} be a base of L_m. The standard Monte Carlo estimates of the Fourier
coefficients

    θ_i = ∫ f(x) f_i(x) ν(dx),   i = 1,...,m,

are of the form

    θ̂_i = N^{−1} Σ_{j=1}^N y(x_j) f_i(x_j)/p(x_j),   i = 1,...,m.   (8.3.18)

Set

    g_i(x) = (f²(x) + σ²(x))^{1/2} f_i(x),   i = 1,...,m,   g(x) = (g_1(x),...,g_m(x))'.

The following statement is valid.

Lemma 8.3.2. Let x_1,...,x_N be independent and distributed according to a positive
probability density p on X, the evaluation errors ξ(x_j) = y(x_j) − f(x_j) be uncorrelated, and
their variance var ξ(x) = σ²(x) be bounded. Then the estimates (8.3.18) are unbiased and

    cov θ̂ = E(θ̂ − θ)(θ̂ − θ)' = N^{−1} [ ∫ (g(x)g'(x)/p(x)) ν(dx) − θθ' ].   (8.3.19)

The proof can be found in Zhigljavsky (1985).
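As an illustration, the estimates (8.3.18) are straightforward to reproduce numerically. The following sketch estimates the first Fourier coefficients of a function on X = [0,1] with respect to a ν-orthonormal cosine basis, using the uniform density p and noiseless evaluations; the function f, the basis, and the sample size are illustrative assumptions, not taken from the text.

```python
import numpy as np

# Standard Monte Carlo estimates (8.3.18):
#   theta_hat_i = N^{-1} sum_j y(x_j) f_i(x_j) / p(x_j),
# sketched for X = [0,1], v = Lebesgue measure, p = uniform density,
# and noiseless evaluations y = f; all concrete choices are illustrative.
rng = np.random.default_rng(0)

def f(x):                      # regression function (an assumption)
    return x * (1.0 - x)

def basis(i, x):               # v-orthonormal cosine basis on [0, 1]
    return np.ones_like(x) if i == 0 else np.sqrt(2.0) * np.cos(np.pi * i * x)

N, m = 200_000, 4
x = rng.uniform(0.0, 1.0, N)   # i.i.d. points with density p(x) = 1
y = f(x)                       # evaluations of f (evaluation error = 0 here)

theta_hat = np.array([np.mean(y * basis(i, x)) for i in range(m)])  # p(x) = 1

# exact Fourier coefficients of x(1-x) in this basis, for comparison
theta = np.array([1.0 / 6.0] +
                 [-np.sqrt(2.0) * (1 + (-1) ** i) / (np.pi * i) ** 2
                  for i in range(1, m)])
err = np.max(np.abs(theta_hat - theta))
```

Since the estimates are unbiased (Lemma 8.3.2), err shrinks at the usual N^{−1/2} Monte Carlo rate.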


The unbiasedness of θ̂ yields (8.3.17) for the bias inaccuracy B. Consider now the random error V:

    V² = V²(p) = E ∫ ( Σ_{i=1}^m θ̂_i f_i(x) − Σ_{i=1}^m θ_i f_i(x) )² ν(dx) =

    = Σ_{i,l=1}^m E(θ̂_i − θ_i)(θ̂_l − θ_l) ∫ f_i(x) f_l(x) ν(dx) =

    = N^{−1} tr [ D(p) ∫ F(x)F'(x) ν(dx) ],

where

    D(p) = N cov θ̂

is the normalized covariance matrix, representable in the form (8.2.3) with the substitution of
θ for τ. Hence it follows that the optimal choice criterion for the density p is linear and
has the form (8.2.8) with

    L = ∫ F(x)F'(x) ν(dx),

and the optimal density has the form (8.2.24). The minimal value of the squared random
inaccuracy is attained at this density and equals

    (8.3.20)

Due to Theorem 8.3.1, (8.3.20) is not less than cm/N, where c is a positive number. At
the same time, in a rather general case, there exist a density p and a constant C ≥ 1 such that

    V²(p) ≤ Cm/N.   (8.3.21)

The inequality (8.3.21) is valid, for instance, for the cases considered in Section 8.3.3
and in the case, common in optimization theory, when ν(X) < ∞, the functions f(x) and σ²(x)
are bounded, and the base functions f_i (i = 1,...,m) are ν-orthonormal and uniformly
bounded with respect to i. In the latter case, the uniform (on X) density can be chosen as
p. It follows that for orthonormal functions f_1,...,f_m there holds

    V²(p) = N^{−1} [ ∫ ((f²(x) + σ²(x)) Σ_{i=1}^m f_i²(x)/p(x)) ν(dx) − Σ_{i=1}^m θ_i² ].
If (8.3.21) is fulfilled, then the asymptotic relation

    V²(p) ≈ m/N   for m → ∞, N → ∞   (8.3.22)

is valid, i.e. the random error reaches the optimal order of decrease for m, N → ∞. Taking
into account that the bias inaccuracy is minimal, we get the following statement.

Theorem 8.3.3. Let (8.3.21) be fulfilled for some constant C ≥ 1 and positive density p
on X. Then the independent random choice of the points x_1,...,x_N in accordance with the
density p, the method of parametric estimation (8.3.18), and a suitable choice of m = m(N)
(the latter leading to equal decrease orders of Σ²(F, L_m) and (8.3.22)) form an asymptotically
optimal procedure of projection estimation of a regression function f ∈ F.
As indicated above, for the functional classes F and sequences of spaces {L_m}
considered in Section 8.3.3, the condition (8.3.21) holds, in particular, for the uniform
density on X. So the projection estimation procedure of Theorem 8.3.3 with the uniform
density is asymptotically optimal. Note also that the above random designs have
advantages compared with the deterministic ones given in Section 8.3.3, viz., they are
simpler to construct and they possess the composite property described in Section 2.2.1.
Let us turn now to the construction and study of the least square parametric estimation
through regression function evaluations at independent random points.
Let m be fixed, the base functions f_1,...,f_m be ν-orthonormal, the points x_1,...,x_N at
which a regression function is being evaluated be random, independent and identically
distributed with a density p (with respect to the measure ν) which is ν-almost everywhere
positive (p > 0 (mod ν), ∫p dν = 1) on X, the evaluations of f be uncorrelated, and their
variance σ²(x) = Eξ²(x) be uniformly bounded.
Denote by θ = (θ_1,...,θ_m)' the vector of Fourier coefficients of f with respect to the
base functions f_1,...,f_m and set

    F(x) = (f_1(x),...,f_m(x))',   r(x) = f(x) − θ'F(x).

The estimation of the unknown parameters (i.e. the Fourier coefficients θ_i) by the least
square method is as follows. Supposing that f is a linear combination of the functions
f_1,...,f_m, let us derive the simultaneous equations

    Σ_{i=1}^m θ_i f_i(x_j) = y(x_j),   j = 1,...,N,

multiply now the j-th equation by (p(x_j))^{−1/2} and obtain Aθ = Y, where A is the N×m
matrix with elements f_i(x_j)(p(x_j))^{−1/2} and Y = (y(x_1)(p(x_1))^{−1/2},...,y(x_N)(p(x_N))^{−1/2})'.

Suppose now that the matrix A'A is nondegenerate. Then the least square estimate of θ
will have the form

    θ̂ = (A'A)^{−1} A'Y.   (8.3.23)

Since, in general, f cannot be represented as a linear combination of f_1,...,f_m, the
estimate (8.3.23) may be biased. The main purpose of the result given below is the
estimation of the inaccuracy of (8.3.23).
According to the central limit theorem, the sequence of matrices N^{−1}A'A

converges for N → ∞ to the unit matrix I_m; the order of the convergence rate equals O(N^{−1/2}).
It follows that in the case of existence of the inverse matrix (A'A)^{−1}, the asymptotic relation

    N → ∞   (8.3.24)

is valid.
Consider now the case of a ν-regular system of functions f_1,...,f_m with respect to the
measure ν, i.e. such a system that for all N ≥ m the measure of the set of points {x_1,...,x_N}
for which the matrix A'A is degenerate equals zero.

Theorem 8.3.4. Let f, f_i ∈ L_2(X, ν), i = 1,...,m, m be fixed, N → ∞, and the collection
of ν-orthonormal functions f_1,...,f_m be regular. Then for the estimate (8.3.23) the
asymptotic representations

    Eθ̂ = θ + N^{−1} ∫ r(x) F(x) { Σ_{i=1}^m f_i²(x)/p(x) } ν(dx) + O(N^{−2}),   (8.3.25)

    cov θ̂ = N^{−1} ∫ ((r²(x) + σ²(x))/p(x)) F(x)F'(x) ν(dx) + O(N^{−2})   (8.3.26)

are valid for N → ∞.

The proof can be found in Korjakin (1983) or in Zhigljavsky (1985).
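The estimate (8.3.23) is easy to reproduce numerically. The following sketch forms the matrix A with entries f_i(x_j)(p(x_j))^{−1/2} and the vector Y with components y(x_j)(p(x_j))^{−1/2}, computes the least square estimate, and also shows that substituting the limiting value N^{−1}I_m for (A'A)^{−1} gives back the Monte Carlo estimate (8.3.18). The regression function, basis, noise level, and uniform density are illustrative assumptions.

```python
import numpy as np

# Least square estimate (8.3.23), theta_ls = (A'A)^{-1} A'Y, with
# evaluations at i.i.d. points drawn from the uniform density p = 1
# on X = [0, 1]; f, the basis, and the noise level are illustrative.
rng = np.random.default_rng(1)

def f(x):
    return np.sin(2.0 * np.pi * x) + 0.3 * np.cos(3.0 * np.pi * x)

def basis(i, x):
    return np.ones_like(x) if i == 0 else np.sqrt(2.0) * np.cos(np.pi * i * x)

N, m = 5_000, 6
x = rng.uniform(0.0, 1.0, N)
y = f(x) + rng.normal(0.0, 0.1, N)      # uncorrelated evaluation errors

p_inv_sqrt = 1.0                        # (p(x_j))^{-1/2} for the uniform p
A = np.stack([basis(i, x) for i in range(m)], axis=1) * p_inv_sqrt
Y = y * p_inv_sqrt

theta_ls, *_ = np.linalg.lstsq(A, Y, rcond=None)   # (A'A)^{-1} A'Y

# substituting N^{-1} I_m for (A'A)^{-1} recovers the standard
# Monte Carlo estimate (8.3.18)
theta_mc = A.T @ Y / N
```

The two estimates differ only by a term that vanishes as N grows, in line with the convergence of N^{−1}A'A to I_m.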

Let us comment on the above assertion. First, if for the inverse matrix (A'A)^{−1} in
(8.3.23) its initial approximation N^{−1}I_m is substituted, then the standard Monte Carlo
estimate (8.3.18) is obtained. Second, if p is chosen as

    p(x) = m^{−1} Σ_{i=1}^m f_i²(x),   (8.3.27)

then from (8.3.25) we obtain

    Eθ̂ = θ + O(N^{−2}),   N → ∞.

It is not difficult to show that for the density (8.3.27) the exact unbiasedness

    Eθ̂ = θ

also takes place. Third, analogously with the case when the Fourier coefficients are
estimated by the standard Monte Carlo method, the minimization problem of a convex
functional of the matrix

    ∫ ((r²(x) + σ²(x))/p(x)) F(x)F'(x) ν(dx) = N cov θ̂ + O(N^{−1}),   N → ∞,

with respect to p can be stated. Ignoring the biasedness of the estimates obtained, the
results will be similar to those stated in Section 8.2.
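The exact unbiasedness under the density (8.3.27) can be observed numerically. The sketch below samples the design points from (8.3.27) by rejection sampling (our implementation choice, not prescribed by the text) and averages the least square estimate (8.3.23) over replications; the regression function, basis dimension, and replication counts are illustrative assumptions.

```python
import numpy as np

# Design density (8.3.27), p(x) = m^{-1} sum_i f_i^2(x), sampled by
# rejection from a uniform envelope, plus a replication check that the
# least square estimate (8.3.23) under this density is unbiased even
# though the illustrative f below is not in the span of f_1,...,f_m.
rng = np.random.default_rng(2)
m = 3

def basis_matrix(x):
    # v-orthonormal basis {1, sqrt(2)cos(pi x), sqrt(2)cos(2 pi x)} on [0, 1]
    return np.stack([np.ones_like(x),
                     np.sqrt(2.0) * np.cos(np.pi * x),
                     np.sqrt(2.0) * np.cos(2.0 * np.pi * x)], axis=1)

def p(x):                                  # density (8.3.27); here p <= 5/3
    return (basis_matrix(x) ** 2).sum(axis=1) / m

def sample_p(n):
    out = np.empty(0)
    while out.size < n:                    # rejection sampling
        cand = rng.uniform(0.0, 1.0, 4 * n)
        keep = rng.uniform(0.0, 5.0 / 3.0, 4 * n) < p(cand)
        out = np.concatenate([out, cand[keep]])
    return out[:n]

def f(x):                                  # not in the span of the basis
    return np.sin(2.0 * np.pi * x)

# true Fourier coefficients of this f: (0, 4*sqrt(2)/(3*pi), 0)
theta_true = np.array([0.0, 4.0 * np.sqrt(2.0) / (3.0 * np.pi), 0.0])

reps, N = 300, 200
est = np.zeros((reps, m))
for t in range(reps):
    x = sample_p(N)
    y = f(x) + rng.normal(0.0, 0.1, N)     # uncorrelated errors
    w = 1.0 / np.sqrt(p(x))                # rows scaled by p(x_j)^{-1/2}
    A = basis_matrix(x) * w[:, None]
    est[t], *_ = np.linalg.lstsq(A, y * w, rcond=None)

bias = est.mean(axis=0) - theta_true       # close to 0: exact unbiasedness
```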
Note finally that if the function collection {f_1,...,f_m} is ν-regular, N = km, the so-
called Ermakov-Zolotukhin points that have the joint density (see Ermakov (1975))

for k groups of m points, and the least square estimates of the Fourier coefficients of f
with respect to f_1,...,f_m are used, then by virtue of results in Ermakov (1975) one has

    (N → ∞, m → ∞)

for the summed inaccuracy (8.3.7); for details see Zhigljavsky (1985). Comparing this
estimate with (8.3.14), we obtain that the Ermakov-Zolotukhin points cannot generally be
included into an asymptotically optimal projection estimation procedure.
REFERENCES

Aluffi-Pentini F., Parisi V. and Zirilli F. (1985) Global optimization and stochastic
differential equations. J. Optimiz. Theory and Applic., 47, No.1, 1-16.

Anderson A. and Walsh G.R. (1986) A graphical method for a class of Branin trajectories.
J. Optimiz. Theory and Applic., 49, No.3,367-374.
Anderssen R.S. and Bloomfield P. (1975) Properties of the random search in global
optimization. J. Optimiz. Theory and Applic., 16, No.5/6, 383-398.
Anily S. and Federgruen A. (1987) Simulated annealing methods with general acceptance
probabilities. J. Appl. Probab., 24, No.3, 657-667.

Archetti F. and Betro B. (1979) A probabilistic algorithm for global optimization. Calcolo,
16, No.3, 335-343.

Archetti F. and Betro B. (1980) Stochastic models and optimization. Bolletino della
Unione Matematica Italiana, 17-A, No.5, 225-301.
Archetti F. and Schoen F. (1984) A survey on the global optimization problem: general
theory and computational approaches. Ann. Oper. Res., 1,87-110.
Ariyawansa K.A. and Templeton J.G.C. (1983) On statistical control of optimization.
Optimization, 14, No.2, 393-410.
Avriel M. (1976) Nonlinear Programming: Analysis and Methods, Prentice-Hall,
Englewood Cliffs, New Jersey e.a.
Baba N. (1981) Convergence of a random optimization method for constrained
optimization problems. J. Optimiz. Theory and Applic., 33, No.4, 451-461.
Basso P. (1982) Iterative methods for the localization of the global maximum. SIAM J.
Numer. Anal., 19, No.4, 781-792.
Bates D. (1983) The derivative of X'X and its uses. Technometrics, 25, No.4, 373-376.
Batishev D.I. (1975) Search Methods of Optimal Construction. Soviet Radio, Moscow,
216 p. (in Russian).
Batishev D.I. and Lyubomirov A.M. (1985) Application of pattern recognition methods to
searching for a global minimum of a multivariate function. Problems of Cybernetics,
Vol. 122 (ed. V.V. Fedorov), 46-60 (in Russian).
Beale E.M.L. and Forrest J.J.H. (1978) Global optimization as an extension of integer
programming. In: Towards Global Optimization Vol.2. North Holland, Amsterdam e.a.,
131-149.

Bekey G.A. and Ung M.T. (1974) A comparative evaluation of two global search
algorithms. IEEE Trans. on Systems, Man and Cybernetics, No.1, 112-116.

Berezovsky A.1. and Ludvichenko V.A. (1984) Sequential searching algorithms of


minimal values of differential functions from some classes. Numerical Methods and
Optimization. Kiev, 27-32 (in Russian).
Betro B. (1983) A Bayesian nonparametric approach to global optimization. Methods of
Oper. Res., 45, No.1, 45-79.
Betro B. (1984) Bayesian testing of nonparametric hypothesis and its application to global
optimization. J. Optimiz. Theory and Applic., 42, No.1, 31-50.

Betro B. and Schoen F. (1987) Sequential stopping rules for the multistart algorithm in
global optimization. Mathern. Programming, 38, No.2, 271-286.

Betro B. and Vercellis C. (1986) Bayesian nonparametric inference and Monte Carlo
optimization. Optimization, 17, No.5, 681-694.

Betro B. and Zielinski R. (1987) A Monte Carlo study of a Bayesian decision rule
concerning the number of different values of a discrete random variable. Commun. Statist.:
Simulation, 16, No.4, 925-938.

Bhanot G. (1988) The Metropolis algorithm. Rep. Progr. Phys., 429-457.

Billingsley P. (1968) Convergence of Probability Measures. Wiley, N.Y. e.a.


Blum Z.R. et al. (1959) Central limit theorems for interchangeable processes. Canadian J.
Mathem., 10, 222-229.

Boender C.G.E., and Zielinski R. (1985) A sequential Bayesian approach to estimating the
dimension of a multinomial distribution. In: Sequential Methods in Statistics. Banach
Center Publications, Vo1.16, Polish Scientific Publishers, Warsaw, 37-42.
Boender G., Rinnooy Kan A., Stougie L., and Timmer G. (1982) A stochastic method for
global optimization. Mathem. Programming, 22, No.1, 125-140.
Boender C.G.E. (1984) The generalized multinomial distribution: A Bayesian analysis and
applications. Ph. D. Thesis, Erasmus University, Rotterdam.
Boender C.G.E. and Rinnooy Kan A.H.G. (1987) Bayesian stopping rules for multistart
global optimization methods. Mathem. Programming, 37, No.1, 59-80.

Bohachevsky I.O., Johnson M.E., and Stein M.L. (1986) Generalized simulated annealing
for function optimization. Technometrics, 28, No.3, 209-217.
Branin F.H. (1972) A widely convergent method for finding multiple solutions of
simultaneous non-linear equations. IBM J. Res. Develop., 16, 504-522.
Branin F.H. and Hoo S.K. (1972) A method for finding multiple extrema of a function of
n variables. In: Numerical Methods for Nonlinear Optimization. Academic Press, London
e.a., 231-237.

Bremerman H.A. (1970) A method of unconstrained global optimization. Mathem.
Biosciences, 9, No.1, 1-15.
Brent R.P. (1973) Algorithms for Minimization without Derivatives. Prentice-Hall, New
Jersey, 195 p.

Brooks S.H. (1958) Discussion of random methods for locating surface maxima. Oper.
Res., 6, 244-251.

Brooks S.H. (1959) A comparison of maximum-seeking methods. Operations Research,
7, 430-457.

Bulatov V.P. (1987) Methods of solving of multiextremal problems. In: Methods of


Numerical Analysis and Optimization (eds B.A. Beltjukov, and V.P. Bulatov), Nauka,
Novosibirsk, 133-157 (in Russian).

Chen J., and Rubin H. (1986) Drawing a random sample from a density selected at
random. Comput. Statist. and Data Analys. 4, No.4, 219-227.

Chernousko F.L. (1970) On the optimal search of extremum for unimodal functions.
USSR Comput. Mathem. and Mathem. Phys., 10, No.4, 922-933.
Chichinadze V.K. (1967) Random search to determine the extremum of a function of
several variables. Engineering Cybernetics, No.5, 115-123.

Chichinadze V.K. (1969) The Ψ-transform for solving linear and nonlinear programming
problems. Automatica, 5, No.3, 347-355.
Chuyan O.R. (1986) Optimal one-step maximization of twice differentiable functions.
USSR Comput. Mathem. and Mathem. Phys., 26, No.3, 381-397.
Clough D.J. (1969) An asymptotic extreme value sampling theory for the estimation of a
global maximum. Canad. Oper. Res. Soc. J., 7, No.1, 105-115.
Cohen J.P. (1986) Large sample theory for fitting an approximate Gumbel model to
maxima. Sankhya, A48, 372-392.
Cook P. (1979) Statistical inference for bounds of random variables. Biometrika, 66,
No.2, 367-374.
Cook P. (1980) Optimal linear estimation of bounds of random variables. Biometrika, 67,
No.1, 257-258.

Corana A., Marchesi M., Martini C., and Ridella S. (1987) Minimizing multimodal
functions of continuous variables with the simulated annealing algorithm. ACM Trans. on
Mathem. Software, 13, No.3, 262-280.

Corles C. (1975) The use of regions of attraction to identify global minima. In: Towards
Global Optimization, Vol.1, North Holland, Amsterdam e.a., 55-95.

Crowder H.P., Dembo R.S. and Mulvey J.M. (1978) Reporting computational
experiments in mathematical programming. Mathem. Programming, 15, 316-329.

Csendes T. (1985) A simple but hard-to-solve global optimization test problem. Presented
at the IIASA Workshop on Global Optimization (held in Sopron, Hungary).

Csorgo S. and Mason D.M. (1989) Simple estimators of the endpoint of a distribution. In:
Extreme Value Theory, Oberwolfach, 1987 (eds. Hüsler J. and Reiss R.-D.), Lecture Notes
in Statistics, Springer, Berlin e.a.

Csorgo S., Deheuvels P. and Mason D.M. (1985) Kernel estimates of the tail index of a
distribution. Ann. Statist., 13, No.3, 1050-1077.

Dannenbring D.G. (1977) Procedures for estimating optimal solution values for large
combinatorial problems. Management Science, 23, 1273-1283.

de Biase L. and Frontini F. (1978) A stochastic method for global optimization: its
structure and numerical performance. In: Towards Global Optimization Vo1.2. North
Holland, Amsterdam e.a., 85-102.

de Haan L. (1970) On Regular Variation and its Application to the Weak Convergence of
Sample Extremes. North Holland, Amsterdam e.a., 104 p.

de Haan L. (1981) Estimation of the minimum of a function using order statistics. J. Amer.
Statist. Assoc., 76, No.374, 467-475.

Deheuvels P. (1983) Strong bounds for multidimensional spacings. Z. Wahrsch. verw.
Gebiete, 64, 411-424.

Dekkers A.L.M. and de Haan L. (1987) On a consistent estimate of the index of an extreme
value distribution. Rept. Cent. Math. and Comput. Sci., No. MS-R8710, 1-15.

Demidenko E.Z. (1989) Optimization and Regression. Nauka, Moscow (in Russian).

Demyanov V.F. and Vasil'ev L.V. (1985) Nondifferentiable Optimization. Springer-


Verlag, New York e.a., 452 p.

Dennis J.E. and Schnabel R.B. (1983) Numerical Methods for Unconstrained Optimization
and Nonlinear Equations. Prentice-Hall, Englewood Cliffs, New Jersey e.a.

Devroye L. (1978) Progressive global random search of continuous functions. Mathem.
Programming, 15, 330-342.

Devroye L. (1986) Nonuniform Random Variate Generation. Springer, Berlin e.a., 843 p.

Devroye L. and Gyorfi L. (1985) Nonparametric Density Estimation: the Ll View. Wiley,
New York e.a.

Devroye L.P. (1976) On the convergence of statistical search. IEEE Transactions on
Systems, Man and Cybernetics, 6, 46-56.

Dixon L.C.W. and Szego G.P., eds (1975) Towards Global Optimization, Vol.1. North
Holland, Amsterdam e.a., 472 p.
Dixon L.C.W. and Szego G.P., eds (1978) Towards Global Optimization, Vol.2. North
Holland, Amsterdam e.a., 364 p.

Dorea C.C.Y. (1983) Expected number of steps of a random optimization method. J.


Optimiz. Theory and Applic., 39, No.2, 165-171.

Dorea C.C.Y. (1986) Limiting distribution for random optimization methods. SIAM J.
Control and Optimization, 24, No.1, 76-82.

Dorea C.C.Y. (1987) Estimation of the extreme value and the extreme points. Ann. Inst.
Statist. Mathern., 39, No.1, 37-48.

Dunford N. and Schwartz J.T. (1958) Linear Operators, Part 1.: General Theory,
Interscience Publishers, New York e.a.

Duran B.S., and Odell P.L. (1977) Cluster Analysis: A Survey. Springer, Berlin e.a.
Ermakov S.M. (1975) Monte Carlo Methods and Related Problems. Nauka, Moscow, 472
p. (in Russian).

Ermakov S.M. (1983) Mathematical Theory of Experimental Design. Nauka, Moscow,


392 p.(in Russian).
Ermakov S.M. and Mitioglova L.V. (1977) On a global search method based on estimation
of the covariance matrix. Automatika and Computers, No.5, 38-41 (in Russian).
Ermakov S.M. and Zhigljavsky A.A. (1983) On the global search of a global extremum.
Theory of Probab. and Applic., 28, No.1, 129-136.
Ermakov S.M. and Zhigljavsky A.A. (1984) Nonparametric estimation and the
asymptotical optimum design of an experiment. Vestnik of Leningrad University, No.7,
20-27.
Ermakov S.M. and Zhigljavsky A.A. (1985) Monte Carlo method for estimating
functionals of eigen-measures of linear integral operators. USSR Comput. Mathem. and
Mathem. Phys., 25, No.5, 666-679.
Ermakov S.M. and Zhigljavsky A.A. (1987) Mathematical Theory of Optimal
Experiments. Nauka, Moscow, 320 p.(in Russian).
Ermakov S.M., Zhigljavsky A.A. and Kondratovich M.V. (1988) Reduction of the
problem of random estimation of the function extremum value. Dokl. Akad. Nauk SSSR,
302, No.4, 796-798.
Evans O.H. (1963) Applied multiplex sampling. Technometrics, 5, No.3, 341-359.

Evtushenko Yu.G. (1985) Numerical Optimization Techniques. Springer, Berlin e.a.



Evtushenko Yu.G. (1971) Algorithm for finding the global extremum of a function (case of
a non-uniform mesh). USSR Comput. Mathem. and Mathem. Phys., 11, No.6, 1390-1403.
Falk M. (1983) Rates of uniform convergence of extreme order statistics. Ann. Instit.
Statist. Mathem., 38, part A, No.2, 245-265.

Faure H. (1982) Discrépance de suites associées à un système de numération (en
dimension s). Acta Arithm., 41, 337-351.

Fedorov V.V. (1979) Numerical Maximin Methods. Nauka, Moscow, 280 p. (in Russian).

Fedorov V.V., ed. (1985) Problems of Cybernetics, No. 122: Models and Methods of
Global Optimization. USSR Academy of Sciences, Moscow (in Russian).

Feller W. (1966) An Introduction to Probability Theory and its Applications, Vol.2,


Wiley, New York e.a.

Fletcher R. (1980) Practical Methods of Optimization. Wiley, New York e.a.

Gabrielsen G. (1986) Global maxima of real-valued functions. J. Optimiz. Theory and


Applic., 50, No.2, 257-266.

Galambos J. (1978) The Asymptotic Theory of Extreme Order Statistics. Wiley, New York
e.a.

Galperin E.A. (1988) Precision, complexity and computational schemes of the cubic
algorithm. J. Optimiz. Theory and Applic., 57, No.2, 223-238.

Galperin E.A. and Zheng Q. (1987) Nonlinear observation via global optimization
methods: measure theory approach. J. Optimiz. Theory and Applic., 54, No.1, 63-92.
Ganshin G.S. (1976) Calculation of the maximal value of functions. USSR Comput.
Mathem. and Mathem. Phys., 16, No.1, 30-39.
Ganshin G.S. (1977) Optimal algorithms of calculation of the highest value of functions.
USSR Comput. Mathem. and Mathem. Phys., 17, No.3, 562-571.

Ganshin G.S. (1979) The optimization of algorithms on classes of nets. USSR Comput.
Mathem. and Mathem. Phys., 19, No.14, 811-821.

Gaviano M. (1975) Some general results on the convergence of random search algorithms
in minimization problems. In: Towards Global Optimization, Vol.1, North Holland,
Amsterdam e.a., 149-157.
Ge R.P. (1983) A filled function method for finding a global minimizer. Presented at the
Dundee Biennial Conference on Numerical Analysis.
Ge R.P. and Qin Y.F. (1987) A class of filled functions for finding global minimizers of a
function of several variables. J. Optimiz. Theory and Applic., 54, No.2, 241-252.

Geman S. and Hwang C.-R. (1986) Diffusions for global optimization. SIAM J. Control
and Optimization, 24, No.5, 1031-1043.

Golden B.L. and Alt F.B. (1979) Interval estimation of a global optimum for large
combinatorial problems. Naval Res. Logistics Quarterly, 26, No.1, 69-77.

Gomulka J. (1978) Two implementations of Branin's method: numerical experience. In:


Towards Global Optimization, Vo1.2, North Holland, Amsterdam e.a., 151-163.

Griewank A.O. (1981) Generalized descent for global optimization. J. Optimiz. Theory
and Applic., 34, No.1, 11-39.

Gumbel E.J. (1958) Statistics of Extremes. Columbia University Press.

Haines L.M. (1987) The application of the annealing algorithm to the construction of exact
optimal design for linear regression models. Technometrics, 29, No.4, 439-448.

Hall P. (1982) On estimating the endpoint of a distribution. Ann. Statist., 10, No.2,
556-568.
Hansen E. (1979) Global optimization using interval analysis: the one-dimensional case. J.
Optimiz.Theory and Applic., 29, No.3, 331-334.

Hansen E. (1980) Global optimization using interval analysis: the multidimensional case.
Numer. Math., 34, 247-270.

Hansen E. (1984) Global optimization with data perturbations. Comput. Oper. Res., 11,
No.2, 97-104.

Hardy J. (1975) An implemented extension of Branin's method. In: Towards Global
Optimization, Vol.1, North Holland, Amsterdam e.a., 117-142.

Harris T.E. (1963) The Theory of Branching Processes. Springer, Berlin e.a.

Hartley H.O. and Pfaffenberger P. (1971) Statistical control of optimization. In:


Optimizing Methods in Statistics. Academic Press, New York e.a., 281-300.

Hartley H.O. and Ruud P.G. (1969) Computer optimization of second order response
surface designs. In: Statistical Computations, Proceedings of a Conference held at the
University of Wisconsin, Madison, Wisconsin, April 29-30, 1969,441-462.

Hock W. and Schittkowski K. (1981) Test Examples for Nonlinear Programming Codes.
Lecture Notes in Economics and Mathematical Systems, No.187, Springer, Berlin e.a.,
177 p.

Horst R. (1986) A general class of branch-and-bound methods in global optimization with


some new approaches for concave minimization. J. Optimiz. Theory and Applic., 51,
No.2, 271-291.

Horst R. and Tuy H. (1987) On the convergence of global methods in multiextremal


optimization. J. Optimiz. Theory and Applic., 54, No.2, 253-271.

Hua L.K. and Wang Y. (1981) Applications of Number Theory to Numerical Analysis.
Springer, Berlin e.a.

Ichida K. and Fujii Y. (1979) An interval arithmetic method for global optimization.
Computing, 23, 85-97.
Incerti S., Parisi V. and Zirilli F. (1979) A new method for solving nonlinear simultaneous
equations. SIAM J. on Numerical Analysis, 16, 779-789.
Isaac R. (1988) A limit theorem for probabilities related to the random bombardment of a
square. Acta Mathematica Hungarica, 51, No.1-2, 85-97.

Ivanov V.V. (1972) On optimal algorithms of minimization in the class of functions with
the Lipschitz condition. Information Processing, 2, 1324-1327.

Ivanov V.V., Girlin S.K. and Ludvichenko V.A. (1985) Problems and results of global
optimization for smoothing functions. Problems of Cybernetics, Vol. 122 (ed. Fedorov
V.V.), 3-13 (in Russian).

Jacobsen S. and Torabi M. (1978) A global minimization algorithm for a class of one-
dimensional functions. J. of Mathem. Analysis and Applic., 62, 310-324.

Jackson R. and Mulvey J. (1978) A critical review of comparisons of mathematical
programming algorithms and software (1953-1977). J. of Research of the National Bureau
of Standards, 83, No.6, 563-584.

Janson S. (1986) Random coverings in several dimensions. Acta Mathematica, 156,
83-118.

Janson S. (1987) Maximal spacings in several dimensions. Ann. Probab., 15, No.1,
274-280.

Kantorovich L.V. and Akilov G.P. (1977) Functional Analysis. Nauka, Moscow, 744 p.
(in Russian).

Kashtanov Yu.N. (1987) The rate of convergence towards its own distribution of an
integral operator in the generation method. Vestnik of Leningrad University, No.1, 17-21.

Katkovnik V.Ya. (1976) Linear Estimators and Stochastic Optimization Problems. Nauka,
Moscow, 488 p. (in Russian).

Khairullin R.H. (1980) On the estimation of the critical parameter of the branching
processes of a special kind. Izvestiya Vuzov, Matematika, No.8, 77-84 (in Russian).
Khasminsky R.Z. (1965) Application of random noise in optimization and recognition
problems. Problems of Information Transfer, 1, No.3, 113-117 (in Russian).

Kiefer J. (1953) Sequential minimax search for a maximum. Proc. Amer. Math. Soc., 4,
No.3, 502-506.

Kirkpatrick S., Gelatt C.D. and Vecchi M.P. (1983) Optimization by simulated annealing.
Science, 220, 671-680.

Kleibohm K. (1967) Bemerkungen zum Problem der nichtkonvexen Programmierung.
Unternehmensforschung, 11, 49-60.

Kolesnikova C.N., Kornikov V.V., Rozhkov N.N. and Khovanov N.V. (1987)
Stochastic processes with equiprobable monotone realizations as models of information
deficiency. Vestnik of Leningrad University, No.1, 21-26.

Korjakin A.I. (1983) The estimation of a function from randomized observations. USSR
Comput. Mathem. and Mathem. Phys., 23, No.1, 21-28.

Korobov N.M. (1963) Number-Theoretical Methods in Approximate Analysis. Fizmatgiz,


Moscow, 224 p. (in Russian).

Kushner H.J. (1964) A new method of locating the maximum point of an arbitrary
multipeak curve in the presence of noise. Trans. ASME, ser. D, 86, No.1, 97-105.
Kushner H.J. (1987) Asymptotic global behavior of stochastic approximation and
diffusion with slowly decreasing noise effects: global minimization via Monte Carlo.
SIAM J. Appl. Math., 47, No.1, 169-184.
Lbov G.S. (1972) Training for extremum determination of a function of variables
measured on nominal names scale. In: Second Intern. Conf. on Artificial Intelligence.
London, 418-423.

Levin V.L. (1985) Convex Analysis in Spaces of Measurable Functions and Applications
in Mathematics and Economics. Nauka, Moscow, 316 p. (in Russian).

Levy A.V. and Montalvo A. (1985) The tunneling algorithm for the global minimization of
functions. SIAM J. Sci. Statist. Comput., 6, No.1, 15-29.

Lindgren G. and Rootzen H. (1987) Extreme values: Theory and technical applications.
Scand. J. Statist., 14, No.4, 241-279.
Loeve M. (1963) Probability Theory. D. Van Nostrand, New York e.a.
Loh W.-Y. (1984) Estimating an endpoint of a distribution with resampling methods. Ann.
Statist., 12, No.4, 1543-1550.
Makarov I.M. and Radashevich Yu.B. (1981) Statistical estimation of accuracy in constrained
extremal problems. Automation and Remote Control, No.3, 41-48.
Mancini L.J. and McCormick G.P. (1979) Bounding global minima with interval
arithmetic. Oper. Res. 27, No.4, 743-754.

Mancini L.J. and McCormick G.P. (1976) Bounding global minima. Mathem. of Oper.
Res., 1, No.1, 50-53.

Mandal N.K. (1981) D-optimal designs for estimating the optimum point in a multifactor
experiment. Calcutta Statist. Assoc. Bull., 30, No.119-120, 145-169.

Mandal N.K. (1989) D-optimal designs for estimating the optimum point in a quadratic
response surface - rectangular region. J. Statistical Planning and Inference, 23, 243-252.

Marti K. (1980) On accelerations of the convergence in random search methods. Oper.


Res. Verfahren, 37, 391-406.
Masri S.F., Bekey G.A. and Safford F.B. (1980) A global optimization algorithm using
adaptive random search. Appl. Mathem. Comput., 7, 353-375.
McCormick G.P. (1972) Attempts to calculate global solutions of problems which may
have local minima. In: Numerical Methods for Nonlinear Optimization. Academic Press,
London e.a., 209-221.

McCormick G.P. (1983) Nonlinear Programming. Theory, Algorithms, and Applications.


Wiley, N.Y., 444 p.

McMurty G.J. and Fu K.S. (1966) A variable structure automaton used as a multimodal
searching technique. IEEE Trans. on Automatic Control, 11, No.3, 379-387.

Meewella C.C., and Mayne D.Q. (1988) An algorithm for global optimization of Lipschitz
continuous functions. J. Optimiz. Theory and Applic., 57, No.2, 307-322.
Metropolis N., Rosenbluth A.W., Rosenbluth M.N. and Teller A.H. (1953) Equations of
state calculations by fast computing machines. J. Chem. Phys., 21, 1087-1091.

Mikhailov G.A. (1966) Calculation of critical systems by the Monte Carlo method. USSR
Comput. Mathem. and Mathem. Phys., 6, No.1, 71-80.

Mikhailov G.A. (1984) Minimax theory of weighted Monte Carlo methods. USSR
Comput. Mathem. and Mathem. Phys., 24, No.9, 1294-1302.

Mikhailov G.A. (1987) Optimizing Weighted Monte Carlo Methods. Nauka, Moscow, 240
p. (in Russian).

Mikhailov G.A. and Zhigljavsky A.A. (1988) Uniform optimization of weighted Monte Carlo
estimates. Dokl. Akad. Nauk SSSR, 303, No.2, 290-293.

Mikhalevich V.S., Gupal A.M. and Norkin V.I. (1987) Methods of Nonsmooth
Optimization. Nauka, Moscow (in Russian).

Mitra D., Romeo F. and Sangiovanni-Vincentelli A. (1986) Convergence and finite-time


behavior of simulated annealing. Adv. Appl. Probab. 18, No.3, 747-771.
Mockus J. (1967) Multiextremal Problems in Design. Nauka, Moscow, 214 p. (In
Russian).

Mockus J. (1989) Bayesian Approach to Global Optimization. Kluwer Academic


Publisher, Dordrecht e.a.
Mockus J., Tiesis V. and Zilinskas A. (1978) The application of Bayesian methods for
seeking the extremum. In: Towards Global Optimization, Vol. 2. North Holland,
Amsterdam e.a., 117-130.
Moiseev E.V. and Nekrutkin V.V. (1987) On the rate of convergence of some random
search algorithms. Vestnik of Leningrad University, No.1, 29-34.

Moore R.E. (1966) Interval Analysis. Prentice-Hall, Englewood Cliffs, New Jersey e.a.

Myers R.H. and Khuri A.I. (1979) A new procedure for steepest ascent. Commun.
Statist., A8, No.14, 1359-1376.

Nagaraja B.N. (1982) On the non-Markovian structure of discrete order statistics. J. Statist. Planning and Inference, 7, No.1, 29-33.

Nefedov V.N. (1987) Searching the global maximum of a function with several arguments on a set given by inequalities. USSR Comput. Mathem. and Mathem. Phys., 27, No.1, 35-51.
Nelder J.A. and Mead R. (1965) A simplex method for function minimization. Computer Journal, 7, 308-313.
Neveu J. (1964) Bases Mathématiques du Calcul des Probabilités. Masson et Cie, Paris.
Niederreiter H. (1978) Quasi-Monte-Carlo methods and pseudorandom numbers. Bull. Amer. Mathem. Soc., 84, No.6, 957-1041.

Niederreiter H. (1986) Low-discrepancy point sets. Monatsh. Mathem., 102, No.2, 155-167.
Niederreiter H. and McCurley M. (1979) Optimization of functions by quasi-random
search methods. Computing, 22, 119-123.

Niederreiter H. and Peart P. (1982) A comparative study of quasi-Monte Carlo methods for
optimization of functions of several variables, Caribbean J. Math., 1, 27-44.
Pardalos P.M. and Rosen J.B. (1987) Constrained Global Optimization: Algorithms and Applications. Lecture Notes in Computer Science, 268. Springer-Verlag, Berlin Heidelberg New York.
Pickands J. III. (1975) Statistical inference using extreme order statistics. Ann. Statist., 3,
No.1, 119-131.

Pinkus M. (1968) A closed form solution of certain programming problems. Oper. Res., 16, 690-694.

Pinter J. (1983) A unified approach to globally convergent one-dimensional optimization algorithms. Research Report IAMI 83-5. Inst. Appl. Math. Inf. CNR, Milan.

Pinter J. (1984) Convergence properties of stochastic optimization procedures. Optimization, 15, No.3, 405-427.

Pinter J. (1986a) Extended univariate algorithms for n-dimensional global optimization. Computing, 36, No.1, 91-103.

Pinter J. (1986b) Globally convergent methods for n-dimensional multiextremal optimization. Optimization, 17, No.2, 187-202.

Pinter J. (1988) Branch-and-bound methods for solving global optimization problems with Lipschitzian structure. Optimization, 19, No.1, 101-110.

Piyavskii S.A. (1967) An algorithm for finding the absolute minimum of a function. Theory of Optimal Solutions, No.2. Kiev, IK AN USSR, pp. 13-24 (in Russian).

Piyavskii S.A. (1972) An algorithm for finding the absolute extremum of a function. USSR Comput. Mathem. and Mathem. Phys., 12, No.4, 888-896.

Prakasa Rao B.L.S. (1983) Nonparametric Functional Estimation. Academic Press, N.Y. e.a.

Price W.L. (1983) Global optimization by controlled random search. J. Optimiz. Theory
and Applic., 40, No.2, 333-348.

Price W.L. (1987) Global optimization algorithms for a CAD workstation. J. Optimiz. Theory and Applic., 55, No.1, 133-146.

Pronzato L., Walter E., Venot A. and Lebrichec J.P. (1984) A general purpose global
optimizer: Implementation and applications. Mathem. Comput. Simul., 26, 412-422.

Pshenichny B.N. and Marchenko D.L. (1967) On an approach to evaluation of the global minimum. Theory of Optimal Decisions, Vol. 2. Kiev, 3-12 (in Russian).

Pukelsheim F. (1980) On linear regression design which maximizes the information. J. Statist. Planning and Inference, 4, 339-364.

Pukelsheim F. (1987) Information increasing orderings in experimental design theory. International Statist. Review, 55, No.2, 203-219.

Rastrigin L.A. (1968) Statistical Search Methods. Nauka, Moscow, 376 p. (in Russian).

Ratschek H. (1985) Inclusion functions and global optimization. Mathem. Programming, 33, No.3, 300-317.

Ratschek H. and Rokne J. (1984) Computer Methods for the Range of Functions. Ellis Horwood-Wiley, New York.

Resnick S.I. (1987) Extreme Values, Regular Variation and Point Processes. Springer, Berlin e.a.

Revuz D. (1975) Markov Chains. North Holland, Amsterdam e.a., 334 p.

Rinnooy Kan A.H.G. (1987) Probabilistic analysis of algorithms. In: Surveys in Combinatorial Optimization, North Holland Mathem. Stud., 132. North Holland, Amsterdam e.a., 365-384.

Rinnooy Kan A.H.G. and Timmer G.T. (1985) A stochastic approach to global optimization. In: Numerical Optimization, 1984. SIAM, Philadelphia, Pa., 245-262.

Rinnooy Kan A.H.G. and Timmer G.T. (1987a) Stochastic global optimization methods. I. Clustering methods. Mathem. Programming, 39, No.1, 27-56.

Rinnooy Kan A.H.G. and Timmer G.T. (1987b) Stochastic global optimization methods. II. Multilevel methods. Mathem. Programming, 39, No.1, 57-78.

Robson D.S. and Whitlock J.H. (1964) Estimation of a truncation point. Biometrika, 51,
No.1, 33-39.
Rosen J.B. (1983) Global minimization of a linearly constrained concave function by partition of the feasible domain. Mathem. Oper. Res., 8, 215-230.
Sager T. (1983) Estimating modes and isopleths. Commun. Statist.: Theory and Meth., 12, No.5, 529-557.
Saltenis (1989) Analysis of Optimization Problem Structure. Mokslas, Vilnius (in Russian).
Schoen F. (1982) On a sequential search strategy in global optimization problems. Calcolo,
19, No.3, 321-334.
Schwartz L. (1967) Analyse Mathématique, V.1. Hermann, Paris.

Seneta E. (1976) Regularly Varying Functions. Lecture Notes in Mathem., 508, Springer, Berlin e.a.

Shen Z. and Zhu Y. (1987) An interval version of Shubert's iterative method for the localization of the global maximum. Computing, 38, 275-280.

Shubert B.O. (1972) A sequential method seeking the global maximum of a function. SIAM J. on Numer. Analysis, 9, No.3, 379-388.

Silverman B.W. (1983) Some properties of a test for multimodality based on kernel density estimates. London Mathem. Soc. Lecture Note Ser., No.79, 248-259.
Smith R.L. (1982) Uniform rates of convergence in extreme value theory. Adv. Appl. Probab., 14, 600-622.
Smith R.L. (1987) Estimating tails of probability distributions. Ann. Statist., 15, No.3,
1174-1207.

Snyman J.A. and Fatti L.P. (1987) A multistart global minimization algorithm with dynamic search trajectories. J. Optimiz. Theory and Applic., 54, No.1, 121-141.

Sobol I.M. (1969) Multivariate Quadrature Formulas and Haar Functions. Nauka,
Moscow, 288 p. (in Russian).
Sobol I.M. (1982) On an estimate of the accuracy of a simple multidimensional search.
Dokl. Akad. Nauk SSSR, 266, 569-572 (in Russian).

Sobol I.M. (1987) On functions satisfying a Lipschitz condition in multidimensional problems of numerical mathematics. Dokl. Akad. Nauk SSSR, 293, No.6, 1314-1319 (in Russian).

Sobol I.M. and Statnikov R.B. (1981) Optimal Choice of Parameters in Multicriteria
Problems. Nauka, Moscow, 110 p. (in Russian).

Solis F. and Wets R. (1981) Minimization by random search techniques. Mathem. Oper. Res., 6, No.1, 19-30.

Spircu L. (1978) Cluster analysis in global optimization. Economic Computation and Economic Cybernetic Studies and Research, 13, 43-50.
Strongin R.G. (1978) Numerical Methods in Multiextremal Optimization. Nauka,
Moscow, 240 p. (in Russian).
Sukharev A.G. (1971) On optimal strategies for an extremum search. USSR Comput. Mathem. and Mathem. Phys., 11, No.4, 910-924.

Sukharev A.G. (1975) Optimal Search of the Extremum. Moscow University Press, 100 p. (in Russian).

Sukharev A.G. (1981) Construction of a one-step-optimal stochastic algorithm seeking the maximum. USSR Comput. Mathem. and Mathem. Phys., 21, No.6, 1385-1401.

Suri R. (1987) Infinitesimal perturbation analysis for general discrete event systems. Journ. of ACM, 34, No.3, 686-717.
Törn A. (1978) A search-clustering approach to global optimization. In: Towards Global Optimization, Vol. 2. North Holland, Amsterdam e.a., 49-62.

Törn A. and Zilinskas A. (1989) Global Optimization. Lecture Notes in Computer Science, 350. Springer-Verlag, Berlin Heidelberg New York.

Treccani G. (1978) A global descent optimization strategy. In: Towards Global Optimization, Vol. 2. North Holland, Amsterdam e.a., 151-163.

Turchin V.F. (1971) On the calculation of multivariate integrals by the Monte Carlo
method, Theory of Probab. and Applic., 16, No.4, 738-741.

Ustyuzhaninov V.G. (1983) Possibilities of random search in solution of discrete extremal problems. Kibernetika, No.2, 64-71, 77.

van Laarhoven P.J.M. and Aarts E.H.L. (1987) Simulated Annealing: Theory and
Applications, Kluwer Academic Publishers, Dordrecht e.a., 198 p.

Vasiliev F.P. (1981) Methods for Solving Extremal Problems. Nauka, Moscow, 400 p. (in
Russian).

Vaysbord E.M. and Yudin D.B. (1968) Multiextremal stochastic approximation. Engineering Cybernetics, 5, No.1, 1-10.

Vilkov A.V., Zhidkov N.P. and Shchedrin B.M. (1975) A method for finding the global minimum of a function of one variable. USSR Comput. Mathem. and Mathem. Phys., 15, No.4, 1040-1042.

Walster G.W., Hansen E.R. and Sengupta S. (1985) Test results for a global optimization algorithm. In: Numerical Optimization (eds Boggs P.T. and Byrd R.H.). SIAM, New York, 272-287.
Watt V.P. (1980) A note on estimation of bounds of random variables. Biometrika, 67,
No.3, 712-714.
Weisman I. (1981) Confidence intervals for the threshold parameter. Commun. Statist., A10, No.6, 549-557.

Weisman I. (1982) Confidence intervals for the threshold parameter II: unknown shape parameter. Commun. Statist.: Theory and Meth., 21, 2451-2474.
Weiss L. (1971) Asymptotic inference about a density function at the end of its range. Naval Res. Logistics Quarterly, 18, No.1, 111-115.
Wood G.R. (1985) Multidimensional bisection and global minimization. Presented at the
IIASA Workshop on Global Optimization. (Held in Sopron, Hungary).
Yakowitz S.J. and Fisher L. (1973) On sequential search for the maximum of an unknown
function. J. Math. Anal. and Appl., 41, No.1, 234-259.
Yamashita H. (1979) A continuous path method of optimization and its application to global optimization. In: Survey of Mathematical Programming, Vol.1/A, Budapest, 539-546.
Zaliznyak N.F. and Ligun A.A. (1978) On optimum strategy in search of global maximum of function. USSR Comput. Mathem. and Mathem. Phys., 18, No.2, 314-321.
Zanakis S.H. and Evans J.R. (1981) Heuristic optimization: why, when and how to use it. Interfaces, 11, No.5, 84-91.

Zang I. and Avriel M. (1975) On functions whose local minima are global. J. Optimiz. Theory and Applic., 16, No.3/4, 183-190.

Zhidkov N.P. and Shchedrin B.M. (1968) A certain method of search for the minimum of a function of several variables. Computing Methods and Programming, Moscow University Press, V.10, 203-210 (in Russian).
Zhigljavsky A.A. (1981) Investigation of probabilistic global optimization methods. Candidate's Thesis. Leningrad University (in Russian).

Zhigljavsky A.A. (1985) Mathematical Theory of Global Random Search. Leningrad University Press, 296 p. (in Russian).

Zhigljavsky A.A. (1987) Monte Carlo methods for estimating functionals of the supremum type. Doctoral Thesis. Leningrad University (in Russian).

Zhigljavsky A.A. (1988) Optimal designs for estimating several integrals. In: Optimal Design and Analysis of Experiments (eds Y. Dodge, V.V. Fedorov and H.P. Wynn). North Holland, Amsterdam e.a., 81-95.

Zhigljavsky A.A. and Terentyeva M.V. (1985) Statistical methods in global random search. Testing of statistical hypotheses. Vestnik of Leningrad University, No.15, 89-91.
Zielinski R. (1980) Global stochastic approximation: a review of results and some open problems. In: Numerical Techniques for Stochastic Systems (eds Archetti F. and Cugiani M.). North Holland, Amsterdam e.a., 379-386.
Zielinski R. (1981) A statistical estimate of the structure of multiextremal problems.
Mathern. Programming, 21, 348-356.
Zielinski R. and Neumann P. (1983) Stochastische Verfahren zur Suche nach dem Minimum einer Funktion. Akademie-Verlag, Berlin, 136 p.
Zilinskas A. (1978) Optimization of one-dimensional multimodal functions. Algorithm AS
133. Applied Statistics, 23, 367-375.
Zilinskas A. (1981) Two algorithms for one-dimensional multimodal minimization. Optimization, 12, No.1, 53-63.
Zilinskas A. (1982) Axiomatic approach to statistical models and their use in multimodal optimization theory. Mathem. Programming, 22, No.1, 104-116.
Zilinskas A. (1984) On justification of use of stochastic functions for multimodal
optimization models. Ann. Oper. Res. 1, 129-134.
Zilinskas A. (1986) Global Optimization: Axiomatics of Statistical Models, Algorithms, Application. Mokslas, Vilnius, 166 p. (in Russian).
SUBJECT INDEX

accuracy with respect to function argument values 2


algorithm
descent 21
deterministic 8
direct search 22
Evtushenko 46
generalized gradient 22
grid 35
Markovian 93
one-step optimal 72
optimal grid 49
optimal randomized 51
probabilistic 8
randomized 51
sequentially best 50
Strongin's 57
annealing process 95
asymptotically optimal 49
linear estimator 127
Baba's algorithm 101
bias inaccuracy 310
bisection method 22
branch and bound method 148
branch and probability bound method 148
branching of optimization region 64
Branin's method 32
candidate points method 24
class of continuous functions 3
class of measurable functions 3
class of uniextremal functions 4
cluster analysis 24
composite 40
computational effort 1
conjugate direction 22
consistency condition 247
constrained optimization 219
convergence
domain 16
speed 16
convex programming 21
coordinate-wise search 22
cover 35
covering
nonuniform random 84
sequential 36
covering method 35


criterion
prospectiveness 147
cubic grid 40
cumulative distribution function neutral to the right 179
cyclic coordinate-wise descent 22
decreasing randomness 172
dependent sampling procedures 119
descent algorithm 21
deterministic algorithm 8
direct search algorithms
first order 22
second order 22
discrepancy 39
discrete optimization 233
discrete problems 3
dispersion 38
distribution sampling 219
distribution
extreme value 126
Gibbs 96
quasi-uniform 230
domain convergence 16
domination 168
E-optimal 72
estimate
optimal consistent 247
optimal consistent unbiased 247
estimator
asymptotically optimal linear 127
Evtushenko algorithm 46
experimental design 18
extremal experiment algorithm 285
extreme order statistics 240
extreme value distribution 126
feasible region 1
filled function 27
function
cumulative distribution 179
filled 27
homogeneous 124
tunneling 27
functional class 3
general statement of the optimization problem 4
generalized gradient algorithm 22
Gibbs distribution 96
global minimization method 1
global minimization problem 1
global minimizer 2
global stochastic approximation 99
gradient 4
gradient method 22

grid 35
algorithm 35
composite 40
cubic 40
Hammersley-Halton 41
Lattice 41
Πτ 41
quasirandom 41
random 40
rectangular 40
stratified sample 42
guaranteed accuracy 49
Hammersley-Halton grid 41
heavy ball 31
Hessian 4
Halton sequence 41
homogeneous function 124
inaccuracy 16
index
tail 126
infinite dimensional 3
initial point 284
interval method 66
interval variables 67
Lattice grid 41
level surfaces 118
linear estimator 127
local minimizer 21
local optimization problem 21
local optimization theory 4
Markovian algorithm 93
Markovian property 93
mathematical programming 21
maximization problem 1
method of generations 190
with constant number of particles 204
method
random multistart 174
branch and bound 148
branch and probability bound 148
Branin's 32
candidate points 24
covering 35
interval 66
Metropolis 94
multi-level single linkage 26
nearest neighbour 25
polygonal line 45
Metropolis method 94
mode 118

mode domain 118


multi-level single linkage method 26
multidimensional problem 3
multiextremal stochastic approximation 99
multiple criteria 231
multistep dimension reduction 59
multivariate maximal spacing 79
nearest neighbour method 25
Newton method 22
nonparametric regression function estimation 308
nonuniform random covering 84
numerical comparison 15
objective function 1
one dimensional problem 3
one-step optimal algorithm 72
optimal 48
optimal consistent estimate 247
optimal consistent unbiased estimate 247
optimal grid algorithm 49
optimal randomized algorithm 51
optimal
asymptotically 49
in order 49
one-step 72
optimality 17
optimization
constrained 219
discrete 233
P-optimal 72
Pareto set 231
partition of optimization region 64
Peano curve type mapping 60
polygonal line method 45
presence of random noise 4
prior information 1
probabilistic algorithm 8
projection estimate 309
property
Markovian 93
prospectiveness criterion 147
Πτ-grids 41
quasi-uniform distribution 230
quasirandom 41
quasirandom points 41
ρ-dispersion 38
random coordinate-wise search 22
random grid 40
random inaccuracy 310
random m-gradient method 22
random multistart method 174

random search
uniform 77
with learning 22
with uniform trial 22
randomized algorithm 51
rectangular grid 40
regression experiment 308
regression function 284
regression
nonparametric function estimation 308
Renyi representation 241
sample
stratified 156
screening experiment 61
search direction 21, 284
search
coordinate-wise 22
random coordinate-wise 22
separable function 6
sequential covering 36
sequentially best algorithm 50
simulated annealing 95
simulation models 18
speed of convergence 16
steepest descent method 22
step length 21, 284
stochastic
differential equation 98
global approximation 99
multiextremal approximation 99
stochastic approximation 284
stochastic programming 88
stratified sample 156
stratified sample grid 42
Strongin's algorithm 57
tail index 126, 241
tunneling function 27
unbiasedness requirement 247
uniform random covering 83
uniform random search 77
upper bound 239
upper bound random variable 123
variable
upper bound random 123
variable-metric method 22
yearly maximum approach 239
