Comparison of Genetic Algorithms With Conjugate Gradient Methods. Jack Bosworth, 1972. (19720022876)
Report Documentation Page
Report No.: NASA CR-2093
Key Words: Function optimization; Mathematical programming
Distribution Statement: Unclassified - Unlimited
Security Classification (of this report / of this page): Unclassified / Unclassified
For sale by the National Technical Information Service, Springfield, Virginia 22151
I. Introduction
points of the space at which the function attains its optimum (minimum
or maximum) values. A direct search algorithm for solving such an
optimization problem is an iterative step-by-step procedure which samples
a number of points in the space until a point is found which is apparently
optimum.
Function optimization problems requiring direct search algorithms
arise from the general area of the design of optimal control systems
(Athans and Falb (1966)). The optimal point of view, when applied to
the control of aerospace vehicles or chemical processing plants, for
example, involves control systems which perform optimally according to
some pre-determined criteria of performance. Often the design of such
systems leads to function optimization problems which cannot be solved
analytically and therefore necessitate direct search algorithms for their
solution (Kalman, Falb, Arbib (1969), Lavi and Vogl (1965)).
In many control applications, however, not enough is known about the
plant (controlled system) behavior to formulate beforehand a realistic
optimal control problem. In this case, one may design a control system
from the adaptive control point of view (Bellman (1959), Mishkin and Braun
(1961), Fel'dbaum (1966), Sworder (1966)). An adaptive control system
attempts to optimize the performance of the plant "on line", i.e., the
controller attempts continually to improve the plant's performance, its
actions being based upon its record of past plant responses to control
inputs and environmental disturbances. An adaptive controller must possess,
as essential subcomponents, direct search algorithms which can direct the
search toward optimum points of the criterion function (Wilde (1964),
Hall and Ratz (1967)).
Thus the successful design of optimal and adaptive control systems
to any other plan, i.e., it sustains only a finite loss over infinite
is a severe test for the genetic methods since on the one hand they do not
extreme efficiency for these functions. Thus from this point of view
one may expect relatively inferior performance from the genetic methods.
methods can be more efficient than fixed step size gradient methods. Since
Hollstien claims superior performance for his methods over those of
Rastrigin, this opens the possibility that genetic methods can compete
favorably with the conjugate gradient methods (which are themselves more
powerful than the fixed step size gradient methods).
II. Description of Program
f: R^n → R.
The function value associated with a string is just the value of the
function at the point which the string represents; the value associated
with the above string is f(.1, 1.3, -.4, .02) (not
Version I
The population consisted of forty strings partitioned into four subpopulations
of ten strings each. Only one inversion pattern was associated with each subpopulation.
I.e., any two strings in the same subpopulation had the same associated
inversion pattern. A vector called the utility vector was maintained
giving the function value of each string.
Cross-over operated on two strings, say a1a2a3a4a5 and b1b2b3b4b5, about
two randomly chosen pivot points, with pivot points 2 and 4 say. The
resulting strings are a1b2b3b4a5 and b1a2a3a4b5.
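In modern notation this two-pivot cross-over amounts to the short Python sketch below (our own code and names, not the report's listing); pivot points are 1-based and the exchanged segment includes both pivots.

    def cross_over(a, b, p1, p2):
        # exchange the alleles between and including the two pivot points
        a, b = list(a), list(b)
        lo, hi = min(p1, p2) - 1, max(p1, p2)      # 1-based pivots -> 0-based slice
        a[lo:hi], b[lo:hi] = b[lo:hi], a[lo:hi]
        return a, b

    # cross_over(['a1','a2','a3','a4','a5'], ['b1','b2','b3','b4','b5'], 2, 4)
    # returns ['a1','b2','b3','b4','a5'] and ['b1','a2','a3','a4','b5'].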
Inversion consisted of ordering the four subpopulations by their
best strings, copying the best two subpopulations into the worst two
subpopulations, and changing the inversion patterns of the copies as
follows: two pivot points were chosen randomly for each copy and all
strings in the copy were inverted about these pivot points.
Our Fletcher-Reeves method uses 2n samples for its gradient estimation and
30 samples for its one-dimensional search per iteration (n is the dimension
of the space).
4) Zero Mutation. The string is left unaltered.
For each of the forty strings one of these four methods of mutation
was chosen according to the probability vector and applied to the point
represented by the string. The utility vector (function value vector) was
initialized with the associated function values. All other parameters were
considered to be subject to experimental manipulation and initialized accordingly.
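In outline, the per-string choice of a mutation method is a weighted draw from the probability vector, roughly as in the following Python sketch (the method labels and probabilities shown are placeholders of ours, not the report's settings).

    import random

    METHODS = ["uniform", "quadratic", "cubic", "zero"]        # placeholder labels

    def choose_method(prob_vector):
        # draw one mutation method according to the current probability vector
        return random.choices(METHODS, weights=prob_vector, k=1)[0]

    prob_vector = [0.25, 0.25, 0.25, 0.25]                     # example values only
    chosen = [choose_method(prob_vector) for _ in range(40)]   # one choice per string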
Version II
Version I was modified to create Version II in the following ways:
Selection replaced the worst four strings in each subpopulation with
four strings from the same subpopulation as follows. The strings are
Cross-over was done in the same way. Note that the selection now
caused cross-over to occur between the best strings and randomly chosen
strings, rather than among the best strings themselves.
The four mutation methods of I were used except that 2) was altered
as follows:
5) Uniform Random with Variable Limits. This method was like the
old 2) but the limits between which r1, ..., rm were chosen were different
for different coordinates of the point. Let these limits be
Each string was mutated as before, but when the best string in
each subpopulation was mutated (according to the probability vector),
the mutant replaced the worst member in the subpopulation (the best string
was also saved unmutated).
The major addition to the program structure was a second level
"adaptation" routine which controlled some of the parameters previously
Version III
The major change introduced was that there was no partitioning of the popula-
tion into four distinct subpopulations. The population size was determined
dynamically.
An initialized parameter m1 was defined as the number of strings to be
mutated. Suppose the program began with m strings (m assumed not less than
m1); then the m1 strings which had the highest associated function values
among the initial m strings were chosen. These m1 strings were copied.
Each of the m1 copies was mutated using a method chosen randomly, with the
probability vector determining the frequency of selection of any given
mutation method. The mutation methods were the same as 1), 2), 3) and 2')
of Version II. Method 5) was not implemented in the Version III mutation
routine. (As before, the utility vector was updated and the history vector
was maintained.)
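One reading of this step, as a Python sketch with our own names and data layout (strings as lists of allele values, mutation methods as callables), is:

    import random

    def mutation_routine(population, utilities, m1, mutation_fns, prob_vector):
        # rank strings by utility (function value) and keep the best m1
        order = sorted(range(len(population)), key=lambda k: utilities[k], reverse=True)
        survivors = [list(population[k]) for k in order[:m1]]
        # mutate a copy of each survivor, choosing the method from the probability vector
        mutants = [random.choices(mutation_fns, weights=prob_vector, k=1)[0](list(s))
                   for s in survivors]
        return survivors + mutants        # 2*m1 strings leave the mutation routine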
The adaptation routine was essentially the same as the adaptation
routine of Version II (allowing for the differences in the structure
of the history vector). The major difference was that a weighting scheme
was introduced to evaluate method effectiveness so that a heavily weighted
method had to produce a higher percentage difference in the best function
value than a method not weighted so heavily in order to have the ratio
of the probabilities of these two methods remain the same. These weights
were initialized.
The cross-over routine was altered as follows. Let m2 be the
initialized parameter indicating the number of strings which the routine
would operate on. 2m1 (the number of strings leaving the mutation routine)
was assumed greater than or equal to m2. The best m2 strings among the
2m1 strings were chosen. Cross-over was initiated by copying the strings
present and pairing the copies randomly. Then the alleles (coordinate
values) of the string with the higher function value between and including
the pivot points were exchanged with the corresponding alleles of the
other string. Equivalently the normal cross-over operation is performed
except that the inversion pattern of the worse string is replaced by that
of the better string before the exchange is begun. After the exchange
one of the daughters receives the worse string's inversion pattern (the
other daughter inheriting the better string's pattern). For example, if
a1a2a3a4a5 with pattern 12345 and b1b2b3b4b5 with pattern 54321 are to
be crossed over, first create b5b4b3b2b1 with pattern 12345 and do the
cross-over as usual. With pivot points 2 and 4 for example, we obtain
a1b4b3b2a5 and b5a2a3a4b1. One of these is given pattern 12345 while
the other gets 54321.
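In modern terms the operation can be sketched as follows (Python; the dictionary layout, in which pattern[p] names the variable stored at position p, is our own assumption, and how the daughter carrying the worse string's pattern is then interpreted is left as in the text).

    def general_cross_over(better, worse, p1, p2):
        # rewrite the worse string's alleles in the better string's pattern
        value_of = dict(zip(worse["pattern"], worse["alleles"]))
        w = [value_of[v] for v in better["pattern"]]
        b = list(better["alleles"])
        # usual two-pivot exchange, pivots 1-based and inclusive
        lo, hi = min(p1, p2) - 1, max(p1, p2)
        b[lo:hi], w[lo:hi] = w[lo:hi], b[lo:hi]
        # one daughter is given the better string's pattern, the other the worse string's
        return ({"pattern": list(better["pattern"]), "alleles": b},
                {"pattern": list(worse["pattern"]), "alleles": w})

With the strings of the example above (pivot points 2 and 4), this returns a1b4b3b2a5 with pattern 12345 and b5a2a3a4b1 with pattern 54321, as in the text.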
The number of successive cross-overs was not held at one (as before),
but was determined by an initialized maximum bound i subject to the
constraint that the process was to be stopped if the population size reached
40. (Note that the population doubles at each successive cross-over
and 2^5 = 32, so i < 5.)
The inversion routine always produced m1 strings. Assuming the
entering population size exceeded m1, the best ⌈m1/2⌉ (the least integer
greater than or equal to m1/2) strings were chosen. Each such string was
copied and the inversion pattern of the copy was determined by randomly
chosen pivot points as before. (Production was halted when m1 strings
were produced.)
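One reading of this routine, reusing the string layout of the previous sketch, is that the chosen strings pass through unchanged and each contributes one copy with a freshly inverted pattern until m1 strings have been produced:

    import math, random

    def inversion_routine(population, utilities, m1):
        order = sorted(range(len(population)), key=lambda k: utilities[k], reverse=True)
        chosen = [population[k] for k in order[:math.ceil(m1 / 2)]]   # best ceil(m1/2)
        out = list(chosen)
        for s in chosen:
            if len(out) == m1:
                break                                    # production halted at m1 strings
            p1, p2 = sorted(random.sample(range(len(s["pattern"])), 2))
            pat = list(s["pattern"])
            pat[p1:p2 + 1] = reversed(pat[p1:p2 + 1])    # new pattern for the copy
            out.append({"pattern": pat, "alleles": list(s["alleles"])})
        return out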
Version IV
Version IV was exactly the same as Version III except that in the
mutation routine some of the original m1 strings were mutated as well.
Thus an initialized parameter m1' < m1 determined that m1' randomly chosen
strings from the original m1 strings, not including the best, were to be
mutated in the same manner as the m1 copies already produced.
III. Test Functions
2. Index

   f_2(x) = \sum_{i=1}^{40} i x_i^2

3. Index squared

   f_3(x) = \sum_{i=1}^{40} i^2 x_i^2
4. Wood

   f_4(x) = 100(x_2 - x_1^2)^2 + (1 - x_1)^2 + 90(x_4 - x_3^2)^2 + (1 - x_3)^2
            + 10.1[(x_2 - 1)^2 + (x_4 - 1)^2] + 19.8(x_2 - 1)(x_4 - 1)
5. Valleys

   f_5(x) = \sum_{i=1} [2(x_{5+i} - x_i)^2 + |x_i|]

6. Repeated Peaks
NOTE: Functions 1-5 are to be minimized, so that in the program f(x)
is replaced by -f(x) and the standard maximization format is satisfied.
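For reference, the following Python sketch (our transcription) implements functions 2 through 4, whose formulas above are unambiguous; the Valleys and Repeated Peaks functions are omitted here.

    import numpy as np

    def f2_index(x):                         # f2(x) = sum_{i=1..40} i * x_i**2
        i = np.arange(1, len(x) + 1)
        return float(np.sum(i * np.asarray(x) ** 2))

    def f3_index_squared(x):                 # f3(x) = sum_{i=1..40} i**2 * x_i**2
        i = np.arange(1, len(x) + 1)
        return float(np.sum(i ** 2 * np.asarray(x) ** 2))

    def f4_wood(x):                          # Wood's four-variable function
        x1, x2, x3, x4 = x
        return (100.0 * (x2 - x1 ** 2) ** 2 + (1.0 - x1) ** 2
                + 90.0 * (x4 - x3 ** 2) ** 2 + (1.0 - x3) ** 2
                + 10.1 * ((x2 - 1.0) ** 2 + (x4 - 1.0) ** 2)
                + 19.8 * (x2 - 1.0) * (x4 - 1.0))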
IV. Comparison of Genetic and Classical Methods
Version IV and FR1 were applied to each of the test functions 2 through
5 with the same initial set of points.
The results are shown in Tables 1 and 2. In Table 1, we record for
each test function the number of function evaluations taken by FR1 each
time the mutation routine is executed. In comparison, the number of
function evaluations taken by Version IV to achieve the same change in
function value is indicated (along with the change in value achieved).
The function value attributed to a population is that of its best string.
In Table 2a we record the total number of function evaluations taken
by the methods to reach the indicated level.

TABLE 1

TABLE 2a

TABLE 2b
(Columns: Test Function; Function value attained; Number of function
evaluations required by Version IV after FR1 hung up)
In these tables we have given both the actual number of FR1 function
evaluations and this number divided by 4. The latter is a lower bound
on the number of function evaluations were the classical Fletcher-Reeves
method (i.e., our method in its non-embedded form) to be applied to the best
point in the initial population.
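Concretely, taking the sampling counts from the footnote of Section II, one Fletcher-Reeves iteration costs

    C_{FR} = 2n + 30

function evaluations (2n for the gradient estimate, 30 for the one-dimensional search). Since FR1 in effect runs this procedure from the best point of each of the four subpopulations, its recorded totals grow roughly as

    C_{FR1} \approx 4(2n + 30)

per iteration, which is the rationale for the divided-by-4 columns.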
It may have become apparent to the reader that we face the difficulty
here of comparing the parallel operating genetic methods with the sequential
conjugate gradient methods. Our genetic algorithms must start with a number
of initial points. The Fletcher-Reeves method begins at one point. We
have observed that the rate of convergence of Fletcher-Reeves may be quite
variable depending on the nature of the current search region (for example
whether it is locally quadratic or near a sharp ridge) and the number of
iterations taken since the last re-initialization.
Clearly some kind of aggregate behavior of a method over the search
space is required for meaningful comparison. While parallel methods lend
themselves more readily to this form of analysis, little is known analytically
for either type of method in the present context. Then too, which aggregate is
to be used: the maximum rate of convergence? the average? the minimum?
What if a method fails to converge from some starting points but converges
rapidly from others?
As already indicated, our decision was to embed the Fletcher-Reeves
method in a Version II genetic program. If we ignore the effects of cross-
over, this is equivalent to applying Fletcher-Reeves to the best point in each
of the 4 subpopulations, the number of function evaluations required to
reach a given function value level being then four times the number required
by Fletcher-Reeves applied to the point which reaches this level first.
Knowing beforehand which of the four initial points would actually reach
this level first, we would need only 1/4 of the total. Thus the "divided
by four" columns of Tables 1 and 2a represent an "optimistic" estimate of
Fletcher-Reeves efficiency. This optimism will be well founded if the var-
iability of convergence is low (so that knowing which starting point is
ultimately best is unimportant) and inappropriate if the variability is in
fact high, in which case the "pessimistic" upper bound is justified.
Our results indicate that except for the behavior on the Spherical
contours, Wood, and Repeated Peaks functions there is not a vast difference
in convergence rates.
seems to get hung up in mid-course though its mutation facilities enable
it to make a recovery).
Repeated Peaks is a multiple peak function and thus should be beyond
in the fact that FR1 hangs up on the local peak on which it is initiated.
n < 78 and superior on the latter for n > 2. The comparison is in terms
value levels than was FR1. This is shown in Table 2b which gives Version IV's
behavior starting from the levels indicated in Table 2a. The latter levels
are those for which FR1's progress terminated. (This may be an artifact of
our Fletcher-Reeves realization.)
V. Evolution from Version I to Version IV
Mutation methods 3), 2') and 5) were introduced in order to bias the distribution toward
small changes. This improved convergence by helping the system move
off false resolution ridges (Wilde, 1965). Later we discovered that
adding a second level routine (Version II) to modify the biasing on the
basis of past experience considerably improved performance. Table 3a
indicates the effectiveness of the adaptation routine. Our analysis of the
reasons for the improvement obtained is as follows.
As a run progresses the best alleles must be changed less in order
to improve. For this reason, the standard deviation of a random mutation
must decrease in order to improve the probability of a better mutation.
To this end we implemented some history vectors and added a program to
adapt the mutation parameters in a Bayesian approximation. Thus, if a
smaller mutation had worked best, the standard deviation was decreased
and likewise for a larger mutation. If there had been no improvement
in function value over the period of history, the standard deviation was
halved on the assumption that it had been too large. When the parameters became
too small for the accuracy of the machine, they were reset to maximal
values.
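A minimal sketch of this rule, assuming a single scalar mutation scale and ignoring the bound the Appendix places on how much the parameter may change per adaptation, might read:

    def adapt_mutation_scale(scale, best_avg_size, improved,
                             min_scale=1e-10, max_scale=1.0):
        # best_avg_size: average magnitude of the mutations in the recent
        # generation whose best function value improved the most
        if improved:
            scale = best_avg_size      # follow the mutation size that worked best
        else:
            scale *= 0.5               # no improvement: assume the scale was too large
        if scale < min_scale:          # too small for machine accuracy
            scale = max_scale          # reset to the maximal value
        return scale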
It was apparent that the kind of mutation which worked best at
one point in a run was sometimes different for a different part of the run.
For this reason more history was kept and the probabilities of the differ-
ent mutation methods were changed. This seems to work but does not usually
give marked improvement in the performance of the system.
Non-uniform distributions worked better than uniform ones when no
adaptation was applied since there was a higher probability of small
change.
With the adaptation, uniform works best under certain conditions
since the probability of making the right size change is higher and adap-
tation can progress faster. Under different conditions the uniform is
more likely to put the adaptation parameter in a "quasi-stable" state
where change is quite smooth but too slow to be useful.
By a "quasi-stable" state we mean a situation in which the adjusted
parameter is maintained for a long period of time at a suboptimal value.
This can happen in our present system since we include no random or
regular reset. Resetting of the parameter occurs only when it has passed
below a preset limit. Thus a situation in which the parameter is not
below the preset level but is still too small to cause significant changes
in the function values of mutated points will result in a "quasi-stable"
state, since the information fed back to the adaptive routine is insufficient
to cause a directed change in the parameter setting.
We tested the more complicated variable limit mutation (5) against
the simpler quadratic mutation (2). Table 3b indicates that 5) was indeed
better than 2) on the two functions shown. However, used with the additional
TABLE 3a

Function          Value Attained     Version I    Version II
1. Spherical      -2.045E+1*                90            90
   contours       -5.28                    900          1050
                  -3.78                   1350          1175
                  -2.97                   2250          1350
                  -2.48                   2700          1440
                  -1.80                   4400          1440
                  -1.36                   5500          1525
                  -1.26                   5840          1620
                  -1.00                 10,750          1700
                  - .753                13,400          1890
                  - .472                14,400          2150
                  - .268                17,820          2700
                  - .218                28,600          2790
                  - .217               >38,300          2790
TABLE 3b

Function          Value Attained      (2)      (5)
2. Index          -700                 10       20
                  -400                 15       50
                  -300                 36       70
                  -200                 75      110
                  -100                190      170
                  - 80                260      190
                  - 60                310      220
                  - 40                550      260
                  - 20              >4200      470
4. Wood           -15.0                 7       10
                  -10.0                 9
                  - 9.0                10       20
                  - 4.0                12       40
                  - 2.0                15       60
                  - 1.0                46       70
                     .5                60       80
                     .1              >700      170
                     .01             >700      370
TABLE 3c

Function            Value Attained      (1)     3/4(1)+1/4(3)
3. Index squared    -100                  9          10
                    - 10                 34          34
                                         43          35
                    -  6                 55          41
                    -  5                 64          45
TABLE 3d
and inversion) do not play much of a role. However, when we used gradient
mutation (i.e., Fletcher-Reeves with q = 1) we found that it worked
better in conjunction with random mutation than alone (see Table 3c).
It was apparent that the mutation often changed the best string in
a given population for the worse. When we saved the best string in each
subpopulation by replacing the worst string with the mutation results
from the best string, the performance was increased several fold. (Table 3d).
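The elitist step just described can be sketched as follows (Python, our names; evaluate recomputes a string's function value so the utility vector stays current).

    def mutate_best_with_elitism(subpop, utilities, mutate, evaluate):
        best = max(range(len(subpop)), key=lambda k: utilities[k])
        worst = min(range(len(subpop)), key=lambda k: utilities[k])
        mutant = mutate(list(subpop[best]))    # mutate a copy; the best string survives
        subpop[worst] = mutant                 # the mutant replaces the worst string
        utilities[worst] = evaluate(mutant)
        return subpop, utilities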
TABLE 4

Function            Function value attained       BB        BR
1. Spherical        -500                          10         6
   contours         -400                          19        15
                    -300                          48        36
                    -200                         190        75
                    -100                        >400       190
3. Index            -10,000                       16        16
   squared          - 8,000                       30        22
                    - 6,000                       50        34
                    - 4,000                      108        75
                    - 2,000                    >2700       200
4. Wood             -15                            2         7
   function         -10                            9         9
                    - 9                           21        10
                    - 4                           32        12
                    - 2                         >600        15
TABLE 5

Function     Alleles in 1st four co-ordinates     After generation 12 only
             available in initial population      one string remained:
2. Index     1 2 1 2 3 4                           -.6048  .0976
*Clearly, for the index function, the smallest are the best.
"good" alleles in another way. Table 6 shows that Version IV without a
crossover routine was unable to achieve the ultimate performance of
Version IV using 2 crossovers per generation.
The effectiveness of inversion was similarly tested (Table 7).
subpopulations?
Doing away with subpopulations also forces the question: What
strings will be crossed over and how? Suppose that only strings having
the same inversion pattern may be crossed over. Suppose the function has
n variables. Then there are n!/2 essentially different inversion pat-
terns since any permutation of the variables is an inversion pattern,
but turning any inversion pattern end for end preserves clumpings. This
means that for functions with more than three or four variables cross-over
would take place very seldom in a set of strings which have individual
inversion patterns.

TABLE 6

Therefore a more general kind of cross-over must be
employed. As already indicated, we tried crossing over two strings with
different inversion patterns by picking two pivot points as before but
applying the pivot points to the string with the better function value
and simply exchanging the alleles involved with the corresponding alleles
of the worse string no matter where those alleles are in the string.
This kind of cross-over allows unrestricted cross-over with only a
slight computing cost to find the "corresponding alleles". It assumes
that the string with the better function value usually has the better
inversion pattern. That is, that the inversion pattern clumps the right
variables. Although this type of cross-over is only slightly different
from the first, its consequences are more difficult to predict. It seems
to be about half as effective at finding the best inversion pattern.
The results obtained from the version III and IV systems are often
uncertain because they have more than a dozen parameters. The purpose
of leaving so many parameters open was that we wished to be able to test
hypotheses which we had formulated as a result of our experience with
the version II system. These parameters have proved to be quite inter-
dependent. That is, we find that for any reasonable setting of all but
one parameter (that one being arbitrary), varying the single parameter
has a strong effect on the efficiency of the system. However, having
found the optimal value for that parameter with the others fixed,
changing some of the other parameters frequently changes the optimal
value for the one parameter by a large amount. An important project
for the future will be to chart the interrelations involved.
However, we were able to show that there were settings of Version III
parameters which yielded performance much superior to Version II (Table 8),
thus justifying the change in system structure.

TABLE 8

Function          Value Attained    Version II    Version III
VI. Conclusions
If the reader finds himself unable to formulate a clear statement of
the results of our work to date, let us assure you that we feel ourselves to be
in the same position. We have constructed a class of algorithms which
are sufficiently complex to be highly interesting, but which at the same
time are not readily amenable to analytical study and classification.
Thus, without the benefit of theoretical guidance we are reduced to
stab-in-the-dark experimentation. Moreover, since a single optimization run
takes hours to complete, our rate of progress in obtaining data is
not as fast as our enthusiasm demands.
Within these constraints, however, we have obtained suggestive results
concerning the comparative behavior of genetic and conjugate gradient
algorithms, and we have also come to some conclusions concerning the
effectiveness of various subcomponents of the genetic algorithms.
We have, for example, demonstrated optimization problems where incorporating
individually the crossover and inversion operators actually does achieve
a significant improvement in the rate of convergence (Tables 5, 6, 7).
Clearly much remains to be done in confirming or disconfirming these con-
clusions.
APPENDIX
used.
Let fi denote the function value of the best string (the one with the
highest function value) just before the ith mutation following the last
adaptation. Let f' denote the function value of the best string just before
the mutation which occurred just before the last adaptation. In the case
of the version I system i has values between 1 and ten. i has values between
In the case of subpopulations, each has its own best string and its
own average mutation for each generation. This requires double subscripting.
To avoid this we will only consider the version II system. The same methods
are used and an exact understanding of the version I system may be gained
from the program listings. Let ai denote the average of all mutations
used in the ith generation following the last adaptation. Let a' correspond
10. When a string is mutated using one of the methods which use the adaptation
parameter ℓ, a random number of coordinates are chosen to be mutated, say
i (1 ≤ i ≤ n). If the "cubic" mutation is used an r_j is chosen for the jth
coordinate if it is to be mutated, where r_j ∈ [-1,1]. The absolute values,
|r_j|, of these i numbers are averaged, the average being say b_k, where
the string mutated was the kth string to be mutated that generation. If
the "quadratic" mutation is used an r_{j1} and an r_{j2} are chosen for the jth
coordinate if it is to be mutated, where r_{j1}, r_{j2} ∈ [-1,1]; the absolute
values of these 2i numbers are averaged, the average being b_k, where the
string mutated was the kth string to be mutated that generation. Let a_i
denote the average of all the b_k in the ith generation.
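In outline (our notation), the quantities b_k and a_i of this footnote can be computed as:

    import numpy as np

    def mutation_magnitudes(r_per_string):
        # r_per_string: one array per mutated string, holding the r factors drawn
        # for its mutated coordinates (i values for the "cubic" method, 2i for
        # the "quadratic" method)
        b = [float(np.mean(np.abs(r))) for r in r_per_string]   # b_k, one per string
        a_i = float(np.mean(b)) if b else 0.0                   # generation average
        return b, a_i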
is .5 since the r_j and r_{j1} have absolute values uniformly distributed
between 0 and 1. Let a* and a_* denote respectively the maximum and minimum
new ℓ was less than an initialized constant, ℓ was reset to the value ℓ was
the next method. Let k_i be the number of strings to which the method was
applied, corresponding to the a_i (from adapting ℓ). Let k' be the number just
before the last adaptation (like a' and f'). If m strings were mutated
p could be changed by no more than one tenth in the same manner as ℓ could
be changed by no more than one half. Therefore p was normally set to p'.
The probabilities in the vector had upper and lower limits as to numerical
VII. References
Hall, C.D. and Ratz, H.C. (1967) "The Automatic Design of Fractional
Factorial Experiments for Adaptive Process Optimization" Information
and Control, 11, pp. 505-527.
Hartmanis, J. and Stearns, R.E. (1969) "Computational Complexity"
Information Sciences.
Hill, J.D. (1969) "A Search Technique for Multimodal Surfaces"
IEEE Trans. on Sys. Sci. and Cyb., SSC-3, January.
Holland, J.H. (1969) "A New Kind of Turnpike Theorem" Bulletin of
the American Math. Soc., 75, 6.
Hollstien, R.B. (1971) "Artificial Genetic Adaptation in Computer Control
Systems" Doctoral Thesis. Department of Computer Information and
Control Engineering, The University of Michigan.
Kalman, R.E.; Falb, P.L.; Arbib, M.A. (1969) Topics in Mathematical System
Theory. McGraw-Hill Book Co.
Lasdon, L.S. (1971) Optimization Theory for Large Systems. MacMillan.
Rastrigin, L.A. (1963) "The Convergence of the Random Search Method in
the Extremal Control of a Many Parameter System" Automation and Remote
Control, 24, pp. 1337-1342.
Rosenberg, R. (1967) "Simulation of Genetic Populations with Biochemical
Properties" Doctoral Thesis, Department of Computer and Communication
Sciences, The University of Michigan.
Rosenbrock, H.H. (1960) "Automatic Method for Finding the Greatest or
Least Value of a Function" The Computer Journal, 3, pp. 175-184.
Schumer, M.A. and Steiglitz, K. (1968) "Adaptive Step Size Random Search"
IEEE Trans. on Aut. Control, AC-13, 3.
Shekel, J. (1971) "Test Functions for Multimodal Search Techniques"
Fifth Annual Princeton Conference on Information Science and Systems.
Spang, H.A. (1962) "A Review of Minimization Techniques for Non-Linear
Functions" SIAM Review, 4, pp. 343-365.
Sworder, D. (1966) Optimal Adaptive Control Systems. Academic Press.
Wilde, D.J. (1964) Optimum Seeking Methods. Prentice-Hall.
Wood, C.F. (1964) "Review of Design Optimization Techniques"
Westinghouse Research Laboratories, Science Paper 64-SC4-361-Pl.
Zeigler, B.P. (1969a) "On the Feedback Complexity of Automata" Doctoral
Thesis, Department of Computer and Communication Sciences, The
University of Michigan.