The best theory is inspired by practice,
The best practice is guided by theory.
    - Donald Knuth, IFIP 1992 address
It is difficult to predict a genetic algorithm's behavior on an arbitrary problem. Combining genetic algorithm theory with practice, this chapter uses the average hamming distance as a syntactic metric to derive bounds on the time to convergence of genetic algorithms. Analysis of a flat function provides a worst-case time complexity for static functions. Further, employing linearly computable runtime information, this thesis provides bounds on the time beyond which progress is unlikely on arbitrary static functions. As a byproduct, the analysis also provides qualitative bounds by predicting average fitness.
Natural selection uses diversity in a population to produce adaptation. Ignoring the effects of mutation for the present, if there is no diversity there is nothing for natural selection to work on. Since GAs mirror natural selection, this chapter applies the same principles and uses a measure of diversity for estimating time to stagnation or convergence. A GA converges when most of the population is identical, or in other words, the diversity is minimal. Using average hamming distance (hamming average) as a measure of population diversity this chapter derives bounds for the time to minimal diversity, when the genetic algorithm may be expected to make no further progress (hamming convergence).
Previous work in GA convergence by Ankenbrandt, and Goldberg and Deb focuses on a lower limit on the time to convergence to a particular allele (Ankenbrandt, 1991; Goldberg and Deb, 1991) using fitness ratios to obtain bounds on time complexity. Since GAs use syntactic information to guide their search, it seems natural to use syntactic metrics in their analysis. The analysis here, using hamming averages, can predict the time beyond which qualitative improvements in the solution are unlikely. Since the schema theorem intimately links schema proportions with fitness, bounds on average fitness follow.
The next section identifies the effect of genetic operators on the diversity metric, which is the hamming average. Deriving an upper bound on the expected time to convergence suggests syntactic remedies to the problem of premature convergence and, as a side effect, how to mitigate deception in GAs. Since crossover does not affect the hamming average, extrapolating the change in the hamming average sampled during the first few generations is used to predict the hamming average in later generations. Results on a test suite of functions indicate that accurate predictions are possible.
The average hamming distance of a population is the average distance between all pairs of strings in a population of size N. As each member of the population is involved in N-1 pairs, the sample size from which to calculate the average is:

    N(N-1)/2
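As a concrete sketch (a hypothetical helper, not code from this thesis), the hamming average over all N(N-1)/2 pairs can be computed directly:

```python
from itertools import combinations

def average_hamming(population):
    """Average hamming distance over all N(N-1)/2 pairs of equal-length strings."""
    n = len(population)
    total = sum(sum(c1 != c2 for c1, c2 in zip(s1, s2))
                for s1, s2 in combinations(population, 2))
    return total / (n * (n - 1) / 2)
```

A converged population returns zero, while a randomly initialized binary population returns a value near l/2.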
Let the length of the member strings in the population be l. The hamming average of the initial population of binary strings is well approximated by the normal distribution with mean m, where

    m = l/2

and standard deviation given by

    s = sqrt(l)/2
Ignoring the effect of mutation, the hamming average of a converged population
is zero (mutation increases the hamming average of a converged population by an
amount depending on the probability of mutation). Given that
the initial hamming average is l/2 and the final hamming average is zero, the
effects of selection and crossover determine the behavior of a genetic
algorithm in the intervening time.
Assuming that offspring replace their parents during a crossover, all crossover
operators can be partitioned into two groups based on whether or not they
change the hamming average. If one parent contributes the same alleles to
both offspring (as in masked crossover (chapter )) the
hamming distance between the children is less than the hamming distance between
their parents. This leads to a loss of genetic material, reducing population
hamming average and resulting in faster hamming convergence. The vast majority
of traditional operators, like one-point, two-point,
l-point,
uniform and punctuated crossover [Spears and DeJong, 1991, Schaeffer and Morishima, 1987, Schaeffer and Morishima, 1988, Syswerda, 1989],
do not affect the hamming average of a population. For these crossover operators the following lemma is proved:

Lemma: Crossover alone does not change the population hamming average.
Proof: The hamming average in generation t+1 is proved to be the same as the hamming average at generation t under the action of crossover alone. Assuming a binary alphabet {a, b}, express the population hamming average h(t) at generation t as the sum of hamming averages of l loci, where l is the length of the chromosome. Letting h_i(t) stand for the hamming average of the i-th locus results in:

    h(t) = sum_{i=0}^{l-1} h_i(t)

where h_i(t) is

    h_i(t) = N_a^i(t) N_b^i(t) / (N(N-1)/2)

with N_a^i(t) and N_b^i(t) the number of a's and b's at locus i. In the absence of selection and mutation, crossover only changes the order in which to sum the contributions at each locus; the allele counts N_a^i and N_b^i at every locus are unchanged. That is:

    h_i(t+1) = h_i(t)  for every locus i

The hamming average in the next generation is therefore

    h(t+1) = sum_{i=0}^{l-1} h_i(t+1) = sum_{i=0}^{l-1} h_i(t) = h(t)

Q.E.D.
Having eliminated crossover from consideration since it causes no change in the hamming average, it remains for selection to provide the force responsible for hamming convergence.
Unlike recombination, selection intensity depends on the properties of the search space. It is the domain-dependent or problem-dependent part of a genetic algorithm. But, independent of the domain, it is possible to obtain an upper bound on the time to convergence. Obtaining an upper bound is equivalent to assuming the worst possible search space and estimating the time required for finding a point in this space. For a static function, a space on which an algorithm can do no better than random search satisfies this criterion. The flat function defined by

    f(x) = c, a constant, for all x

contains no useful information for an algorithm searching for a particular point in the space. Thus no algorithm can do better than random search on this function. In terms of selection in genetic algorithms, the flat function assigns an equal fitness to all members of a population. There is no selection pressure to lead to a decrease in hamming average since the probability of selection is now the same for every individual. However, a simple GA without mutation loses diversity and eventually converges. An expression for the time to convergence on such a function gives an upper bound on the time complexity of a genetic algorithm on any static function. The GA's convergence on the flat function is caused by random genetic drift: small random variations in allele distribution cause a GA to drift and eventually converge. The time for a particular allele to become fixed due to random genetic drift is used to derive an upper bound.
If a population of size N contains a proportion p_i of a binary allele i, then the probability that k copies of allele i are produced in the following generation is given by the binomial probability distribution

    P(k) = C(N, k) p_i^k (1 - p_i)^(N-k)

This distribution is useful in calculating the probability of a particular frequency of occurrence of allele i in subsequent generations -- a classical problem in population genetics. Although the exact solution is complex, approximating the probability that allele frequency takes value p in generation t is practical. Wright's approximation for intermediate allele frequencies and population sizes, as given in [Gale, 1990], is sufficient. Let P(p, t) stand for the probability that allele frequency takes value p in generation t, where 0 < p < 1; then

    P(p, t) ≈ 6 p_0 (1 - p_0) (1 - 1/N)^t / (N - 1)

with p_0 the initial allele frequency. The probability that an allele is fixed at generation t is then one minus the probability that it is still segregating:

    P_fixed(t) = 1 - 6 p_0 (1 - p_0) (1 - 1/N)^t

Applying this to a genetic algorithm, assuming a randomly instantiated population at t = 0, p_0 = 1/2 and

    P_fixed(t) = 1 - 1.5 (1 - 1/N)^t

Assuming that alleles consort independently, which is true for a flat function, the expression for the probability that all alleles are fixed at generation t, for a chromosome of length l alleles, is given by

    P_converged(t) = ( 1 - 1.5 (1 - 1/N)^t )^l
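The bound is a one-liner to evaluate. In the sketch below the closed form and its 1.5 coefficient follow the reconstruction above, assuming an initial allele frequency of 1/2 at every locus; treat it as an illustration, not the thesis's exact expression:

```python
def p_all_fixed(pop_size, length, t):
    """Drift approximation: probability that all `length` alleles are fixed
    by generation t, assuming initial allele frequency 1/2 at every locus."""
    per_allele = max(0.0, 1.0 - 1.5 * (1.0 - 1.0 / pop_size) ** t)
    return per_allele ** length
```

With pop_size=30 and length=50 the value at t=50 is still tiny, and it increases monotonically toward 1 as t grows.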
This equation gives the time to convergence for a genetic algorithm on a flat function and is therefore an upper bound for any static function. For
example, on a 50 bit chromosome and a population size of 30, this gives an
upper bound of
on the chance of convergence in 50 generations.
Experimental results get between
and
convergence starting with an
initial hamming average of 50/2 = 25. Previous work by Goldberg and Segrest
on genetic drift gives a more exact albeit more computationally taxing
expression for time to convergence due to genetic
drift [Goldberg and Segrest, 1987]. They also include the effect of mutation in
their analysis, which once again is directly applicable to our problem. In
fact, since the flat function costs one assignment statement and genetic
operators are computationally inexpensive, running a genetic algorithm on a
flat function with mutation and crossover probabilities set to practical values
can cheaply produce an upper bound.
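That last observation is easy to act on. The sketch below (illustrative and seeded for repeatability, not the dissertation's code) runs selection alone on a flat function, where proportional selection degenerates to uniform sampling, and reports the generation at which drift fixes every locus:

```python
import random

def drift_convergence(length=50, pop_size=30, seed=1, max_gen=100_000):
    """Generations until a selection-only GA on a flat function converges by drift."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for gen in range(1, max_gen + 1):
        # Flat fitness: every individual is equally likely to be selected.
        pop = [list(rng.choice(pop)) for _ in range(pop_size)]
        if all(len({ind[i] for ind in pop}) == 1 for i in range(length)):
            return gen
    return max_gen
```

Averaging this over many seeds gives an empirical upper bound on convergence time for the chosen population size and string length.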
Finally, the fact that GAs converge due to random drift gives a theoretical basis to the often observed problem of premature convergence, in which the genetic algorithm converges before high-fitness schemas have had sufficient time to recombine into the global optimum [Thierens and Goldberg, 1993]. The next section suggests one method of alleviating this problem.
Nature uses large population sizes to ``solve'' the premature convergence problem. This is expensive; furthermore, engineers need not be restricted to nature's methods but can do some genetic engineering. Mutation seems a likely candidate and, in practice, is the usual way of maintaining diversity. However, although high mutation rates may increase diversity, mutation's random nature raises problems. Mutation is as likely to destroy good schemas as bad ones, and therefore elitist selection is used to preserve the best individuals in a population. This works quite well in practice, but is unstructured and cannot ensure that all alleles are always present in the population.
Instead of increasing mutation rates, picking an individual and adding its bit complement to the population ensures that every allele is present and the population spans the entire encoded space of the problem. The individual to be complemented can be picked in a variety of ways depending on assumptions made about the search space. Randomly selecting the individual to be complemented makes the fewest assumptions and may be the best strategy in the absence of other information. The best or worst individual could also be selected, or probabilistic methods can be used in choosing whom to complement. Instead of complementing one individual, choosing a set of individuals allows spanning the encoded space in many directions. The most general approach is to maintain the complement of every individual in the population, doubling the population size.
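A single complement-insertion step might look like the following sketch (the function names and the replace-the-worst policy are illustrative assumptions; the experiments below replace the minimum-fitness individual):

```python
def complement(individual):
    """Bitwise complement of a binary individual."""
    return [1 - bit for bit in individual]

def insert_complement(population, fitnesses, source_index):
    """Replace the least-fit individual with the complement of a chosen one."""
    worst = min(range(len(population)), key=lambda i: fitnesses[i])
    population[worst] = complement(population[source_index])
    return population
```

Together, the chosen individual and its complement guarantee that both alleles are present at every locus.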
The presence of complementary individuals also makes a GA more resistant to
deception. Intuitively, since the optimal schema is the complement of the
deceptively optimal schema, it will be repeatedly created with high probability
as the GA converges to the deceptive optimum [Goldberg et al., 1989]. In experiments
the minimum fitness individual was replaced with either the complement of a
randomly picked individual in the current population, or the complement of the
current best individual. Let b_i represent the number of bits needed to represent variable i in a deceptive problem; the deceptive functions used can then be described in this notation. Dec1 and Dec2, which are 10-bit and 20-bit deceptive problems respectively, were used, with superscripts denoting the number of bits needed for each variable.
The figure compares the average fitness after 100 generations of a classical GA (CGA) and a GA with complements (GAC) on Dec1.
Figure: Average fitness over 100 generations of classical GA
and GA with complements. GAC (max) replaces the worst individual with the
complement of the current best individual. GAC (random) replaces the worst
individual with the complement of a random individual in the population
The GAC easily outperforms the classical GA, both in average fitness and the number of times the optimum was found. In most experiments the classical GA fails to find the optimum in 11 runs with a population size of 30 in 100 generations. Since the converged hamming average for the CGA is very small, it is not expected ever to find the global optimum. GAC, on the other hand, easily finds the optimum for Dec1 within 100 generations and usually finds the optimum for Dec2 within a few hundred generations.
The next two figures show the number of times the optimum was found for Dec1 and Dec2 within a total of 100 and 1000 generations respectively. In the figures, GAC (max) replaces the worst individual with the complement of the current best individual; GAC (random) replaces the worst individual with the complement of a random individual in the population.
Figure: Number of times the optimum was found on Dec1 for a classical GA
compared with the same statistic for GAs with complements (GACs).
Figure: Number of times the optimum was found on Dec2 for a classical GA
compared with GAs with complements (GACs).
Although this genetically engineered algorithm can mitigate certain kinds of deception, it is not a panacea and cannot guarantee an optimal solution on all problems. Messy Genetic Algorithms [Goldberg et al., 1989] can make much stronger guarantees, but their computational complexity is daunting. Not surprisingly, GAC also does better than a classical GA on Ackley's One Max and no worse on the De Jong test suite [Ackley, 1987, DeJong, 1975]. In problems consisting of mixed deceptive and easy subproblems, the GAC still does much better than the classical GA.
Predicting performance on an arbitrary function is more difficult than obtaining an upper bound because of the non-linearities introduced by the selection intensity. However, tracking the decrease in hamming average while a GA is working on a particular problem allows predicting the time to hamming convergence.
Without mutation a GA reduces the hamming average from l/2 to zero. Therefore predicting the time to convergence reduces to the problem of finding the rate of decrease in hamming average. A general model for the change in hamming average per generation,

    h(t+1) = g(h(t))

relates the hamming average h(t) in generation t to the hamming average h(t+1) in the next generation. Finding g and solving the recurrence is enough to predict the time to convergence.
The population hamming average at generation t can be expressed as the sum of hamming averages of l loci, where l is the length of the chromosome. Letting h_i(t) stand for the hamming average of the i-th locus gives:

    h(t) = sum_{i=0}^{l-1} h_i(t)

Assuming a binary alphabet {a, b} and a large population, the hamming average of a locus in the current population is well approximated by the product of the number of a's and the number of b's divided by the sample size N(N-1)/2, where N is the population size. More formally, letting N_a^i(t) and N_b^i(t) be the number of a's and b's at a locus i,

    h_i(t) = N_a^i(t) N_b^i(t) / (N(N-1)/2)

Since a and b are first-order schemas competing at locus i, their growth rates are given by the schema theorem. The expected growth rate for a schema s with fitness f(s) due to selection alone is

    m(s, t+1) = m(s, t) f(s) / F(t)

Here m(s, t) is the number of copies of s and F(t) is the population's average fitness in generation t.
As only first-order schemas are being considered, no modification is needed to incorporate crossover's effects. Ignoring mutation for the present, the hamming average in generation t+1 at locus i can be calculated as follows. First:

    h_i(t+1) = N_a^i(t+1) N_b^i(t+1) / (N(N-1)/2)

Then, using the schema theorem, h_i(t+1) can be expressed in terms of h_i(t). Writing f_a and f_b for the fitnesses of the first-order schemas a and b at locus i,

    h_i(t+1) = h_i(t) f_a f_b / F(t)^2

The solution to this recurrence is

    h_i(t) = h_i(0) (f_a f_b)^t / ( prod_{j=0}^{t-1} F(j) )^2
The denominator needs to be computed. This is an interesting problem not only because it helps in predicting the hamming average, but because it implies the possibility of fitness prediction. Once again concentrating on first-order schemas, consider the average fitness F_i(t) in generation t at locus i:

    F_i(t) = ( N_a^i(t) f_a + N_b^i(t) f_b ) / N

But F_i(t) also approximates the population average fitness F(t) in low epistasis problems. The approximation gets better as epistasis decreases and the estimates of first-order schema fitnesses get better.
Keeping this assumption in mind and dropping the locus subscript,

    F(t) = ( N_a(t) f_a + N_b(t) f_b ) / N

Using the schema theorem to replace the t terms on the right-hand side, repeated application of the growth equation gives

    N_a(t) = N_a(0) f_a^t / prod_{j=0}^{t-1} F(j)

and similarly for N_b(t). Therefore

    F(t) = ( N_a(0) f_a^{t+1} + N_b(0) f_b^{t+1} ) / ( N prod_{j=0}^{t-1} F(j) )

Multiplying both sides by prod_{j=0}^{t-1} F(j),

    prod_{j=0}^{t} F(j) = ( N_a(0) f_a^{t+1} + N_b(0) f_b^{t+1} ) / N

In general,

    prod_{j=0}^{t-1} F(j) = ( N_a(0) f_a^t + N_b(0) f_b^t ) / N

which is the needed expression. This equation's significance in fitness prediction is discussed in the next section. For the present, substituting it into the solution of the recurrence gives h_i(t) in terms of the initial hamming average.
For a randomly initialized population, N_a(0) = N_b(0) = N/2. Therefore

    prod_{j=0}^{t-1} F(j) = ( f_a^t + f_b^t ) / 2

Rearranging terms,

    h_i(t) = h_i(0) (f_a f_b)^t / ( (f_a^t + f_b^t)/2 )^2

Here h_i(0) is the initial hamming average at locus i. For a randomly initialized population, the hamming average for each locus is 1/2. Replacing h_i(0) by 1/2,

    h_i(t) = 2 (f_a f_b)^t / ( f_a^t + f_b^t )^2

Summing over i from 0 to l-1 gives the population hamming average in generation t:

    h(t) = sum_{i=0}^{l-1} 2 (f_a f_b)^t / ( f_a^t + f_b^t )^2

The expression for the hamming average in generation t thus depends only on the fitnesses of the alleles in the chromosome.
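Given estimates of the first-order schema fitnesses, the per-locus prediction 2(fa fb)^t / (fa^t + fb^t)^2 (the no-mutation closed form as reconstructed in this chapter's derivation) is immediate to evaluate; the fitness pairs below are hypothetical inputs:

```python
def predicted_hamming(allele_fitness_pairs, t):
    """h(t) = sum over loci of 2 (fa fb)^t / (fa^t + fb^t)^2, no mutation.

    allele_fitness_pairs: one (fa, fb) pair of first-order schema fitnesses
    per locus."""
    return sum(2.0 * (fa * fb) ** t / (fa ** t + fb ** t) ** 2
               for fa, fb in allele_fitness_pairs)
```

At t = 0 every locus contributes 1/2, recovering the initial hamming average l/2; a locus with fa = fb (no selection pressure) keeps contributing 1/2, while any fitness gap drives its contribution to zero.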
With mutation, the hamming average does not converge to zero; rather, a population converges when the rate of change of the hamming average is very low or zero. Consider mutation's effect on the number of a's and b's. Mutation flips a bit with probability μ, changing the number of copies of a and b in subsequent generations. Without selection,

    N_a(t+1) = (1 - μ) N_a(t) + μ N_b(t)

since the a's become b's and vice-versa. Incorporating selection,

    N_a(t+1) = ( (1 - μ) N_a(t) f_a + μ N_b(t) f_b ) / F(t)
The equation for the hamming average at locus i in generation t+1 is again

    h_i(t+1) = N_a(t+1) N_b(t+1) / (N(N-1)/2)

To solve this, N_a(t+1) N_b(t+1) is needed in terms of N_a(t) N_b(t). Substituting the selection-with-mutation recurrence for N_a(t+1), and its counterpart for N_b(t+1), and collecting terms yields a recurrence of the form

    h_i(t+1) = α(t) h_i(t) + β(t)

where α(t) and β(t) collect the selection and mutation coefficients. The solution follows by iterating the recurrence, rewriting the second term in terms of the allele counts and handling the remaining term by a similar substitution. Unlike the case without mutation, however, it is difficult to find a simple expression for the product of fitness averages, and the calculations implied by the resulting expression are cumbersome. For practical prediction this chapter resorts to an approximate model.
The intuition needed to address the problem is that the schema theorem works both ways. If schema fitnesses are known, proportions can be predicted; if proportions are tracked, fitnesses can be predicted. Rewriting the schema theorem as

    f(s) / F(t) = m(s, t+1) / m(s, t)

keeping track of m(s, t) over a few generations should give an estimated fitness for f(s) relative to the population average; call this the adjusted fitness. Mutation can be handled in the same way, starting with the selection-with-mutation recurrence for the allele counts.
With mutation, the two first-order schemas a and b at a locus satisfy the pair of expressions

    m_a(t+1) = ( (1 - μ) f_a m_a(t) + μ f_b m_b(t) ) / F(t)
    m_b(t+1) = ( (1 - μ) f_b m_b(t) + μ f_a m_a(t) ) / F(t)

Solving for f_b/F(t) using the second expression, substituting back in the first expression and solving for f_a/F(t) leads to

    f_a / F(t) = ( (1 - μ) m_a(t+1) - μ m_b(t+1) ) / ( (1 - 2μ) m_a(t) )

and subsequently an expression for f_b/F(t):

    f_b / F(t) = ( (1 - μ) m_b(t+1) - μ m_a(t+1) ) / ( (1 - 2μ) m_b(t) )

The two equations above can be used to estimate first-order schema fitnesses by tracking their proportions over time. Using these estimates in the expression for average fitness in generation t gives

    F(t) = ( m_a(t) f_a + m_b(t) f_b ) / N
where N is the population size. This results in being able to predict fitness averages. In other words, tracking first-order schema proportions allows estimating their fitness. Next, using these first-order schema fitnesses allows predicting their future proportions from the schema theorem. Finally, knowing first-order schema fitnesses and proportions allows predicting the subsequent population average fitness, which is needed for estimating future proportions, and so on. Although a closed-form solution is not available for the case with mutation, the recurrences can be programmed and run on a computer for a large number of generations.
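For example, the per-locus recurrence can be iterated numerically. The sketch below assumes fitness estimates fa and fb are already in hand (the values in the usage are hypothetical) and tracks allele counts and average fitness one generation at a time:

```python
def iterate_locus(ma, mb, fa, fb, mu, generations):
    """Iterate first-order schema counts at one locus under selection and
    mutation; returns a list of (ma, mb, average_fitness) per generation."""
    history = []
    n = ma + mb
    for _ in range(generations):
        f_avg = (ma * fa + mb * fb) / n              # population average fitness
        sa, sb = ma * fa / f_avg, mb * fb / f_avg    # counts after selection
        ma = (1 - mu) * sa + mu * sb                 # mutation exchanges a and b
        mb = (1 - mu) * sb + mu * sa
        history.append((ma, mb, f_avg))
    return history
```

With mu = 0 the favored allele takes over completely; any positive mutation rate leaves an equilibrium mixture, which is what keeps the converged hamming average above zero.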
Assuming that string similarity implies fitness similarity, the variance of fitness is a measure of the nonlinearities in the genome and hence a measure of epistasis. Fortunately, competing schemas with a large fitness variance feel high selection pressure and quickly lose diversity, diminishing epistatic effects. Since a GA exponentially reduces high variances, O(log N) generations should suffice to reduce most nonlinear effects.
The following methodology is therefore used for practical prediction:
Using an adjusted fitness computed this way, it is possible to predict the growth rate of first-order schema and therefore the population average fitness and population hamming average in subsequent generations.
Consider the effect of newly discovered schemas on the predictions. Fresh schemas that decrease adjusted fitnesses have little effect, since they are quickly eliminated by selection. When higher-fitness schemas are discovered by crossover, the fitness of alleles participating in these schemas increases. This means that the predicted hamming average will be greater than or equal to the observed value, giving an upper limit on the time to convergence. The predictions therefore provide upper bounds on the hamming average and on the time to convergence to that hamming average. Such an estimate is more desirable than an overly optimistic one: it is better to let a GA run too long than to stop it early.
The variance in adjusted fitness for an allele determines the number of generations that must be sampled to get an accurate estimate of the allele's fitness, and it bounds the current fitness of schemas in which the allele participates. At a particular locus, using the allele with the least variance to predict growth gives a conservative estimate of the hamming average: low variance implies a good approximation to the current fitness of the allele and provides an upper bound on the predicted hamming average. Similarly, using the allele with the greater variance gives an optimistic estimate or lower bound. These bounds should surround the predicted value of the hamming average. Results given in the following sections support this preliminary analysis.
Noisy functions and noise induced by mutation may cause less desirable, more unpredictable behavior. This problem can be solved by having a larger sample space, that is, using more generations to calculate adjusted fitnesses.
For much the same reasons above, calculating adjusted fitnesses allows predicting a lower limit on the fitness average because these fitnesses only incorporate the effects of current schemas. Any new schemas (created by crossover) that increase the adjusted fitness of an allele can lead to a higher than predicted fitness average. In multimodal functions, the opposite may happen. The GA could initially concentrate around one optimum and then move toward another optimum. Since the calculated adjusted fitnesses may no longer be correct and it takes a GA time to overcome the inertia created by the first optimum, the predicted fitness average may be higher than the actual fitness average.
Since hamming average prediction is quite robust and a first-order approximation may be interesting enough, a linear approximation of g in the general model can be tried for prediction. Assuming a quick and simple model for the case without mutation,

    h(t+1) = a h(t)

whose general solution is given by

    h(t) = h(0) a^t

This chapter compares this simple equation's predictions with the predictions generated using the preceding analysis. The task is to predict the hamming average at generation 50 for the following functions:
All problems were reduced to maximization problems. The GA population size in all experiments was 30, using proportional selection and two-point crossover. The constant a in the linear model was computed from successive hamming averages in generations 0 through 10. The table summarizes the results. The last two columns predict lower and upper bounds from the preceding analysis, using proportions sampled in generations 5 through 9 only. Since the encoding for the DeJong functions used a sign bit, 5 samples were optimistically chosen as sufficient. The experimental values are averages over 11 runs with no mutation. The standard deviations over the 11 runs of the GA are given in parentheses.
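The curve fit itself is tiny; the sketch below (an illustration, not the thesis's code) estimates a as the mean ratio of successive sampled hamming averages and extrapolates with the general solution h(t) = h(0) a^t:

```python
def fit_decay_rate(hamming_samples):
    """Estimate a in h(t+1) = a h(t) from successive hamming averages."""
    ratios = [h1 / h0 for h0, h1 in zip(hamming_samples, hamming_samples[1:])
              if h0 > 0]
    return sum(ratios) / len(ratios)

def extrapolate(h0, a, t):
    """General solution of the linear model: h(t) = h(0) a^t."""
    return h0 * a ** t
```

Feeding in the hamming averages of generations 0 through 10 and extrapolating to generation 50 reproduces the kind of prediction tabulated below.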
Table: Comparing actual and predicted hamming averages (no
mutation)
Surprisingly, the values predicted using the rough approximation are quite good and are within a standard deviation of the observed values. That good results come from even such a simple model clearly indicates the validity of this approach. The predicted upper and lower bounds include the observed values except for OneMaxI, whose lower bound is greater than the observed hamming average. This implies a large variance in sampled allelic fitnesses: intuitively, the observed fitness of an allele depends on the number of ones in the entire chromosome, leading to the observed conservative estimate in the absence of mutational noise. OneMax should be expected to create greater problems when mutation is included.
Next, realizing that mutation makes the final hamming average greater than zero, another related simplified model is used to predict the hamming average at convergence for the same set of problems, but with the mutation rate set to 0.01. The model is given below:

    h(t+1) = a h(t) + b

The general solution is

    h(t) = a^t ( h(0) - b/(1 - a) ) + b/(1 - a)
In these experiments the GA ran until hamming convergence, when the hamming average did not change significantly over time. Experimentally, 500 generations was sufficient for the problem set. Calculating the values of a and b using the method of least squares, the hamming average at convergence (generation 500) is predicted by setting h(t+1) = h(t) and solving for the fixed point, resulting in

    h = b / (1 - a)
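A least-squares fit of a and b over successive pairs (h(t), h(t+1)), together with the implied convergence value b/(1 - a), can be sketched as follows (synthetic data in the usage; not the thesis's code):

```python
def fit_affine(h):
    """Least-squares fit of h(t+1) = a h(t) + b over successive samples."""
    x, y = h[:-1], h[1:]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    a = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    return a, my - a * mx

def converged_hamming(a, b):
    """Fixed point of h(t+1) = a h(t) + b, i.e. h = b / (1 - a)."""
    return b / (1 - a)
```

On samples generated exactly by h(t+1) = 0.5 h(t) + 2 the fit recovers a and b and predicts a converged hamming average of 4.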
The table compares the observed hamming average at generation 500 with the predicted values. Column three tabulates the hamming average at convergence and its standard deviation over the 11 runs. Columns four and five are the predicted hamming averages using the simplified model and data from generations 5 through 20 and 5 through 45 respectively. The simpler model (curve fitting) does much worse when mutation is incorporated: it requires more samples, and even predicts negative hamming averages at convergence. However, the more refined model developed in this chapter needs no extra information and therefore still uses samples from generations 5 through 9 to produce the last two columns.
Table: Comparing actual and predicted hamming averages with
the probability of mutation set to 0.01, the *'s indicate that the
negative values predicted in some runs were discarded.
The predictions of the model developed in this chapter, displayed in the table, are very close to the observed values. Although the hamming average after 500 generations is large for F4 and OneMaxII, there is no significant change even after 2000 generations. This indicates that once a GA reaches the equilibrium hamming average, some other search method should be tried, especially if the converged hamming average is large. Predicted upper and lower bounds are still good except for OneMaxII. Here, noise induced by mutation is the major culprit. The high variance in allele fitnesses suggests using a larger sample space to increase the signal-to-noise ratio. Sure enough, using an additional 15 generations to sample fitnesses suffices to correct the problem and give good bounds on OneMaxII.
Since the simplified curve-fitting model gave good results on hamming average prediction, it was used for fitness prediction as well. The table predicts the average fitness (F) at convergence without mutation. Column two gives the initial average fitness and column three the converged average fitness in generation 50. The standard deviations of the observed and predicted average fitnesses over the 11 runs are in parentheses. The next two columns are the predicted fitness averages from fitnesses sampled in generations 5 through 15 and 5 through 25 respectively. The last two columns use the adjusted-fitness estimates to provide conservative and optimistic bounds on the average fitness. As stated earlier, both bounds are expected to underestimate the observed fitness unless the function is multimodal or has high fitness variance. Since the average fitness is a byproduct of hamming average prediction, the hamming average data from generations 5 through 10 is used for fitness prediction. The flat function is not considered.
Table: Comparing actual and predicted fitness averages (no
mutation)
The simple linear model's predictions of average fitness are also within about one standard deviation of the observed values. It is more interesting to look at the upper and lower bounds. As expected, F2 and F5, which are multimodal, have higher predicted fitnesses. The other predictions serve as conservative and optimistic lower bounds.
The table summarizes results on the same set of problems with the mutation probability set to 0.01. Mutation's effect forces good predictions to use even larger data samples for the simpler linear model. The upper and lower bounds again fall out of the hamming average data for generations 5 through 9 with mutation.
Table: Comparing actual and predicted fitness averages with the
probability of mutation set to 0.01
These preliminary results support this chapter's model of genetic algorithm behavior. Although curve fitting has its uses in providing a very rough approximation, it cannot predict the effects of changes in GA parameters and does badly in the presence of mutation. Using the preceding analysis it is easy to predict the time to convergence on a problem: this is just the time at which the change in hamming average becomes insignificant. Knowing the available computing resources fixes the hamming average below which exhaustive search of the remaining space can be undertaken, and therefore the time needed to deliver on an application.
In practice, instrumenting a GA to keep estimates of allele fitnesses and proportions should reduce the time spent experimenting with various parameter settings and let the developer know the computing resources needed. If the problem encoding is highly epistatic or otherwise undesirable from a prediction point of view, just keeping track of the hamming average and its rate of change indicates the progress of a genetic algorithm on the problem. In addition, incorrect predictions from the theory indicate that the problem encoding exhibits these undesirable properties. This is useful information about the problem space and can be used in changing the GA's parameters or the problem's encoding.
Analyzing a GA is complicated because of its dependence on the application domain and our lack of understanding of the mapping between our view of the application domain and the GA's view of the same domain. The GA's view of the domain is implicit in the chosen encoding. However, the overall behavior is predictable and can be deduced from sampling suitable metrics.
Analysis of the syntactic information in the encoding gives a surprising amount of knowledge. This chapter derived an upper bound on the time complexity from considering the effects of drift on allele frequency. Then, from a model of the effect of selection and mutation on the behavior of genetic algorithms, the analysis was able to approximate two solution characteristics at convergence for static search spaces.
This dissertation has helped develop the foundations for the use of genetic algorithms as a computational tool for design. The last chapter summarizes these developments, discusses their limitations and makes explicit the questions arising from the research.