Next: Results Up: CBR Assisted Explanation of Previous: The System

The Case-Base

After clustering the first few hundred individuals using genotype and fitness values, a case-base is created. As explained above, clustering determines the indexing scheme of the case-base. All abstract cases, which are indexed at the internal nodes of the tree, must be recursively computed from the leaves up. Although much of the initial case data saved during the GA run is not used by the clustering algorithm, it is added to the cases in the case-base.

Cases include the following information:

Case number
Distance from the root of the tree to the level of the case
Schema for the case (see discussion in §1)
Schema order
Average fitness
Weight: Number of leaves (individuals) below
Generation information: the earliest and latest leaf occurrence as well as the average in the subtree
Additive schema: Bitwise sum of alleles of the genotypes of all the leaves below

Using the information supplied in the case-base, our system produces a report regarding the GA run so far. Using fitness as a metric, computing the top or bottom ranked n cases is easy, as is finding the top or bottom ranked n case schemas. These schemas can be combined to make new schemas by applying standard schema creation rules. For example, the two schemas **110**10 and **100***0 combine to the new schema **1*0***0. The resulting schemas show (among other things) which alleles are important in the genotype and which ones are not. Since order, longevity, and weight information is available, it is straightforward to find the top few cases with given sets of these characteristics.

We find it useful to calculate not only the schema which includes the subset of leaf cases, but one which records in a more democratic fashion the relative frequency of allele settings for the positions which are not absolutely fixed. Since the normal schema-producing rules result in a ``*'' wherever there is disagreement about an allele among leaves, schemas near the top of the tree tend to be filled with ``*''s. To avoid this information loss, we compute an additive schema by computing the bitwise sum of the genotypes associated with the leaves below.

An elected schema can be computed by dividing the additive schema by the weight and squashing the results to 0, 1, or ``*'' using thresholds that can be changed. Elected schema information provides a more informative way of looking at upper-level schemas. Elected schemas play a large role in hypothesis generation discussed in §7.

A schema actually describes a convex hull for the leaves from which it was computed. That is, it describes a portion of the space within which the unaltered leaves fit. Since the normal schema-producing rules result in top-level schemas with a large number of ``*''s and the hull tends to span too large a portion of the entire space, we prefer to describe a smaller hull, one for which the average hamming distance from an actual leaf to the hull surface is less than some predetermined constant. In other words, the leaves deviate from their closest encompassed relative by only a small amount on average. The elected schema describes this smaller hull.

Generation information can be used to determine when each part of the search space covered by the GA was considered. This information is very useful in detecting premature convergence, finding local minima, and computing the relative longevity of a subspace. The search space itself can be carved into approximately equally populated subsections by chopping the tree at nodes having a certain maximum weight. Analysis of these subsections across dimensions of fitness and longevity sheds light on the search space.

All of this processed information is available to the user of our system on-line. If desired, any case can be displayed. Also available is the tree information in graphical form with case numbers in their proper locations. The graphics are zoomable so that subtrees can be thoroughly investigated. The ``report'' is meant for user consumption and usually raises some questions which can be answered by pulling more information from the system.

It is important to emphasize that our system is an interactive tool for use in explaining GAs. Although it automates the information extraction and processing to some extent, human perception is a critical ingredient in the final analysis. However, without reasonable organization and pre-processing which our system provides, the large amount of data created during a GA run is impossible to digest even for an experienced GA researcher.

Next: Results Up: CBR Assisted Explanation of Previous: The System

Sushil J. Louis
Wed Jun 25 14:10:37 PDT 1997