Learning Decision Trees



Abalone is a common name for any of a group of small to very large edible sea snails. The age of an abalone is normally determined by cutting the shell through the cone, staining it, and counting the number of rings under a microscope. We would like to find a way to identify the age of an abalone using decision tree learning on a data set obtained from the University of California Irvine machine learning repository.

Fig 1: An image of an abalone

The data set consists of 4177 instances, each described by 8 attributes and one class, the number of rings. The 8 attributes are sex, length, diameter, height, whole weight, shucked weight, viscera weight, and shell weight.


Preparing the Data Set

A careful study of the data set revealed two outliers related to the height attribute: values of 0.515 and 1.13, compared with the rest of the data, which lies between 0 and 0.3 ( see Fig 2 ). These two instances were removed, leaving 4175 instances in our data set. Furthermore, the class had a range of 1 to 29 rings, and for the tree algorithm to work well we had to discretize this large range into a smaller number of classes. Initially a split of 0 - 9 rings for young abalone, 10 - 18 for adult, and 19 - 29 for old abalone was chosen, but because very few instances fell into the old group, a reorganization was adopted, resulting in a binary class: 0 - 9 rings for young abalone and 10 - 29 for old ones ( see training file and testing file ). This was achieved with an IF function in MS Excel; a Python sketch of the same preparation appears after Table 1. No missing values were observed in the data.

Rings     Age
0 - 9     Young
10 - 29   Old

Table 1: Discretizing the ring class
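
A minimal pandas sketch of this preparation ( the original work used MS Excel; the file name abalone.data and the column order are assumptions based on the standard UCI distribution ):

import pandas as pd

cols = ["sex", "length", "diameter", "height", "whole_weight",
        "shucked_weight", "viscera_weight", "shell_weight", "rings"]
df = pd.read_csv("abalone.data", header=None, names=cols)  # raw UCI file (assumed name)

# Drop the two height outliers (0.515 and 1.13); all other heights lie below 0.3.
df = df[df["height"] <= 0.3]          # 4175 instances remain

# Binary discretization of the class: 0-9 rings -> Young, 10-29 -> Old.
df["age"] = df["rings"].apply(lambda r: "Young" if r <= 9 else "Old")
df = df.drop(columns="rings")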

Fig 2: Dotplot for Height vs Sex

The data set was then divided in two: 500 randomly selected instances for testing, with the rest used for training and validation. A random number was generated for every instance, the rows were sorted by that number, and the first 500 were picked for testing while the remainder were kept for training. All of this was done in MS Excel, and the two files obtained were converted to ARFF files to be processed by WEKA ( an open-source machine learning program ).
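
Continuing the sketch above, the same split and ARFF conversion can be scripted; the seed and the output file names are arbitrary choices, not taken from the original work:

# Shuffle by assigning a random key to every row, mirroring the Excel procedure.
df = df.sample(frac=1, random_state=42).reset_index(drop=True)
test_df, train_df = df.iloc[:500], df.iloc[500:]

def write_arff(frame, path, relation="abalone"):
    """Write a data frame in the minimal ARFF format WEKA expects."""
    with open(path, "w") as f:
        f.write(f"@relation {relation}\n\n")
        f.write("@attribute sex {M,F,I}\n")
        for num in ["length", "diameter", "height", "whole_weight",
                    "shucked_weight", "viscera_weight", "shell_weight"]:
            f.write(f"@attribute {num} numeric\n")
        f.write("@attribute age {Young,Old}\n\n@data\n")
        for _, row in frame.iterrows():
            f.write(",".join(str(v) for v in row) + "\n")

write_arff(train_df, "training.arff")
write_arff(test_df, "testing.arff")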


Training The Data Set

The training file was loaded into WEKA for the learning process. The C4.5 learning algorithm ( WEKA classifier J48 ) was selected as the decision tree learner. C4.5 is a popular decision tree learner with major advantages: it can handle numeric attributes and many other attribute types as well as missing values, and it implements pruning. The minimum number of instances per leaf was set to 50, and post-pruning was activated with a confidence factor of 0.25. Two types of pruning were used in the data processing: post-pruning and online pruning. Post-pruning in the C4.5 algorithm evaluates the decision error ( estimated percent misclassifications ) at each decision junction and propagates this error up the tree. Its effect is controlled by the confidence factor: lowering the confidence factor increases the amount of post-pruning performed on the training set. Online pruning operates on the decision tree while it is being induced: whenever a split yields a child leaf representing fewer than the minimum number of instances per leaf, the parent node and its children are compressed into a single node. These two pruning methods help curb overfitting and reduce the size of the tree.
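
The same J48 configuration can be reproduced in a script; a sketch assuming the python-weka-wrapper3 package is installed ( -C is the confidence factor, -M the minimum number of instances per leaf ):

import weka.core.jvm as jvm
from weka.core.converters import Loader
from weka.classifiers import Classifier

jvm.start()

# Load the training file and mark the last attribute (age) as the class.
loader = Loader(classname="weka.core.converters.ArffLoader")
data = loader.load_file("training.arff")
data.class_is_last()

# J48 with a 0.25 confidence factor and at least 50 instances per leaf.
j48 = Classifier(classname="weka.classifiers.trees.J48",
                 options=["-C", "0.25", "-M", "50"])
j48.build_classifier(data)
print(j48)  # the induced tree in text form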

10-fold cross-validation was used to evaluate the performance of the classifier. It is a standard method, confirmed by extensive experiments to be a good choice for obtaining accurate error estimates, and stratification further reduces the variance of the estimate. The training data is split into 10 subsets of equal size; each subset in turn is used for testing while the remaining nine are used for training. WEKA handles all ten of these train/test combinations automatically.
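
Continuing the sketch above, WEKA performs the 10-fold cross-validation itself ( again assuming python-weka-wrapper3; the seed is arbitrary ):

from weka.classifiers import Evaluation
from weka.core.classes import Random

evaluation = Evaluation(data)
evaluation.crossvalidate_model(j48, data, 10, Random(1))  # 10 stratified folds
print(evaluation.summary())   # accuracy estimate
print(evaluation.matrix())    # confusion matrix

jvm.stop()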


Results

After training with 10-fold cross-validation on the data set, an optimized tree model of size 27 with 15 leaves was achieved. The correctly classified instances numbered 2870 ( 78.1 % of the training set ) and the incorrectly classified instances 805 ( 21.9 % ). The confusion matrix for the training is shown in Table 2. Taking Young as the positive class, 1426 instances were true positives and 1444 were true negatives, while 413 were false negatives and 392 were false positives.

                Predicted Young   Predicted Old
Actual Young         1426              413
Actual Old            392             1444

Table 2: Confusion matrix for the training set
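
The reported figures follow directly from this matrix; a quick check in Python ( rows are actual classes, columns predicted, Young first ):

tp, fn = 1426, 413   # actual Young: predicted Young / predicted Old
fp, tn = 392, 1444   # actual Old:   predicted Young / predicted Old

total = tp + fn + fp + tn            # 3675 training instances
accuracy = (tp + tn) / total         # 2870 / 3675 ~ 0.781
error_rate = (fn + fp) / total       # 805 / 3675 ~ 0.219
print(f"accuracy={accuracy:.3f}, error={error_rate:.3f}")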

The output results of the final model can be seen here. It is evident that the learning algorithm did not depend on the attributes length, height, and viscera weight, but rather used sex, diameter, whole weight, shucked weight, and shell weight to predict the age of the abalone. This reduction occurred because of the level of pruning that was applied.

The results indicate a modest tree size with good performance. Varying the pruning parameters and the ring discretization yielded poorer performance and bigger trees ( see the outputs model 1.txt, model 2.txt and model 3.txt ).

Minimum instances per leaf     30       50       60
Training performance          78.9%    78.1%    78.6%
Test performance              77.6%    78.0%    77.2%

Table 3: Changes in online pruning

Fig 3: The learned tree for our model


Tree Rules

if shell_weight <= 0.249 then
    if shell_weight <= 0.154 then age = Young ------------------ (1140.0/147.0)
    if shell_weight > 0.154 then
        if shucked_weight <= 0.3025 then
            if sex = M then
                if shell_weight <= 0.1775 then age = Young ------ (55.0/26.0)
                else age = Old ---------------------------------- (77.0/17.0)
            if sex = F then age = Old --------------------------- (142.0/34.0)
            if sex = I then
                if whole_weight <= 0.6175 then age = Young ------ (114.0/27.0)
                else age = Old ---------------------------------- (52.0/23.0)
        if shucked_weight > 0.3025 then
            if shucked_weight <= 0.3745 then
                if whole_weight <= 0.788 then age = Young ------- (159.0/40.0)
                else age = Old ---------------------------------- (69.0/25.0)
            if shucked_weight > 0.3745 then age = Young --------- (184.0/39.0)

if shell_weight > 0.249 then
    if shucked_weight <= 0.3965 then age = Old ------------------ (356.0/55.0)
    if shucked_weight > 0.3965 then
        if shell_weight <= 0.2905 then
            if sex = M then
                if diameter <= 0.46 then age = Old -------------- (57.0/27.0)
                else age = Young -------------------------------- (59.0/21.0)
            if sex = F then age = Young ------------------------- (87.0/41.0)
            if sex = I then age = Old --------------------------- (22.0/9.0)
        if shell_weight > 0.2905 then age = Old ----------------- (1102.0/192.0)

The numbers in brackets after each leaf node indicate the number of training instances assigned to that leaf, followed by the number of those instances that it misclassifies.
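
For reference, the rules above translate directly into a runnable Python function; the attribute names follow the rule listing, and sex is assumed to be coded as "M", "F", or "I" as in the original data set:

def predict_age(sex, diameter, whole_weight, shucked_weight, shell_weight):
    """Classify an abalone as 'Young' or 'Old' using the J48 rules above."""
    if shell_weight <= 0.249:
        if shell_weight <= 0.154:
            return "Young"
        if shucked_weight <= 0.3025:
            if sex == "M":
                return "Young" if shell_weight <= 0.1775 else "Old"
            if sex == "F":
                return "Old"
            return "Young" if whole_weight <= 0.6175 else "Old"  # sex == "I"
        if shucked_weight <= 0.3745:
            return "Young" if whole_weight <= 0.788 else "Old"
        return "Young"
    if shucked_weight <= 0.3965:
        return "Old"
    if shell_weight <= 0.2905:
        if sex == "M":
            return "Old" if diameter <= 0.46 else "Young"
        if sex == "F":
            return "Young"
        return "Old"  # sex == "I"
    return "Old"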


Testing the Model

Testing on the 500 randomly selected instances yielded prediction accuracy similar to the 10-fold cross-validation estimate. The correctly classified instances were 390 ( 78 % ) and the incorrectly classified instances 110 ( 22 % ). The precision for the Young class was 77.4 % and that for the Old class 78.6 %. The confusion matrix for our prediction is shown in Table 4. The number of true positives ( 206 ) is about four times the number of false negatives ( 50 ), while the true negatives ( 184 ) are roughly three times the false positives ( 60 ).

                Predicted Young   Predicted Old
Actual Young          206               50
Actual Old             60              184

Table 4: Confusion matrix for the test set
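
The stated precisions follow from this matrix in the same way ( Young is the positive class ):

tp, fn = 206, 50     # actual Young row
fp, tn = 60, 184     # actual Old row

accuracy = (tp + tn) / (tp + fn + fp + tn)   # 390 / 500 = 0.78
precision_young = tp / (tp + fp)             # 206 / 266 ~ 0.774
precision_old = tn / (tn + fn)               # 184 / 234 ~ 0.786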

The difference between the estimated performance and the measured test performance is marginal, an indication that the model did not suffer from overfitting. The lack of a higher level of prediction accuracy may be due to the amount of noise that exists in the data. For instance, the infant group tends to have a larger number of rings than expected. Moreover, males tended to have fewer rings than females and therefore tended to fall into the young class. Also, discretizing other attributes such as height and length gave almost the same training performance ( i.e. 78.9% ) but worsened the prediction performance ( accuracy of 71%, see model 4.txt ).


Conclusion

Our quest to develop a decision tree to predict the age of abalone from 8 of its attributes was successful, with a prediction accuracy of 78%. A few outliers ( 2 instances with abnormal heights ) had to be removed to obtain a data set representative of the majority. Discretizing the number of rings played a crucial role in the performance of the learning process: a binary class yielded higher performance than a trinary class. The data was split into training and testing sets by a random process, and 10-fold cross-validation together with online and post-pruning was used for training. It was observed that the type of pruning and the method of training determine how good the classification will be in terms of tree size, accuracy, and overfitting. One major discovery was that the predicted age of an abalone was independent of the length, height, and viscera weight; these attributes did not affect the quality of the outcome. Hence, only 5 attributes were needed to determine the age of an abalone.


WEKA tree model file