Test and Evaluate the Performance of Sorting Methods

Qinxue Chen and Wei Liu

1. Motivation of the project
Sorting the keys of a list into nondecreasing order is a very practical task in the real world. There are several good reasons for studying sorting algorithms. First, they are of practical use because sorting is done often. Just as having the entries in telephone books and directories in alphabetical order makes them easy to use, working with large sets of data in computers is facilitated when the data are sorted. Second, quite a lot of sorting algorithms have been devised, and studying a number of them impresses upon us the fact that one can take many different points of view toward the same problem. The discussion of the sorting algorithms in this project should provide some insight into the questions of how to improve a given algorithm and how to choose among several. Third, sorting is one of the few problems for which we can easily derive good lower bounds for worst-case and average behavior. The bounds are good in the sense that there are algorithms that do approximately the minimum amount of work specified [Baase, 1996].

In this project, three sorting algorithms, namely Insertion sort, Heapsort, and Quicksort, are chosen for our performance test and evaluation. They compare keys but perform no other operations on them, and they are all internal sorts. The measure of work used for analyzing the algorithms is the number of comparisons of keys. The number of assignments of list entries is also considered, because moving entries around contributes to the running time as well. CPU time is another very important criterion, because it directly tells how fast a sorting algorithm is. Once we have tested and evaluated the performance of the three sorting methods, we can choose an appropriate sorting algorithm for a particular problem.

In the next section, we discuss the properties of the three sorting algorithms. In section 3, the computation setup is presented. In section 4, we discuss all the algorithms thoroughly, state their advantages and disadvantages, and describe an application strategy for the three sorting algorithms. In section 5, we draw the conclusions of the project.

2. Theory
1) Insertion sort

Insertion sort is one of the most natural and general sorting algorithms, and its worst-case and average-case analyses are easy. The algorithm is based on the following steps. Assume that at some step the list of numbers to be sorted has been split into two sublists, where the elements in the left sublist are sorted while the elements in the right sublist are still in random order. Take the first element from the right sublist and enter it in the correct place in the left sublist. When the length of the right sublist becomes zero, the list of numbers has been sorted.
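
A minimal sketch of these steps in C++ (illustrative only; the actual sort.cpp in this project follows the textbook pseudo-code and additionally counts comparisons and assignments):

// Insertion sort: a[0..i-1] is the sorted left sublist; a[i..n-1] is
// the unsorted right sublist.
void insertionSort(int a[], int n)
{
    for (int i = 1; i < n; ++i) {
        int key = a[i];                  // first key of the right sublist
        int j = i;
        while (j > 0 && a[j - 1] > key) {
            a[j] = a[j - 1];             // shift larger sorted keys right
            --j;
        }
        a[j] = key;                      // insert into its correct place
    }
}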

The best case for Insertion sort is sorting a nondecreasing list. For a list with n keys, the number of comparisons is n-1 and the run time is Θ(n). The worst case is sorting a decreasing list: the number of comparisons is n(n-1)/2 and the run time is Θ(n^2). For the average case, the run time is also Θ(n^2). Insertion sort is an efficient algorithm for small input sizes. Because only constant extra memory is required for every input size n, it is an in-place sorting algorithm.
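
As a quick check of the worst-case count (a worked equation, in LaTeX notation): when the list is decreasing, inserting the i-th key costs i comparisons, so the total is

\[
\sum_{i=1}^{n-1} i = \frac{n(n-1)}{2} = \Theta(n^2).
\]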

2) Quicksort

Quicksort uses the divide-and-conquer technique: divide the problem into smaller instances of the same problem, solve the smaller instances recursively, and finally combine the solutions to obtain the solution for the original input. Quicksort's strategy is to rearrange the keys to be sorted so that all the small keys precede the large keys. First split the array L[p..r] into two subarrays L[p..q] and L[q+1..r] such that each element of L[p..q] is less than or equal to each element of L[q+1..r]. Then sort the two subarrays by calling Quicksort recursively [Baase, 1996].
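
A minimal sketch of this split-and-recurse structure in C++ (illustrative only; the split below follows the L[p..q] / L[q+1..r] description with L[p] as the pivot, while the project's actual version is randomized, as discussed next):

// Split L[p..r] so that every key in L[p..q] <= every key in L[q+1..r].
int split(int L[], int p, int r)
{
    int pivot = L[p];
    int i = p - 1, j = r + 1;
    while (true) {
        do { --j; } while (L[j] > pivot);   // find a key <= pivot from the right
        do { ++i; } while (L[i] < pivot);   // find a key >= pivot from the left
        if (i < j) { int t = L[i]; L[i] = L[j]; L[j] = t; }
        else return j;                      // j is the split point q
    }
}

void quickSort(int L[], int p, int r)
{
    if (p < r) {
        int q = split(L, p, r);
        quickSort(L, p, q);                 // sort the two subarrays
        quickSort(L, q + 1, r);             // recursively
    }
}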

The running time of Quicksort depends on the splitting. Balanced splitting gives rise to a fast running time: the running time for a best case is Θ(n lg n) (for example, a list in which all keys are equal). Unbalanced splitting leads to a slow running time: the running time for a worst case is Θ(n^2) (for example, a sorted list when the split is always made around the first key). The average running time is Θ(n lg n). Generally, randomized Quicksort is used; it does not improve the worst case, but it makes worst-case inputs less likely. In this project, randomized Quicksort is used.
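
The dependence on splitting can be made explicit with the usual recurrence (a sketch in LaTeX notation, where k is the size of the left subarray produced by a split that costs Θ(n)):

\[
T(n) = T(k) + T(n-k) + \Theta(n), \qquad 1 \le k \le n-1 .
\]

A balanced split ($k \approx n/2$) gives $T(n) = 2T(n/2) + \Theta(n) = \Theta(n \lg n)$, while the most unbalanced split ($k = 1$ on every call) gives $T(n) = T(n-1) + \Theta(n) = \Theta(n^2)$.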

3) Heapsort

The Heapsort algorithm uses a data structure called a heap. If the keys to be sorted are arranged in a heap, then we can build the sorted list in reverse order by repeatedly removing the key from the root, filling the output array from the end, and rearranging the keys still in the heap to reestablish the heap property, thus bringing the next largest key to the root. This approach requires constructing a heap in the first place and then repeatedly rearranging the keys in the heap, so the algorithm consists of two phases, i.e., building a heap and the heap strategy. The running times for the best, worst, and average cases are all Θ(n lg n).
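
A minimal sketch of the two phases in C++ (illustrative only; the project's version follows the textbook pseudo-code and counts operations):

// Restore the max-heap property for the subtree rooted at i, where
// a[0..n-1] is stored as an implicit binary tree (children of i are
// 2i+1 and 2i+2).
void siftDown(int a[], int i, int n)
{
    while (2 * i + 1 < n) {
        int child = 2 * i + 1;
        if (child + 1 < n && a[child + 1] > a[child]) ++child;
        if (a[i] >= a[child]) return;
        int t = a[i]; a[i] = a[child]; a[child] = t;
        i = child;
    }
}

void heapSort(int a[], int n)
{
    // Phase 1: build a heap.
    for (int i = n / 2 - 1; i >= 0; --i) siftDown(a, i, n);
    // Phase 2: repeatedly move the largest key (the root) to the end
    // and re-establish the heap on the remaining keys.
    for (int end = n - 1; end > 0; --end) {
        int t = a[0]; a[0] = a[end]; a[end] = t;
        siftDown(a, 0, end);
    }
}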

All of the properties of the three sorting algorithms will be tested and evaluated in section 4.

3. Computation setup

1) Description of the computation code

The code for testing and evaluating the performance of the three sorting algorithms is written in C++. The header file "sort.h" contains the declarations of the three sorting methods. The source file "sort.cpp" contains the implementations of the three sorting algorithms, which follow the pseudo-code descriptions in the textbook. The header file "timer.h" declares a class Timer, which reports the CPU time in seconds used by the different sorting algorithms, and the source file "timer.cpp" contains the implementations of the class's member functions. The source file "main.cpp" is a menu-driven program for testing the three sorting algorithms: one option reads a file of numbers, another runs one of the sorting methods on the list of numbers, another prints the unsorted or sorted list, and the last quits the program.
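
The Timer class might look roughly as follows (a sketch only; the member names in the actual "timer.h" may differ). It is built on the standard clock() routine, which reports processor time:

#include <ctime>

// Reports CPU time in seconds between start() and stop().
class Timer {
public:
    void start() { t0 = std::clock(); }
    void stop()  { t1 = std::clock(); }
    double seconds() const { return double(t1 - t0) / CLOCKS_PER_SEC; }
private:
    std::clock_t t0, t1;
};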

Those files are kept in the directory "/grads/wliu/cs665/proj1" or "/grads/chen_q/665/proj1". There is a makefile that produces an executable file "runsort"; alternatively, run "g++ -o runsort main.cpp sort.cpp timer.cpp". Then type "runsort" at the UNIX command line to get the test results. All of our test runs are carried out on aspen.

2) Description of test data sets

To understand the three sorting algorithms thoroughly, we create several files of integers to test the sorting methods. The files have different sizes and contain data that are in order, in reverse order, in random order, or partially in order.

Data sets 1 to 7 are random data sets contained in the file "random.dat". They have sizes 2000, 10000, 20000, 50000, 80000, 100000, and 200000, respectively.

Data sets 8 to 10 are in-order data sets contained in the file "inorder.dat". They have sizes 2000, 20000, and 200000, respectively.

Data sets 11 to 13 are reverse-order data sets contained in the file "reverse.dat". They have sizes 2000, 20000, and 200000, respectively.

Data sets 14 to 16 are contained in "14.dat", "15.dat", and "16.dat", respectively. They are partially in order, with sizes 2000, 20000, and 200000.

4. Analysis and discussion
1) Comparison of the three algorithms

First we consider the performance for sorting data sets with large sizes (n >= 2000). We test the three sorting algorithms with the created data sets, and Tables 1, 2, and 3 present the results. In each table, the first row shows the name of the algorithm used; below that row, the columns give the data set size, the input order type, the CPU time the sorting takes, the number of comparisons used, and the number of assignments used. With those three tables, we can easily compare the three algorithms from several aspects. We first compare them on the special data sets case by case, and then turn to the general case using the random data sets. Note that, whenever we compare the algorithms, the same data sets are used.

Sorting in-order data is the best case for Insertion sort, but only an average case for both Heapsort and Quicksort. We can see, for the same data sizes (2000, 20000, and 200000), that Insertion sort uses the least CPU time and the fewest comparisons and assignments. Quicksort uses the most CPU time, the second most assignments, and the most comparisons. Heapsort uses the second most CPU time, the most assignments, and the second most comparisons. Also, if the data size increases 10 times, from 2000 to 20000 or from 20000 to 200000, the comparisons, assignments, and CPU time all increase about 10 times for Insertion sort, and about 11~20 times for Heapsort and Quicksort. In this case, Insertion sort has the best performance. The reason is that the running time for the best case of Insertion sort is Θ(n), while the running time for an average case of Heapsort or Quicksort is Θ(n lg n). For large data sets of this kind, Insertion sort is the fastest. In practice, however, this case rarely occurs. Note that randomized Quicksort makes worst cases less likely; in-order data would be a worst case for Quicksort if the pivot were always chosen as the first element.

Sorting reverse-order data is the worst case for Insertion sort and an average case for both Heapsort and Quicksort. For the same data sizes (2000, 20000, and 200000), Insertion sort uses the most CPU time and the most comparisons and assignments. Quicksort uses the second most CPU time, the fewest assignments, and the second most comparisons. Heapsort uses the least CPU time, the second most assignments, and the fewest comparisons. Also, if the data size increases 10 times, from 2000 to 20000 or from 20000 to 200000, the comparisons, assignments, and CPU time all increase about 100 times for Insertion sort (except that the CPU time for the largest size is out of range), and about 11~20 times for Heapsort and Quicksort. In this case, Insertion sort has the worst performance. The reason is that the running time for the worst case of Insertion sort is Θ(n^2), while the running time for an average case of Heapsort or Quicksort is Θ(n lg n). For large data sets, Insertion sort is the slowest. In practice, this case rarely occurs.

Sorting partially in-order data is an average case for all three sorting algorithms. For the same data sizes (2000, 20000, and 200000), Insertion sort uses the most CPU time and the most comparisons and assignments. Quicksort uses the second most CPU time, the fewest assignments, and the second most comparisons. Heapsort uses the least CPU time, the second most assignments, and the fewest comparisons. Also, if the data size increases 10 times, from 2000 to 20000 or from 20000 to 200000, the comparisons, assignments, and CPU time all increase about 100 times for Insertion sort (except that the CPU time for the largest size is out of range), and about 11~20 times for Heapsort and Quicksort. In this case, Insertion sort has the worst performance. The reason is that the average running time of Insertion sort is Θ(n^2), while the running time for an average case of Heapsort or Quicksort is Θ(n lg n). For large data sets, Insertion sort is the slowest. Unlike the previous two cases, this case does occur in practice. For Insertion sort, the performance on partially in-order data is better than on reverse-order data but worse than on in-order data; for Heapsort and Quicksort, there is little difference among the three kinds of data.

Now we turn to the performance of sorting random data sets, which gives the general performance of the three sorting algorithms. From the tables, Insertion sort uses the most CPU time and the most comparisons and assignments. Quicksort uses the second most CPU time, the second most comparisons, and the fewest assignments. Heapsort uses the least CPU time, the fewest comparisons, and the second most assignments. Also, if the data size increases 10 times, from 2000 to 20000 or from 20000 to 200000, the comparisons, assignments, and CPU time all increase about 100 times for Insertion sort (except that the CPU time for the largest size is out of range), and about 11~20 times for Heapsort and Quicksort. Sorting random data is an average case for all three sorting algorithms; the running times for Insertion sort, Quicksort, and Heapsort are respectively Θ(n^2), Θ(n lg n), and Θ(n lg n). Thus, Insertion sort is the slowest. Quicksort and Heapsort have the same asymptotic running time, but their hidden constants are different: the hidden constant of randomized Quicksort is so large that the algorithm is worse than Heapsort in almost every case. For the random data sets, Figures 1 and 2, 3 and 4, and 5 and 6 respectively show the number of comparisons, the number of assignments, and the CPU time as functions of the data size. In each figure, the performance of the three sorting algorithms is compared, and the results are the same as those described above. We can also see that, for Insertion sort, the number of comparisons, the number of assignments, and the CPU time grow as n^2, while for Heapsort and Quicksort they grow as n lg n. Insertion sort has the worst performance, Quicksort the second worst, and Heapsort the best.

Next we evaluate the performance for sorting data sets with small sizes (n < 80). Figures 7 and 8 respectively show the number of comparisons and the number of assignments as functions of the data size. Again, the tests use random data sets (also in the file "random.dat"), and the same data set is used for all three algorithms. For n < 20, Insertion sort uses almost the same number of comparisons as Heapsort. For n < 25, Insertion sort uses fewer comparisons than Quicksort and almost the same number of assignments as Heapsort and Quicksort. In all, for n < 25, Insertion sort performs better than Quicksort and is close to Heapsort.

Finally, we want to make clear why some authors state that Quicksort performs better than Heapsort. There is a small difference between the Quicksort they describe and the randomized Quicksort used in this project: in those books, the pivot for splitting is chosen as the first element. That algorithm is usually faster than Heapsort but occasionally has bad performance. For random data sets, Figures 9, 10, and 11 respectively show the number of comparisons, the number of assignments, and the CPU time as functions of the data size, comparing Quicksort and Heapsort. It is easy to see that the CPU time, comparisons, and assignments for this Quicksort are smaller than those of Heapsort. We also devise a new way to perform the random split in randomized Quicksort: instead of finding a random pivot element and exchanging it with the first element, we simply use the randomly chosen element as the pivot directly. This reduces the sorting CPU time and uses fewer assignments, although the CPU time for Quicksort is still larger than that of Heapsort.
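
A sketch of this variant in C++ (function and variable names are ours, and this is only one way to realize the idea): instead of swapping the randomly chosen key into position p and then splitting around L[p], the split is carried out directly around the randomly chosen key's value, which saves the swap's assignments:

#include <cstdlib>   // std::rand

void quickSortDirect(int L[], int p, int r)
{
    int i = p, j = r;
    int pivot = L[p + std::rand() % (r - p + 1)];  // random pivot, no swap
    while (i <= j) {
        while (L[i] < pivot) ++i;
        while (L[j] > pivot) --j;
        if (i <= j) {
            int t = L[i]; L[i] = L[j]; L[j] = t;
            ++i; --j;
        }
    }
    if (p < j) quickSortDirect(L, p, j);
    if (i < r) quickSortDirect(L, i, r);
}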

In all, our results are in good agreement with theory.

2) Stability of the three algorithms

Insertion sort is stable because the algorithm preserves the original order of equal elements. Heapsort is not stable: for example, if all of the elements in the heap are equal, the last element becomes the new root after the first removal of the root, so the original order is not preserved later on. Quicksort is also not stable: for example, if all of the elements in the list are equal, the first and last elements are exchanged in the first split, so the original order is again not preserved.

3) Advantages of the three algorithms

Insertion sort is the easiest algorithm to program and debug, and it is efficient for small input sizes. For n < 20, its performance is better than Quicksort's: both the number of comparisons and the number of assignments are smaller. It is an in-place sorting algorithm.

Heapsort's running time is Θ(n lg n) in all cases, which is faster than Θ(n^2). It is especially good for large inputs. Its number of comparisons and its CPU time are the smallest among the three sorting algorithms. It is an in-place sorting algorithm.

Quicksort's running time for an average case is Θ(n lg n), which is also faster than Θ(n^2). For large input sizes, its performance is better than that of Insertion sort. Randomized Quicksort makes the worst case (with running time Θ(n^2)) less likely. In practice, the hidden constant of the standard (non-randomized) version is smaller than that of Heapsort.

4) Disadvantages of the three algorithms

Insertion sort is not efficient for large input sizes: both the number of comparisons and the number of assignments are the largest among the three, and the algorithm takes the most CPU time. This behavior is hard to avoid, because even the average running time is Θ(n^2).

Heapsort is more difficult to program and debug than Insertion sort, and it is inefficient for sorting small data sets. Generally, Heapsort uses more assignments than Quicksort.

Quicksort is more difficult to program and debug than Insertion sort; in particular, it takes more time to debug the recursive function. The hidden constant associated with randomized Quicksort is so large that it results in an algorithm worse than Heapsort in every case [Brassard and Bratley, 1996]. Quicksort uses more comparisons and more CPU time than Heapsort. It is also not an in-place algorithm.

5) Strategy of application of the three sorting algorithms

For small data sizes, we can use Insertion sort. It is easy to program and runs faster for small inputs; the other two algorithms have no such advantages.

For large data sizes, we can use Heapsort. Its running time is always Θ(n lg n), whereas Insertion sort tends toward a running time of Θ(n^2) and Quicksort cannot avoid its Θ(n^2) worst case. The alternative is to use Quicksort.

Quicksort is a preferred sorting method in applications that require an algorithm that is usually very fast but on occasion can have a longer running time, so randomized Quicksort is used, as in this project. However, the hidden constant associated with this improved version of Quicksort is so large that it results in an algorithm worse than Heapsort in every case [Brassard and Bratley, 1996].
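
This strategy can be summarized in a small dispatching routine (a sketch only; the threshold of 25 comes from our small-size tests, and the two functions refer to the array-based sketches in section 2):

void insertionSort(int a[], int n);   // as sketched in section 2
void heapSort(int a[], int n);        // as sketched in section 2

void sortKeys(int a[], int n)
{
    if (n < 25)
        insertionSort(a, n);   // fastest and simplest for small inputs
    else
        heapSort(a, n);        // guaranteed Theta(n lg n) for large inputs
}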

6) Importance of the number of comparisons and the number of assignments

From our tests, both the number of comparisons and the number of assignments contribute to the sorting running time; the roles they play are determined by the costs of a comparison and an assignment. Basically, the number of comparisons is more important, as it directly determines the efficiency of a sorting algorithm. In this project, for the same data sets, Heapsort uses more assignments but fewer comparisons than Quicksort, and its CPU time is smaller than Quicksort's, which is a very good example. Still, for a given sorting algorithm, both parameters should be considered when estimating the running time.

7) CPU time issues

In this project, CPU time gives a general criterion for comparing the three sorting algorithms. It tells which sorting algorithm is faster, but it does not tell exactly how much CPU time is used by the sorting operations themselves, for several reasons. The first is the extra time spent loading the arrays; the second is the extra time spent updating the two global variables. As long as this extra time is much smaller than the time used in comparisons and assignments, the measured CPU time is accurate.

The choice of data type for the two global variables, i.e., the number of comparisons and the number of assignments, also affects our calculations significantly. When we use the int or float type, those counts go out of range once the data size exceeds 100000. When we use a long long integer type, the calculation would take about 60 CPU hours for Insertion sort with a data size of 200000, which is hard to carry out. In the end, the two global variables are declared as type double.
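
A sketch of how the counters are declared and updated (variable names are illustrative; the real code instruments all three sorting routines in sort.cpp):

// Declared as double: counts are exact integers up to 2^53, far beyond
// the ~2*10^9 limit of int, which is exceeded once the data size
// passes 100000 for Insertion sort.
double numComparisons = 0.0;
double numAssignments = 0.0;

void insertionSortCounted(int a[], int n)
{
    for (int i = 1; i < n; ++i) {
        int key = a[i];            numAssignments += 1;
        int j = i;
        while (j > 0) {
            numComparisons += 1;   // count every key comparison
            if (a[j - 1] <= key) break;
            a[j] = a[j - 1];       numAssignments += 1;
            --j;
        }
        a[j] = key;                numAssignments += 1;
    }
}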

5. Conclusions
1) Our results agree with theory very well. The properties of the three sorting algorithms (best case, worst case, and average case) are quantitatively confirmed.

2) Both the number of comparisons and the number of assignments contribute to the sorting running time. The number of comparisons plays a more crucial role in sorting speed.

3) For small size data sets, Insertion sort is more efficient than Quicksort and Heapsort.

4) For large size data sets, Heapsort is better than the other two and is the preferred choice. In such cases, Insertion sort must be avoided.

5) Randomized Quicksort makes worst cases less likely. Because its average running time is Θ(n lg n), it still has applications. In our project, a new random split process makes Quicksort noticeably faster.

6) For different problems, we should be careful to choose the data type (such as int, float, or double) used in our calculations.

References
Baase, S., Computer Algorithms, second edition, Addison-Wesley, 1996.

Brassard, G., and P. Bratley, Fundamentals of Algorithmics, Prentice-Hall, Englewood Cliffs, NJ, 1996.

6. Attachments
1) Tables 2) Figures 3) Source code