\chapter{Linear Octree-based Parallel Collision Detection Algorithm} \section{Introduction} As presented in chapter 2, collision detection is a multi-faceted problem requiring individual solutions to subsets of the collision detection problem. Current collision detection algorithms use a collection of algorithms to solve subset problems. The relationships between collision detection algorithms and how they can be used in conjunction with one another is best described as a pipeline. By optimizing each algorithm used in a collision detection pipeline, fast solutions to the general collision detection problem can be devised. This chapter describes the data structures and algorithms explored in the development of a parallel collision detection algorithm designed to solve the all-pairs problem. This chapter is organized into four sections. In the first two sections, two octree based data-structures considered for use in a parallel collision detection algorithm are described. In the third section a new parallel collision detection algorithm for solving the all-pairs problem is presented. Finally, an algorithmic analysis of the new algorithm is presented. \section{Distributed Octree Data Structure} In the process of designing a parallel collision detection algorithm, two strategies were explored, the first of which involved pursuing a distributed octree implementation that would facilitate parallel processing for collision detection. A distributed octree is an octree data-structure in which the octants, and the objects they contain, are assigned to separate processors (ref fig). In this configuration, objects can be operated on concurrently by the processors that manage the octants within which the data are contained. A distributed octree data structure has several qualities that make it appropriate for implementing a parallel collision detection algorithm. \begin{enumerate} \item Decomposition of an octree hierarchy for distribution is easy to understand and implement as a modified pointer-based tree. In the description of pointer-based octrees from Chapter 2, an octree node stores pointers to its child nodes. To facilitate the distribution of octant nodes to multiple processors, pointers used for referencing child octants are generalized, enabling them to reference a processor that manages octants. From the implementor's perspective a distributed octree, the generalized pointer can be viewed as addresses used for communicating with processors that manage octants. \item Distributing an octree across a cluster of networked computers is conceptually the same as distributing an octree across the processors of a shared-memory, multi-processor computer. As a function of the structure of an octree, clearly defined boundaries between regions of space are created. As a result of the logical structure of an octree, octree data structures have clearly defined regions by which data are segregated. In an octree, these boundaries are defined by the volume of an octant, while in an octree data structure, the boundaries are defined by the tree nodes in which data are stored. This feature allows tree nodes to be physically separated without compromising the logical structure of the octree. To distribute an octree data structure over a cluster of networked computers, pointers to octants are allowed to be a computer network address where child octants are located. \item When an octree is distributed across a cluster of computers, the aggregate size of the octree data-structure can exceed the largest octrees that can be created on a single computer. This allows for larger octree to be constructed. The maximum size of an octree which can be created on a single computer is limited by the amount of random access memory needed for storing the octree data-structure. By distributing octants to several computers, the memory from each computer contributes to the total space available for storing the octree data-structure. \item Distribution of data processing is a function of storing data in octants. As objects are placed in octants that bind their position and volume, they are accordingly distributed among the processors that manage the octants. \item The complexity of distributing an octree across multiple processors does not grow as a function of the number of processors over which it is distributed. As a result of the structure of an octree, once a program has been written to handle a single distributed octant, the same program should scale to any number of distributed octants. \end{enumerate} The layout of a distributed octree can be visualized as follows: (insert picture of octree sub-hierarchies on processors) The concept behind the first attempted parallel collision detection algorithm was to use a distributed octree to partition and store objects on multiple processors. Each processor would then perform collision detection between objects they stored locally. However, after studying distributed octree implementations, several undesirable qualities were revealed that make distributed octrees unsuitable for use in a parallel collision detection algorithm. The two undesirable qualities are that distributed octrees require load balancing to remain efficient and, in some cases, distributed octrees must store the same object on multiple processors to build a tight fitting bounding volume for the object. \subsubsection{Load Balancing} One of the undesirable qualities of distributed octrees which makes them unsuitable for parallel collision detection is that they need to be load-balanced in order to remain efficient. Load balancing is necessary when objects become concentrated to small regions of space. When this occurs, the objects also become concentrated to a small number of octants in the octree. In the case of distributed octrees, processors managing octants where concentrations of objects exist have a disproportionately high workload. As the level of concentration increases, performance gained through parallelism decreases. For a distributed octree to remain efficient, each processor must be assigned an equal number of objects. This can be accomplished by load-balancing the distributed octree over the available processors. Load-balancing is performed using three operations: dynamic octant splitting, octant migration, and octant consolidation. Dynamic octant splitting is the process of subdividing octants at runtime to create additional levels in an octree. Dynamic octant splitting is performed by subdividing the volume of an octant into eight new octants and adding them to the octree. After which objects contained in recently split octants are inserted into the newly created octants (ref figure). The goal of dynamic octant splitting is to create additional subdivisions of a volume to further segregate objects. If dynamically splitting an octant fails to adequately partition a concentration of objects, the procedure is repeated until the objects are sufficiently segregated. Octants containing high concentrations of objects are candidates for dynamic octant splitting. (insert figure of dynamic node splitting) Octant migration is the process of moving octants from one processor to another (ref figure). To move an octant between processors, the definition of the octant and the data the octant contains are reassigned to a processor. After an octant is reassigned, all pointers that reference the octant are updated to reflect the change of management. Depending on the architecture of the parallel computer a distributed octree is implemented for, octant migration can involve as little as reassigning octant pointers, as is the case on a multi-processor, shared-memory computer. When a distributed octree is implemented for a networked cluster of computers, octant migrations involve transporting octant definitions and data across a network connection. Combined, dynamic octant splitting and octant migration can be used to evenly distribute data throughout a distributed octree. As a result of repeated dynamic octant splitting and octant migration, the structure of a distributed octree becomes fragmented over processors and memory. Fragmentation reduces the performance of octree operations and consumes excessive amounts of memory. To repair a fragmented octree, octant consolidation is used to remove unnecessary octants from the octree. Octants are consolidated based on the number of objects that exist in a branch of the octree. If the child of an octant contains a small number of objects, the objects are moved into the parent and the child is removed. This process reduces the memory consumed by the data structure, decreases the number of levels in the octree, and decreases the separation distance between octant nodes in the parallel computer. An example of a distributed octree is described in [Adaptive, Multi-resolution Visualization of Large Data Sets Using a Distributed Memory Octree]. Although not designed specifically for collision detection, the distributed octree implementation discussed in this article can also be used for parallel collision detection. A distributed octree is used for storing and manipulating scientific data sets for visualization. By distributing a data set across multiple processors, operations on data can be performed in parallel. Due to the large amounts of data, a distributed octree is well suited for this type of application. \subsubsection{Object-Octant Membership} Object-octant membership is a term that will be used to describe which octants an object intersects. The other undesirable quality of distributed octrees that makes them unsuitable for parallel collision detection is how they manage object-octant membership. The fundamental purpose of using an octree is to segregate objects based on their position. Octrees facilitate this purpose by creating a spacial relationship between objects based on their position. In octrees, a spacial relationship exists between two objects that occupy a common octant. Several methods have been developed to determine the object-octant membership of an object. In this discussion, a search method will be used to illustrate the problems with distributed octrees. As discussed in chapter 2, a typical strategy for inserting data into an octree is performed by using a recursive traversal of an octree beginning at the root. The search recursively descends the branches of the octree seeking octants that intersect the object. Depending on how the algorithm terminates recursive descents into the octree, upon completion of the insertion routine, a reference to the object will have been added to nodes in the octree data structure that correspond to the location and volume of the object. The octants that correspond to the nodes in which a reference to the object was added define the object-octant membership of the object. As seen in (ref fig), when the volume of an object straddles octants managed by different processors, the object must be stored by each processor that manages octants the object intersects. Storing objects on multiple processors is problematic for distributed octrees because, maintaining the state of an object requires coordinating the processors on which the object is stored. Although the difficulties surrounding distributed octrees can be overcome, the cost of the solutions in terms of performance outweigh the benefit gained from parallelism of collision detection. For this reason, distributed octrees are unsuitable for use in a parallel collision detection algorithm. \section{Linear Octree} The second data-structure considered for use in a parallel collision detection algorithm was a linear octree. A linear octree is not a tree-like data-structure, but rather an encoding scheme derived from the structure of an octree. Codes generated using a linear octree encoding scheme uniquely identify voxels positioned within an octree. These codes are used for describing three dimensional objects volumetrically based on the voxels that intersect the object. The set of codes that describes the volume of an object is called a linear octree. A linear octree is stored in list data-structure. The basic unit of a linear octree is an octal code, also called octnode code. An octal code is a sequence of numbers that, when read left to right, describe a path from the root of an octree to an octant. The numbers used in the sequence of an octal code correspond to the indices of octants, ranging from 0 to 7. The position of a number in sequence of an octal code corresponds to a level in the octree. For example, the octal code \{ 3, 1, 4 \} refers to an octant located on the third level of an octree (ref fig) and can be found by recursively descending into octants 3, followed by 1, and finally by 4. The sequence of numbers in an octal code can be conveniently stored in an unsigned integer by using the digit positions to store numbers from the octal code sequence. For example, the sequence \{ 3, 1, 4 \} can be represented by the integer 314. This encoding method was first introduced by Gargantini in [Gargantini - Linear Octrees for Fast Processing of three-dimensional objects: Computer Graphics and Image Processing, v. 20, no. 4, p 365-374] and is also described in [Set operations between linear octrees] [Samet] and [Finding the Voxels Shared by a Ray and a Linear Octree]. Linear octrees have several qualities that make them useful for parallel computing environments, the most useful of which is that a linear octree preserves the hierarchical nature of the data represented while avoiding the need for a pointer-based tree. Furthermore, the data representation of a linear octree is convenient for transporting by means of interprocess communications. Another useful quality of linear octrees is a unique property that octal codes exhibit. By sorting a list of octal codes represented as integers in ascending order, the resulting sequence is the post order traversal of an octree [samet] (ref fig). This property of octal codes is useful because it provides a method for merging disparate linear octrees into a single linear octree and is the cornerstone of the new algorithm presented in the next section. (insert fig showing post order traversal of octree) \section{Parallel Linear Octree Collision Detection Algorithm} In this section a new collision detection algorithm is proposed that uses a linear octree for solving the all-pairs collision detection problem. The algorithm has O(n lg n)/k performance, with n being the number of objects and k being the number of processors. The strategy behind the proposed algorithm is to construct the linear octrees of n objects in parallel using k processors and then merge the resulting octrees into a single linear octree on a single processor. In the description of this algorithm, the following assumptions are made: - all objects are cubes - the parallel program consists of one master process and k slave processes - each slave process either shares a single octree or all processes have an exact copy of the same octree - the number of objects is an even multiple of the number of processes \subsubsection{Algorithm outline} \begin{enumerate} \item The master process divides a list of n objects by k processes and distributes a list of n/k objects to each process. The master process then waits for all slaves to complete processing before execution is resumed. \item Each slave waits to receive a list of objects from the master. When the list of objects is received, each slave builds a linear octree for each object. The resulting octal-codes are paired with an index of corresponding objects from the main list of objects. All resulting octal-code/object-index pairs are stored in a single list. \item Each slave process sorts their list of octal-code/object- index pairs by the integer representation of octal-codes in ascending order. \item Each slave sends its list of octal-code/object-index pairs back to the master process. The slave process is complete. \item The master process receives k list of octal-code/object- index pairs from the slave processes. \item The master process merges k lists of octal-code/object- index pairs into a single list. \item The master process iterates through the list of octal- code/object-index pairs in ascending order and finds all pairs of octal-codes that share octants. The indices of objcts that correspond to octant-codes that share octants are stored in a list \end{enumerate} \subsection{Algorithm Pseudo code} The algorithm can be summarized in two pseudo code programs: one for the master process and one for the slave processes. \begin{verbatim} Master Process() Begin integer k := number of slaves address slave[ 0 to k ] := locations of slave processes cube object_list[ 0 to n ] := list of n objects Pair slave_lists[0 to k][] := NIL Pair master_list[] := NIL Pair colliding_pair_list[] := NIL octalcode smallest_octalcode := NIL // Step 1 sublist_size := sizeof(object_list[]) / k index := 0 For each x := from 0 to k { Send( slave[x], object_list[ index to (index + sublist_size) ] ) index := index + sublist_size } // Step 5 num_octal_codes := NIL For each x := from 0 to k { Wait(slave[x]) Receive(slave[x], slave_list[x]) } // Step 6 master_list[] := MergeSortedLists(slave_lists[]) // Step 7 index := 0 For( object1 := 0 ... sizeof(master_list[]) - 1 ) { For( object2 := object1 ... sizeof(master_list) ) { if( CommonPath( master_list[object1], master_list[object2] ) == true ) { colliding_pair[index] := object1,object2 index := index + 1 } else { break from inner for-loop } } End Slave Process Begin cube object_list[] := NIL octalcode linear_octree[] := NIL Pair slave_list[] := NIL address master := address of master process // Step 2 Wait( master ) Receive( master, object_list ) For each x := object_list[] { linear_octree[] := SearchOctree(object_list[x]) For each y := linear_octree[] { slave_list[] append pair(linear_octree[y],x) } } Sort( slave_list[] ) Send( master, slave_list ) End Subroutine Send( address, data ) { Send "data" to a process at "address" } Subroutine Wait( address ) { Wait for communications from the process at "address" } Subroutine Receive( address, data ) { Receive "data" from the process located at "address" } array Subroutine MergeSortedLists( "list of sorted lists" ) { Merges a "list of sorted lists" into a single "list" return "list" } Boolean CommonPath( octalcode1, octalcode2 ) { Determines if the octal codes share an octant. Two octal codes share an octant if both codes define octants on the same path through the octree. Return true if two octants are shared, otherwise return false. } array SearchOctree( cube ) { Determine the object-octant membership of an object return the octal codes of the octants this object intersects. } Sort( octant[] ) { Sorts a list of octants in ascending order } \end{verbatim} \subsubsection{Bounding Cube Representations} As this algorithm is used to solve the all-pairs collision detection problem, objects are approximated with a bounding volume. In this algorithm, bounding cubes are used. Bounding cubes are used for two reasons. First, testing for intersection between two cubes is efficient. In the worst case, intersection between two cubes can be found with a maximum of 48 comparisons. Efficient intersection testing is important because the octree search algorithm performs intersection tests between cubes extensively. Second, cubes can be stored in memory efficiently. Storage efficiency is important because it reduces the overall memory requirements of the collision detection algorithm. \subsubsection{Octree Search} To build a linear octree for an object, the slave processes search an octree for a set of octants that, combined, form a the bounding cube around the bounding cube representation of the object. \subsubsection{Octal code comparisons} The CommonPath subroutine is used to determine if two objects are colliding by determining if they share octants. The CommonPath subroutine is called "CommonPath" because octal codes that define similar paths through an octree have overlapping octants. The general algorithm for comparing two octal codes is a piecewise equivalency test between the sequences of numbers stored as octal codes. If any of the numbers between the sequences differ, then the octal codes identify disparate regions of space. Otherwise, the octal codes identify overlapping regions of space and the corresponding objects are considered for further collision testing. It is possible for two octal codes to define paths of different lengths. When this occurs, then octant indices are compared up to the maximum length of the the shorter octal code. In practice, the implementation of the general algorithm exploits the integer representation of octal codes. When two octal codes represented as integers differ in length, the longer of the two is truncated using a modulus and a subtraction operation. The octal codes can the be tested for equivalence using an integer comparison. \subsubsection{Post-order Traversal of a Linear Octree} The feature of linear octrees that makes the proposed algorithm efficient is the property of octal codes that, when sorted in ascending order, result is a list that defines a post order traversal of an octree. This feature is used to merge the linear octree representations of multiple objects into a single space. This is done by concatinating the linear octree representations of multiple objects and sorting the resulting list of octal codes. The sorted list of octal codes is used to determine all colliding pairs of objects using the CommonPath function. To search for all colliding pairs of objects, each octal code is tested for a common path with all subsequent octal codes in the list until an octal code identifies a non-overlapping region of space. At this point, no other octal codes need to be tested because no additional overlapping octants are represented in the list of octal codes. This is a result of the post-order traversal ordering of the list. \subsection{Analysis of Algorithm} On the master processor, steps 1 and 5 of the algorithm run in linear time as a function of the data distributed to and collected from the slave processes. On each slave, because every object is passed to the search subroutine, the search time becomes a constant factor. Therefore, searching for n objects in step 2 runs in linear time as a function of the number of objects. Also, on each slave, sorting n resultant octal codes takes $O(n log n)$ time in the worst case. In step 6, merging k lists of sorted octal codes is done in linear time. Finally, step 7 runs in $O(n^2)$ time in the worst case but, if objects are not allowed to remain in an intersecting state, then the running time is $O(n)$ on average as a result of the post-order traversal ordering of the list. Based on the individual run times of each step, the runtime performance of this algorithm will be bound by the sort performed by each slave processor. Therefore, the expected runtime of this algorithm is $O( n log n ) / k$.