\chapter{Parallel First Stage Collision Detection Algorithm}

\section{Introduction}
	\hsp{8mm} 
        As presented in chapter 2, collision detection is a
        multi-faceted problem requiring individual solutions to 
        subsets of the collision detection problem. Modern collision
        detection algorithms use a collection of algorithms to solve
        subset problems. The relationship between subset algorithms 
        can be conveniently described as a pipeline. As cited in
        [optimizing the collision detection pipeline], in order to
        achieve fast collision detection, each stage of the pipeline
        must be optimized.

        To support of the goal of providing algorithms suitable for
        use in a collision detection pipeline, this chapter presents
        a new octree-based first stage collision detection algorithm.
        The goal of this chapter is to demonstrate a solution to
        the all-pairs collision detection problem with a serial
        performance equivalent to the fastest existing first stage
        algorithms and parallel performance that delivers perfect
        speedup over k processors.

        This chapter will be organized into three sections. First,
        a description of the first attempt at developing a parallel
        first stage algorithm is presented.  Second, a description
        of the octree data structure used to facilitate a new algorithm
        will be described. Finally, the new algorithm is be presented.

\section{Distributed Octrees}
	\hsp{8mm} 
        As cited in [samet - Quadtree and Hierarchical Data
        Structures], octrees are an inherently parallel data structure.
        Due to their hierarchical nature, octrees can be easily be
        distributed across multiple processors and memory locations.
        This can be done by assigning ocrtree subdivisions, called
        octants, to separate processors for local management.  This
        type of distribution results in multiple processors
        micro-managing their own octrees which are themselves part
        of a larger octree. Distributing an octree this way is
        several desirable for several reasons.  First, decomposition
        of the hierarchy is conceptually easy to understand and
        implement as using a pointer based as octree described in
        chapter 2. Second, distribution of sub-hierarchies across a
        network of computers in conceptually the same as across
        processors in a shared memory machine.  Third, when distributed
        across a network of machines, the aggregate size of the
        data structure can exceed the memory capacity of a single
        machine. This allows for deeper octrees to be constructed
        for representing larger volumes with finer granularity.
        Fourth, distribution of work load is implicit with the
        insertion and movement of data. As data is inserted into
        the tree, the data is implicitly assigned to processors
        managing the region of space the data occupies.  Finally,
        the complexity of distributing the octree across processors
        does not grow as function of the number of processors,
        again, this is due to the hierarchical nature of octrees.

	The physical layout of this type of octree can be visualized as follows.

       \begin{figure}
        \epsfxsize=4in
        \vspace{2in}
        %\centerline{\epsffile{figs/rconst.ps}}
		(insert picture of octree sub-hierarchies on processors)
        \caption{Run times ($r=20$, $n$ is varied)} \label{fig:fig2}
        \end{figure}


        To support traversal of the octree hierarchy, processors
        are given "pointers" to parent or child octants managed
        that reside on another processor. Traversal of a distributed
        octree is done almost exactly as with a normal octree. The
        difference occurs when a path bridges octants between two
        processors. In this case, the pointer is used to determine
        which processor manages the destination octant and traversal
        is continued by the destination processor.

\section{Drawbacks of Distributed Octrees}

	\hsp{8mm} 
        Distributed octrees was the first data structure explored
        in this project in an attempt to develop  a parallel collision
        detection algorithm. However, experiments with the distributed
        octree revealed several drawbacks showing it to be unsuitable
        for parallel collision detection.

        Paramount of all issues making distributed octrees unsuitable
        for parallel collision detection is the issue of object
        distribution.

        The issue of object distribution becomes a concern when
        object become tightly concentrated to a region within the
        space enclosed by a distributed octree. These circumstances
        cause processors managing octants where concentrations of
        objects reside to have a disproportionately high workload
        compared to other processors in the system.  This in turn
        causes an overall degradation in performance as the time
        to perform collision detection converges on the time needed
        for the heaviest loaded processor to complete. In the worst
        case, all objects enter an octants managed by the same
        processor causing the entire system to degrade to serial
        performance. To avoid this situation load balancing schemes
        can be used. Octree load balancing operations typically
        involve dynamic octant splitting (splitting octants at
        runtime) and octant migration (moving octants between
        processors) [Freitag]. However, the costs associated with
        load balancing can be high, especially when the octree is
        distributed over a network of computers.

        An example of this type of distributed octree is presented
        in [Adaptive, Multiresolution Visualization of Large Data
        Sets Using a Distributed Memory Octree]. Although not
        designed specifically for collision detection, the algorithms
        and implementation issues addressed by [Freitag] are exactly
        the same if implemented for distributed collision detection.
        In [Adaptive, Multiresolution Visualization of Large Data
        Sets Using a Distributed Memory Octree] a distribute octree
        implemented for a massively parallel processor (MPP) computer
        is used to store scientific datasets for visualization. By
        distributing large data sets across multiple processors,
        operations on data can be performed in parallel. Due to the
        large amounts of data, a distributed octree is well suited
        for this application.

        The second issue of concern when using distributed octrees
        is how to determine object-octant membership. The purpose
        of using the octree is to determine a spatial relationship
        between objects. A relationship between two objects exist
        if both objects are members of an octant. Determining which
        octants an object is a member of is accomplished recursively
        searching the tree for octants with bounding volumes that
        intersect the object. Two different search methods can be
        used to perform this operation. In both, the search begins
        at the root of the octree. In the first, the search recursively
        descends to octants that fully enclose the object. In the
        second, the search recursively descends on all octants that
        intersect the object.  The difference between the two
        searches is that the first seeks a single octant that most
        tightly fits the object, while the second seeks a set of
        octants that combined form a tight fitting volume around
        the object. Both searches can be applied successfully in
        appropriate applications of octrees, distributed or not,
        but when applied to distributed octrees for solving N-body
        collision detection problems, both are unsuitable.

        The advantages of the first search are that it terminates
        with the fewest number of intersection tests between an object
        and octree octants and the search result can be stored 
        as a single cube. This is because only a
        single search path is followed through the tree. As a result,
        the algorithm is fast. The disadvantage is that the search almost
        never finds an optimally tight fitting bounding volume from
        the octree.  This is because octants are designed to evenly
        subdivide space, not to optimally enclose arbitrarily
        positioned objects. As objects move through octree space
        many intersections between the object and octants occur
        simultaneously. When multiple intersections exist, the tightest fitting
        octant that encloses an object is the parent of all 
        octants the object intersects. In the worst case the object
        intersects a two first level octants.  When
        this happens, the tightest bounding fit is the cube defined
        at the root of the octree. In octree space, this intersection
        is defined along the primary Cartesian axis.

	\begin{figure}
        \epsfxsize=4in
        \vspace{2in}
        %\centerline{\epsffile{figs/rconst.ps}}
( insert picture of object intersecting primary axis using first search method )
       \caption{Run times ($r=20$, $n$ is varied)} \label{fig:fig2}
        \end{figure}


        The advantage of the second search is that it finds the
        tightest bounding fit that an octree's resolution can produce.
        This is the same algorithm used in CSG to build a voxel
        representation of an object. The disadvantages of this
        search are that many comparisons between objects and octants
        are made and multiple octants are used to enclose the object.
        This is because all paths through the octree containing
        intersecting octants are followed and stored. As a result
        this search is both much slower and the results require
        more memory to store than required by the first search.

      \begin{figure}
        \epsfxsize=4in
        \vspace{2in}
        %\centerline{\epsffile{figs/rconst.ps}}
	(insert picture of object in octree using second search method )
       \caption{Run times ($r=20$, $n$ is varied)} \label{fig:fig2}
        \end{figure}


        When either of these searches are used in conjunction with
        a distributed octree for N-Body collision detection, problems
        with their implementation develop. In the first method,
        because an optimal bounding volume is almost never found,
        the all-pairs problem can't be efficiently solved. This is
        especially true when the bounding volume determined is the
        root of the octree. In the second method, although a tight
        fitting bounding volume is determined, problems with state
        management between processes develop. Because multiple
        octants are used to enclose an object, its possible for the
        octants enclosing the object to be spread across many
        processors. Therefore, any change of state the object
        undergoes must be propagated to all processors that have a
        record of the object. Although this problem is conceptually
        solvable, it adds another layer of complexity on top of
        octree load balancing.

       \begin{figure}
        \epsfxsize=4in
        \vspace{2in}
        %\centerline{\epsffile{figs/rconst.ps}}
	(insert picture of object in octree across multiple processors)
       \caption{Run times ($r=20$, $n$ is varied)} \label{fig:fig2}
        \end{figure}


\section{Evaluation of project direction}

	\hsp{8mm} 
        One of the challenges facing designers who use octrees is
        how to realize benefits of their use while avoiding pitfalls
        that surface in their implementation. After evaluating
        distributed octrees and discovering the problems that arise
        in their implementation, a clear picture of how octrees can
        be successfully applied to applications was formed. Distributed
        octree are suitable for applications that need extent their
        capabilities of managing volumes of space beyond memory and
        processor bounds of a single computer. For example, to
        perform collision detection between n objects where the
        space required to store n objects exceeds the memory capacity
        of a single machine, a distributed octree can be used to
        partition the space the objects occupy. Then, after any
        load balancing operations have completed to equally
        distribute the number of objects across available processors,
        an existing fast serial N-Body collision detection algorithm
        can be run by each processor.

        However, the goal of this project was to develop a parallel
        N-Body algorithm that competes with existing serial algorithms.
        Therefore, it was decided that distributed octrees are 
        inappropriate for this application.

        Fortunately, while pursuing a distributed octree, alternative
        methods of representing octrees were explored. One of these
        methods eventually led to the foundation of the new algorithm.

\section{Linear Octree}

	\hsp{8mm} 
        While trying to implement an octree that can be conveniently
        shared between processes, alternatives to pointer-based
        octrees where explored. One of the implementations explored
        is a linear octree.

        Linear octrees offer an alternative pointer-based trees for
        storing data in an octree. Consider an octree where each
        octant is assigned a unique key which identifies it's
        position and size. These keys, called octal codes, can then
        be assigned to data that are members of the corresponding
        octants. A collection of octal codes s the linear octree
        representation of data.

        Octal codes are constructed by first assigning labels to
        the eight possible octant positions from 0 to 7. A full
        encoding of an octant is built by recursively descending
        from the root to the octant. Along the way, the labels of
        octants traversed along the path are stored in a list. The
        resulting sequence is the octal code and defines a root-based
        path through the octree. The size of the destination octant
        can be inferred from the octal code as a function of it's
        length and the size of the octree root.

        The sequence of numbers in an octal code can be conveniently
        stored in an unsigned integer where, each digit position of
        the integer stores an octant label. Using integers obviously
        limits the depth of an octree that can be encoded but, as we will
        see shortly, there are compelling reason to use integers.

        This type of encoding was first introduced in [Gargantini
        - Linear Octrees for Fast Processing of three-dimensional
        objects: Computer Graphics and Image Processing, v. 20, no.
        4, p 365-374] and is also described well in [Set operations
        between linear octrees], [Samet] and [Finding the Voxels
        Shared by a Ray and a Linear Octree].

        An interesting property of octal codes exhibit is that, when
        sorted in ascending order, the resulting sequence is a
        postorder traversal of the octree [samet]. Although [samet]
        makes this observation about quadtrees, quadtrees are closely
        related to octrees and the principal applies equally well.

        The implication of this property is that, given a set of
        objects, the order these objects would be found in an octree
        by successively searching each path can be determined by
        sorting their corresponding octal codes.

\section{Parallel Octree Based N-Body Collision Detection Algorithm}

	\hsp{8mm} 
        While trying to implement an octree that can be conveniently
        Learning from the mistakes made while pursuing an efficient
        parallel distributed octree, a more successful approach to
        applying octrees to the N-body collision detection problem
        was made possible. 

        The basic idea behind the new N-body collision detection
        algorithm is to use an octree only for establishing the spatial
        relationship of objects, not as a database for objects.
        This is done by adopting a CSG interpretation of octrees
        and using them coarse bounding representations
        of objects. CSG representations are then stored
        as a linear octree. The spatial relationship between objects
        can then determined by comparing octal codes. This process
        is done in three steps:

	\begin{enumerate}
	\item Construct CSG representations of N objects 
                using a full octree. The resulting list of octal  
                codes is the linear octree representation of the objects
	\item Sort the list of octal codes
	\item Iterate through the list codes comparing each
	        code with successive octant codes until a stopping
      		condition occurs.
	\end{enumerate}

	Stopping condition are:
	\begin{enumerate}
	\item the end of the list has been reached
	\item the octant codes define divergent paths through the octree
	\end{enumerate}

\subsection{Step 1: Modified CSG algorithm and octree search stopping conditions}

	\hsp{8mm} 
        Step 1 of the algorithm is used to establish the region of
        space an object occupies in the octree.  This is done by
        searching the octree for octants that intersect the object.
        Important aspects of this step include the type of object
        bounding representation used in the octree search, the type
        of search performed, and the stopping conditions of the
        octree search.

	\subsubsection{Cube Bounding Representations}
	\hsp{8mm} 
        As described in chapter 2, first stage collision detection
        algorithms often approximate objects with bounding
        representations. This algorithm uses bounding cubes 
        to approximate objects when searching an octree. Cubes
        are used for several reasons:

		\begin{enumerate}
		\item testing for intersection between octants, which are also
                cubes, and bounding cubes is fast

		\item cubes are convex and map directly to the
                hierarchical structure of the octree. This guarantees
                that the set of all points bound by the cube's volume 
                are also bound by.

		\item cubes can be stored efficiently in memory; 4 real
                numbers, 3 defining it's center, 1 defining the
                length of it's sides.
		\end{enumerate}

	\subsubsection{Octree Search}
	\hsp{8mm} 
        Previously discussed in this chapter is the issue of
        object-octant membership and two search algorithms used
        to perform the object insertions. The first search algorithms
        seeks a single octant that bounds the object while the
        second builds an approximation of the object's features using
        many octants. In this algorithm, a modified version of the
        latter search is used to build a tightly fitting bounding
        region around an object. Because cubes are being inserted
        into the octree, this algorithm terminates when the tightest
        possible cube is built around the object using the minimum
        number of octants.

	\subsubsection{Octree Search Stopping Conditions}
	\hsp{8mm} 
        This algorithm builds a tight fitting cube by, beginning
        at the root, recursively descending on all paths through the
        octree that intersect the object. Termination of the search
        happens when one of two conditions occur: leave octants are
        reached or the next recursive step results in 64 octants
        intersecting the object.  The first condition obviously
        stops the search because no more levels on the path exist.
        The second condition prevents intersection tests on the
        interior of objects. When condition occurs, the tightest
        bounding box that fully encloses the object has been
        established.  As a corollary result, when a search terminates
        as a result of this test, the bounding fit always results
        in eight octants; the parents of the 64 intersecting octants.
        A similar observation is made in [Collision Detection Between
        N Spheres].

        To visualize how this algorithm works, rather than thinking
        of an object being inserted into the tree, think of the
        octree closing in on the object.

       \begin{figure}
        \epsfxsize=4in
        \vspace{2in}
        %\centerline{\epsffile{figs/rconst.ps}}
	(insert picture of octree successively closing in on object)
       \caption{Run times ($r=20$, $n$ is varied)} \label{fig:fig2}
        \end{figure}


\subsection{Step 2:}
	\hsp{8mm} 
        Step 2 of the algorithm sorts this list of octal code,
        which, as previously described, results in a post order
        traversal of the octree. A description of how this is used
        is presented in step 3.

\subsection{Step 3:}
	\hsp{8mm} 
        Step 3 of the algorithm determines the spatial relationship
        between objects by comparing their octal codes. For the
        N-Body collision detection problem, the only relationship
        of interest is when two objects are sharing the same space.
        This is commonly called interference and, in dynamic systems,
        is equivalent to collision.

	\subsubsection{Octal code comparisons}
	\hsp{8mm} 
        Although octal codes are stored as integers, comparison
        between codes requires more than a simple integer comparison.
        As previously described, octal codes record a paths through
        an octree. Therefore, to determine if two paths are equivalent
        a piece-wise comparison between their respective components
        is used.  Comparison begins with the lowest octree level,
        which in this case is level 1; all octal codes share level
        0, the root, so no space is wasted on recording it. Like
        level components are sequentially tested for equivalence
        until one of two conditions occur:

		\begin{enumerate}
		\item two components from the octals are not equivalent;
                the paths diverge.

		\item one or both of the codes run out of components
                to compare; the codes define octants that lie on
                the same path, but not necessarily on the same level.
		\end{enumerate}

        If the octal codes describe to paths that diverge, then the
        objects obviously do not terminal octants. 

        If one or both codes run out of components, then the
        paths are equivalent up to the point one or both terminate.
        In this case, the objects share a common octant, but not
        necessarily on the same level in the octree. When two codes 
        are equivalent, except for their lengths, then the longer code 
        identifies an octant interior to the shorter code. Therefore,
        the octants share space and the objects they contain could 
        interference.

	\subsubsection{Linear octree traversal}
	\hsp{8mm} 
        To efficiently determine all equivalent octal codes, the
        post-order traversal ordering of the linear octree made in
        step 2 is exploited. Beginning at the front of the list,
        each octal code is tested against all successive codes until
        a code defining a divergent path is evaluated. Due to the
        post-order traversal ordering of the octal code list, once
        an octal code tested diverges from the current octal code,
        it is guaranteed that all remaining codes in the list diverge
        from the current code.

\subsection{Parallelism}

	\subsubsection{Parallelizing Step 1}

	\hsp{8mm} 
        Step 1 of the serial algorithm is made highly parallelizable
        for two reasons. First, the octree used in this algorithm
        is completely static, unlike other implementations that use
        the octree as a database. Therefore, multiple processes can
        operate on the same octree simultaneously without interfering
        with one another. Second, the results of step 1 are sorted
        in step 2. This makes the results produced by multiple
        processors in step 1 completely mergeable.  To parallelize
        step 1, it is assumed k processors are use. The first
        processors is used as a master controller which drives the
        algorithm and the remaining processors are slaves.

	The parallel form of step 1 expands to three steps:

		\begin{enumerate}
		\item the master processor partitions a list of
                objects by the number slave processors available. Sublists
                are distributed to each processor.

		\item slave processors perform step 1 from the original
                algorithm on their local list of objects. A local list
                of result octal codes is produced.

		\item the master processor merges the resultant octal code 
                lists from the slave processors
		\end{enumerate}

	The original algorithm continues at step 2 on the master processor.

        Because the list of objects in being divided by the number
        of processors available, it is easy to see that work is
        alway divided evenly throughout the duration of the algorithm.
        No additional load balancing is needed. Also, because no
        additional operations are done between the time object
        sublists are distributed and the time resultant octal code
        lists are collected, perfect speedup over k processors is
        expected.

	Pseudo code for the entire algorithm including the parallelized step 1 is: