\documentstyle[11pt,epsf]{article}
%12pt,fleqn,epsf]{thesis}

%\pagestyle{normal}
\setlength{\textwidth}{6.6in}
\setlength{\textheight}{9.1in}
\setlength{\topmargin}{-0.5in}
\setlength{\oddsidemargin}{-.0in}
\setlength{\parindent}{1em}
\input{psfig}

\begin{document}
\bibliographystyle{plain}

\vspace*{2in}
\begin{center}
{\LARGE \bf A Parallel Algorithm for Solving \\
\vspace{6pt}
a Tridiagonal Linear System with the ADI Method}

\vspace{15pt}
\thispagestyle{empty}
\large
\begin{tabular}{cc}
Lixing Ma & Frederick C. Harris, Jr.      \\
Dept. of Computer Science & Dept. of Computer Science  \\
Univ. of Nevada         & Univ. of Nevada       \\
Reno, NV 89557          & Reno, NV 89557       \\
ma\_l@cs.unr.edu        & fredh@cs.unr.edu    \\
\end{tabular}
\end{center}

\vspace{1in}
\begin{center}
{\huge Corresponding Author:}
\vspace{10pt}

\noindent
{\large
Frederick C. Harris, Jr.\\
Dept. of Computer Science \\
University of Nevada \\
Reno, NV 89557\\
fax: (775) 784-1877 \\
email: fredh@cs.unr.edu \\
phone: (775) 784-6571
}
\end{center}

\vspace{\fill}

\noindent
{\large This is a full paper} \\
{\large We believe that this paper will fall under
Category 2: Parallel and Distributed Algorithms.}

\newpage
\setcounter{page}{1}
%\thispagestyle{empty}
\baselineskip=24pt
\subsection*{\centering Abstract}
\vspace*{-3mm}
	\hspace{1em}
	The Alternating Direction Implicit (ADI) method is widely used
	in various discretized systems. In this paper, a new parallel
	algorithm is developed to solve a system of tridiagonal linear
	equations with the ADI method for a large-scale heat conduction
	problem. The theoretical analysis on the speedup and scalability
	is also presented. When compared with Thomas's algorithm, this
	algorithm shows a good improvement for parallel machines.

        \noindent
	{\bf keywords:} parallel algorithm, Alternating Direction Implicit
	method, ADI

\section{Introduction}

	\hspace{1em}
	The unsteady, three-dimensional heat conduction equation, can
	be written in three spatial dimensions as follows:
	\begin{equation}
	\frac{\partial T}{\partial t} = \alpha \left( 
		\frac{\partial^{2} T}{\partial x^{2}} +
		\frac{\partial^{2} T}{\partial y^{2}} +
		\frac{\partial^{2} T}{\partial z^{2}}
		\right)
	\label{eq:1}
	\end{equation}
	If the Crank-Nicolson method is used to discretize the above
	equation, it will contain seven unknowns $T_{i,j,k}^{n+1}$,
	$T_{i+1,j,k}^{n+1}$, $T_{i-1,j,k}^{n+1}$, $T_{i,j+1,k}^{n+1}$,
	$T_{i,j-1,k}^{n+1}$, $T_{i,j,k+1}^{n+1}$ and $T_{i,j,k-1}^{n+1}$,
	where the last four unknowns prevent us  from putting it into a
	standard tridiagonal form~\cite{Patankar}, and Thomas's algorithm
	can not be directly used. There are some matrix methods and
	iterative methods that can  be used to solve this discretization
	equation, but the computation time is much longer than that of a
	tridiagonal system. As a result, there is a distinct advantage
	in developing a scheme that will allow Equation (\ref{eq:1})
	to be solved by means of tridiagonal form. Such a scheme is the
	Alternating Direction Implicit (ADI) method~\cite{Anderson}.

	The ADI method usually introduces three time step terms
	$(t+\frac{1}{3} \Delta t, t+\frac{2}{3} \Delta t, t+\Delta t)$ for
	a three-dimensional problem and two time step terms $(t+\frac{1}{2}
	\Delta t, t+\Delta t)$ for two-dimensional problem. In this
	manner, 
	Equation~(\ref{eq:1}) can be discretized as three tridiagonal forms:
	\begin{equation}
	\frac{T_{i,j,k}^{n+\frac{1}{3}}-T_{i,j,k}^{n}}{\frac{\Delta t}{3}} 
		= \alpha \left(
		\frac{\partial^2 T^{n+\frac{1}{3}}}{\partial x^{2}} +
		\frac{\partial^2 T^{n}}{\partial y^{2}} +
		\frac{\partial^2 T^{n}}{\partial z^{2}}
	\right)
	\label{eq:2a}
	\end{equation}
	\begin{equation}
	\frac{T_{i,j,k}^{n+\frac{2}{3}}-T_{i,j,k}^{n+\frac{1}{3}}}
		{\frac{\Delta t}{3}} 
		= \alpha \left(
		\frac{\partial^2 T^{n+\frac{1}{3}}}{\partial x^{2}} +
		\frac{\partial^2 T^{n+\frac{2}{3}}}{\partial y^{2}} +
		\frac{\partial^2 T^{n+\frac{1}{3}}}{\partial z^{2}}
	\right)
	\label{eq:2b}
	\end{equation}
	\begin{equation}
	\frac{T_{i,j,k}^{n+1}-T_{i,j,k}^{n+\frac{2}{3}}}
		{\frac{\Delta t}{3}} 
		= \alpha \left(
		\frac{\partial^2 T^{n+\frac{2}{3}}}{\partial x^{2}} +
		\frac{\partial^2 T^{n+\frac{2}{3}}}{\partial y^{2}} +
		\frac{\partial^2 T^{n+1}}{\partial z^{2}}
	\right)
	\label{eq:2c}
	\end{equation}
	where $\partial^2 / \partial x^{2}$, $\partial^2 / \partial
	y^{2}$, and $\partial^2 / \partial z^{2}$ are the central
	differences in space. Equations (\ref{eq:2a}), (\ref{eq:2b}),
	and (\ref{eq:2c}) always have only three unknowns, which
	allows for construction of a tridiagonal matrix. First,
	Equation~(\ref{eq:2a}) is used to solve $T_{i,j,k}^{n+1/3}$ in the
	$x$-direction according to the initial condition or the results
	of the last time step. Then, $T_{i,j,k}^{n+1/3}$ is used in
	Equation~(\ref{eq:2b}) to solve  $T_{i,j,k}^{n+2/3}$ in the
	$y$-direction. Finally, $T_{i,j,k}^{n+1}$ in the $z$-direction is
	solved according to Equation~(\ref{eq:2c}). This is an overview
	of the Alternating Direction Implicit (ADI) method. Generally,
	these three equations can be represented in a linear algebraic
	equation, namely:
	\begin{equation}
	a_{i}x_{i-1}+b_{i}x_{i}+c_{i}x_{i+1} = f_{i} 
		\hspace{0.5in} i=1,...,n-1
	\label{eq:3}
	\end{equation}

	The ADI scheme is unconditionally stable and has second-order
	accuracy with a truncation error of ${\cal O}[(\Delta t)^2,
	(\Delta x)^2, (\Delta y)^2]$ for two dimensions and is
	conditionally stable and with a truncation error of ${\cal
	O}[(\Delta t), (\Delta x)^2, (\Delta y)^2, (\Delta z)^2]$
	for three dimensions~\cite{Kollman}. The other advantage of
	the ADI scheme is that Thomas's algorithm can be used to find 
	exact solution of Equation~(\ref{eq:3}).

	There are two major parallel algorithms that are widely used for
	the ADI scheme. One is the domain decomposition algorithm\cite{Wang},
	and the other uses higher dimensional linear recurrences
	to solve the 
	diatriangular 
	linear system Equation
	(\ref{eq:3})\cite{WangVanka}.

	In the domain decomposition algorithm, the main domain is divided
	into many subdomains, where the number of subdomains is usually
	chosen to be a power of 2. The data in a subdomain depend on
	other domains only through the inner boundaries between those
	domains. Within a subdomain, the nodal data have a well-defined
	interdependency. A more effective approach is to localize
	the line inversion algorithm and implement a bona fide domain
	decomposition. If one can break the line inversion across the
	whole domain into local line inversions within the subdomains and
	remove the interprocessor data dependency, then each processor can
	work on the local line segment and the execution can be carried
	out in parallel.  With such a parallel line inversion algorithm,
	it is possible to achieve domain decomposition and parallel
	processing by dividing the computational domain into subdomains
	such as the division presented in Figure~\ref{fig:fig1}.

	Figure~\ref{fig:fig1} shows a domain decomposed into $4 \times
	4$ subdomains with four processors in parallel. When scanning in
	the horizontal direction, each processor combines four horizontal
	subdomains to solve Equation (\ref{eq:3}) to obtain a new  $x_{i}$
	or $T_{i}$. Based on those values, Equation~(\ref{eq:3})
	is then solved in the vertical direction in parallel.

        \begin{figure}
	%\vspace{3in}
        \epsfxsize=4in
        \centerline{\epsffile{fig-1c.ps}}
        \caption{Domain decomposition based on localized inversion.} 
	\label{fig:fig1}
        \end{figure}

	Because the boundary node value of each horizontal domain or
	vertical domain is always taken from previous data, it will take
	more iterations to converge if massively parallel processors
	are used. This is the same problem that causes the Gauss-Siedel
	method to be replaced by SOR.

	In the second algorithm, the main domain is not 
	divided into subdomains. The goal is to provide a parallel
	algorithm for solving the 
	diatriangular 
	linear system in
	Equation~(\ref{eq:3}),
	which also can be presented in matrix form as:
	\begin{equation}
	Ax = f
	\label{eq:3a}
	\end{equation}
	where $x$ and $f$ are $n$-dimensional vectors and $A$ is the 
	following $n \times n$ tridiagonal matrix:

	\begin{equation}
	A = \left[
	\begin{array}{cccccccc}
	b_{1} & c_{1} & 0 & ... & ... & ... & ... & 0 \\
	a_{2} & b_{2} & c_{2} & 0 & ... & ... & ... & 0 \\
	0 & a_{3} & b_{3} & c_{3} & 0 & ... & ... & 0 \\
 	... & ... & ... & ... & ... & ... & ... & ... \\
	0 & ... & ... & ... & 0 & a_{n-1} & b_{n-1} & c_{n-1} \\
	0 & ... & ... & ... & ... & 0 & a_{n} & b_{n} \\
	\end{array}
	\right]
	\label{eq:4}
	\end{equation}

	\vspace{12pt}

	Matrix $A$ can be decomposed into an $LDU$ factorization form. 
	That is, you can determine a unit lower triangular matrix $L$, 
	a diagonal matrix $D$, and a unit upper triangular matrix $U$ 
	such that $A=LDU$. 
	The matrices $L$, $D$ and $U$ can be specified as follows:
	\begin{equation}
	L = \left[
	\begin{array}{ccccc}
	1 & 0 & ... & ... & 0 \\
	l_{2} & 1 & 0 & ... & 0 \\
	0 & l_{3} & 1 & ... & 0 \\
	... & ... & ... & ... & ... \\
	0 & ... & ... & l_{n} & 1 \\
	\end{array}
	\right]
	U = \left[
	\begin{array}{cccccc}
	1 & u_{1} & 0 & ... & ... & 0 \\
	0 & 1 & u_{2} & 0 & ... & 0  \\
	0 & 0 & 1 & u_{3} & ... & 0 \\
	... & ... & ... & ... & ... & ... \\
	0 & ... & ... & 0 & 1 & u_{n-1} \\
	0 & ... & ... & ... & 0 & 1 \\
	\end{array}
	\right]
	D = \left[
	\begin{array}{cccc}
	d_{1} & 0  & ... & 0 \\
	0 & d_{2} & ... & 0 \\
	... & ... & ... & ... \\
	0 & ... & 0 & d_{n} \\
	\end{array}
	\right]
	\label{eq:5}
	\end{equation}
	where
	\begin{equation}
	\begin{array}{lcll}
	d_{1}	& = & b_{1}  \\
	\\
	d_{j} 	& = & b_{j} - a_{j}c_{j-1} / d_{j-1}, 	
			 & 2 \leq j \leq n \\
	\\
	l_{j}	& = & a_{j}/d_{j-1},
			& 2 \leq j \leq n \\
	\\
	u_{j}	& = & c_{j}/d_{j}, 
			& 1 \leq j \leq n-1 
	\end{array}
	\label{eq:6}
	\end{equation}

	\vspace{12pt}

	From Equation (\ref{eq:6}), it is clear that the values of
	$l_{j}$ and  $u_{j}$ can be deduced immediately from the values
	of $d_{j}$. For convenience of parallelization, let $d_{j}=
	w_{j}/w_{j-1}$, where $w_{0}=1$ and$w_{1}=b_{1}$. Substituting
	$d_{j}=w_{j}/w_{j-1}$ and $d_{j-1}=w_{j-1}/w_{j-2}$ in the
	equation $d_{j}=b_{j}-a_{j}c_{j-1}/d_{j-1}$, the sequence
	$w_{j}$ is given by the following second-order linear recurrence:
	\begin{equation}
	\begin{array}{lcl}
	w_{0} & = & 1 \\
	\\
	w_{1} & = & b_{1} \\ 
	\\
	w_{j} & = & b_{j}w_{j-1}-a_{j}c_{j-1}w_{j-2}, 
		\hspace{0.3in} 2 \leq j \leq n \\
	\end{array}	
	\label{eq:7}
	\end{equation}

	\vspace{12pt}
	After the sequence $w_{j}$ is obtained, all of the values in
	matrices $L$, $D$, and $U$ can be obtained through the relation in
	Equation (\ref{eq:6}). The solution to the linear system $Ax =
	b$ can be expressed by the following recurrence:
	\begin{eqnarray}
	%\begin{array}{lcl}
	x_{1} & = & \frac{b_{1}}{a_{11}} \nonumber\\
	%\nonumber \\
	x_{i} & = & \sum_{j=i-m+1}^{i-1} - \frac{a_{ij}}{a_{ii}}x_{j}
		+ \frac{b_{i}}{a_{ii}}, \hspace{0.3in} 2 \leq i \leq n 
	%\end{array}
	\label{eq:8}
	\end{eqnarray}	

	\vspace{12pt}
	Solving this system of equations with a Divide and Conquer 
	approach (DAC) improved
	this algorithm. Using DAC, the whole linear tridiagonal equation
	system is divided into $k$ subsystems with a matrix transformation
	on matrix $A$. Then $P$ processors are assigned to deal with each
	subsystem in parallel. This method is complicated and needs
	many transforms to take into account the effect of the internal
	boundary node. To find the final solution of one linear system,
	you have to solve several linear systems first. Therefore in
	this paper, a new parallel algorithm is introduced to solve the
	linear system $Ax = b$.

\section{Modified Cyclic Reduction}

   \subsection{Cyclic Reduction}

	\hspace{1em}
	Harold Stone presents the Cyclic Reduction method for solving
	tridiagonal equations in his book~\cite{Stone}. 
	His method works extremely
	well for the Poisson matrix whose diagonals contain only 1s and
	-2s. The idea behind cyclic reduction is to sum three consecutive
	equations as indicated here:
	\begin{equation}
	\begin{array}{lcl}
	x_{i-2} - 2x_{i-1} + x_{i} & = & f_{i-1} \\
	\\
	x_{i-1} - 2x_{i} + x_{i+1} & = & f_{i} \\
	\\
	x_{i} - 2x_{i+1} + x_{i+2} & = & f_{i+1} \\
	\end{array}
	\label{eq:9}
	\end{equation}

	\vspace{12pt}

	Adding the first and third equations with the second equation
	multiplied by two, $x_{i-1}$ and $x_{i+1}$ are eliminated, and
	Equation (\ref{eq:9}) becomes one equation as indicated here:
	\begin{equation}
	x_{i-2} - 2x_{i}-x_{i+2} = f_{i-1}+2f_{i}+f_{i+1}
	\label{eq:10}
	\end{equation}

	To solve the full system, all of the equations can be reduced to
	Equation (\ref{eq:10}) for the first kind of boundary condition
	or into the three equations in (\ref{eq:9}) for the second kind of
	boundary condition. Then, back substitution is used to solve all
	unknowns. Both the reduction and back
	substitution steps can be done in parallel.

	Unfortunately, this scheme works only for the steady, isotropic
	and uniform space step problem as described in Equation
	(\ref{eq:1}), because the parameter matrix does not contain just
	1s and -2s. The important thing to note is that the cyclic
	reduction method can be modified and the idea can be generalized
	to other types of matrices.

   \subsection{Modified Cyclic Reduction(MCR)} 
	
	\hspace{1em}
	The general form of a tridiagonal matrix used in the ADI method is
	represented in Equation (\ref{eq:4}). The three consecutive
	equations are as follows:
	\begin{equation}
	\begin{array}{lcl}
	a_{i-1}x_{i-2} + (-b_{i-1}x_{i-1})+c_{i-1}x_{i} & = & f_{i-1} \\
	\\
	a_{i}x_{i-1} - b_{i}x_{i} + c_{i}x_{i+1} & = & f_{i} \\
	\\
	x_{i+1}x_{i}-b_{i+1}x_{i+1}+c_{i+1}x_{i+2} & = & f_{i+1} \\
	\end{array}
	\label{eq:10b}
	\end{equation}

	\vspace{12pt}

	In order to eliminate two unknowns $x_{i-1}$ and $x_{i+1}$,
	we sum the first equation multiplied by $a_{i}/b_{i-1}$, the
	third equation multiplied by $c_{i}/b_{i+1}$ and the second
	equation. After rearranging, Equation~(\ref{eq:10b}) becomes
	\begin{equation}
	a_{i}^{(1)}x_{i-2}-b_{i}^{(1)}x_{i}+c_{i}^{(1)}x_{x+2} =
		f_{i}^{(1)}
	\label{eq:11}
	\end{equation}
	where 
	\begin{eqnarray*}
	a_{i}^{(1)} & = & a_{i-1}a_{i}/b_{i-1}, \\
	b_{i}^{(1)} & = & b_{i}+a_{i}c_{i-1}/b_{i-1},  \\
	c_{i}^{(1)} & = & c_{i}c_{i+1}/b_{i+1},   \ \  {\rm and}   \\
	f_{i}^{(1)} & = & (a_{i}/b_{i-1})f_{i-1} + f_{i} + 
		(c_{i}/b_{i+1})f_{i+1}.
	\end{eqnarray*}

	In order to obtain only one equation, the space steps $\Delta
	x$, $\Delta y$, and $\Delta z$ have to be adjusted in the
	discretization of the Equation (\ref{eq:1}) so that the total
	number of unknowns is equal to $2^n-1$. In addition, it is
	strongly recommended that you normalize the coefficient $b_{i}$.
	The first advantage of normalization is that the memory can be
	saved for storing $b_{i}$. The second advantage is that it helps
	you to avoid overflow if the problem has huge numbers. For an
	isotropic and uniform space step problem, $a_{i}$ and $c_{i}$
	are usually identical and are normalized more naturally. This is
	the correct choice, but it only works for $b_{i} = 2$. Because
	the matrix $A$ is diagonal dominant and $b_{i}^{(0)} =
	b$, and $b_{i}^{(i+1)} = b_{i}^{i^2}-2$, it will overflow after
	several steps of reduction. For a typical Poisson equation,
	$b_{i}^{(0)}$ is usually taken as 4 or 6.

   \subsection{Middle Node Algorithm for a Parallel MCR}

	\hspace{1em}
	Let us consider the first kind of boundary condition for
	the  ADI method while solving Equation~(\ref{eq:1}). In a
	specific direction, the number of nodes is equal to $2n+1$. The
	first and the last nodes are known because of the first boundary
	condition, therefore the total number of unknown nodes is equal
	to $2n-1$. This number is also equal to the dimension of the
	matrix $A$ in Equation~(\ref{eq:3}). Obviously, $n-1$ is the
	total number of times you perform cyclic reduction, at which
	time you have only one equation left. Three nodes remain in
	the last equation, $x_{0}$, $x_{2^{n-1}}$,     and $x_{2^{n}}$,
	where $x$ is counted from $0$. These are the first node, the
	middle node, and the last node. The first node and the last node
	are two known boundary values; therefore, the middle node can
	be found through the last equation. At the middle point, the
	line is cut in two parts and the same method is used to find
	two middle node values. In a similar manner, find four, 
	eight, sixteen,..., middle nodes until all the unknown node 
	values are obtained. We call this
	algorithm the Middle Node Algorithm. The basic formula of this
	algorithm is based on Equation~(\ref{eq:12}).
	\begin{equation}
	\begin{array}{rcl}
	a_{i}^{(j)}x_{i-2^{j}}-x_{i} + c_{i}^{(j)}x_{i+2^{j}} 
		& = &  f_{i}^{(j)}\\
	\\
	a_{i}^{(0)} = \frac{a_{i}}{b_{i}}, \ a_{i}^{(j)} 
		& = &  
		\frac{a_{i-2^{j-1}}^{(j-1)}a_{t}^{(j-1)}}
{1-a_{i}^{(j-1)}c_{i-2^{j-1}}^{(j-1)}-a_{i+2^{j-1}}^{(j-1)}c_{i}^{(j-1)}}\\
	\\
	c_{i}^{(0)} = \frac{c_{i}}{b_{i}}, \ c_{i}^{(j)} 
	 & = &  
		\frac{c_{i}^{(j-1)}c_{t+2^{j-1}}^{(j-1)}}
{1-a_{i}^{(j-1)}c_{i-2^{j-1}}^{(j-1)}-a_{i+2^{j-1}}^{(j-1)}c_{i}^{(j-1)}}\\
	\\
	f_{i}^{(0)} = \frac{f_{i}}{b_{i}}, \ f_{i}^{(j)} 
	 & = &  
		\frac{a_{i}^{(j-1)}f_{t-2^{j-1}}^{(j-1)}+ f_{t}^{(j-1)}
+ c_{i}^{(j-1)}f_{t+2^{j-1}}^{(j-1)}}
{1-a_{i}^{(j-1)}c_{i-2^{j-1}}^{(j-1)}-a_{i+2^{j-1}}^{(j-1)}c_{i}^{(j-1)}}\\
	\end{array}
	\label{eq:12}
	\end{equation}
	Where $j$ is the order of reduction step, $0 \leq j \leq
	n-1$. At $j=0$, the equation is almost same as 
	Equation~(\ref{eq:3}), the only difference is that 
	each $b_{i}$ is
	normalized. There are $2^{n-1}$ nodes at $j=0$. Each node also
	represents a processor, and the processors execute code for 
	Equation~(\ref{eq:12}).
	Each node or processor is assumed to have its own memory to store
	the information contained in Equation~(\ref{eq:12}), $e.g.$ the
	parameters($a$, $c$ and $f$) and boundary values or neighbor data. For
	convenience, the parallel algorithm is illustrated through the
	seven node example in Figure~\ref{fig:fig2}.

        \begin{figure}[h]
        \epsfxsize=6in
	%\vspace{3in}
        \centerline{\epsffile{fig-2c.ps}}
        \caption{Scheme of the Middle Node Algorithm.} 
	\label{fig:fig2}
        \end{figure}

	There are two main procedures in the Middle Node Algorithm: one is
	reduction and the other is middle node solving. Let us look at the
	reduction procedure first. At the level $j=0$, seven processors
	normalize their own equations in parallel, and a new $a$, $c$, 
	and $f$ are obtained. Before entering the second
	level, this data ($a$, $c$, and $f$) is sent in parallel from P1, P3,
	P5 and P7 to all their neighbor processors. P1 and P7 have only
	one neighbor, and P3 and P5 have two neighbors each. In level $j=1$,
	P2, P4 and P6 calculate the new $a$, $c$, and $f$ based on the
	data from level $j=0$. Then, P2 and P6 send $a$, $c$, and $f$
	to P4. At the highest level $j=2$, the procedure of reduction
	is completed after $a$, $c$, and $f$ are calculated by P4.

	In most implementations of the ADI method, the boundary
	values $x_{0}$ and $x_{8}$ are usually combined into the
	right hand side (the $f$ term) of Equation (\ref{eq:3}) in the
	discretization of node 1 and node 7. But $x_{0}$, $x_{8}$ and
	their coefficients still remain in the left hand side of the Equation
	(\ref{eq:3}). The advantage is to guarantee the last equation
	is solvable. Otherwise, we need to solve three equations at the
	highest level.

	In the middle node solving procedure, P4 uses the first equation
	of Equation (\ref{eq:12}) to find  $x_{4}$ and sends the value
	to P2, P6, P3, and P5 at the level $j=2$. P2 and P6 find $x_{2}$
	and $x_{6}$, respectively, in parallel. Then P2 sends $x_{2}$ to
	its neighbors P1 and P3, and P6 sends $x_{6}$ to P5 and P7. At
	the lowest level $j=0$, P1, P3, P5 and P7 find $x_{1}$, $x_{3}$,
	$x_{5}$, and $x_{7}$ in parallel. At this point, all the unknowns
	in Equation (\ref{eq:3}) are solved.

	For the second kind of boundary condition, the linear system
	is described in Equation~(\ref{eq:3}). Because the first and
	last nodes are also unknowns, three equations have to be solved
	in the final step. To do so, we use the
	Crown Algorithm.  The Crown Algorithm uses $2^{n+1}$ processors
	and is illustrated in Figure~\ref{fig:fig3}. The difference
	between the Middle Node Algorithm and the Crown Algorithm is in
	the reduction procedure. They are exactly same 
	after the first three equations are solved. The
	Crown Algorithm will not be discussed further in this paper.

        \begin{figure}[h]
        \epsfxsize=6in
	%\vspace{3in}
        \centerline{\epsffile{fig-3c.ps}}
        \caption{Scheme of the Crown Algorithm for secondary boundary
		condition.} 
	\label{fig:fig3}
        \end{figure}

	Sometimes, the number of processors is less than the number
	of unknown nodes. This is often encountered when we use
	PVM~\cite{PVM:book} or MPI~\cite{mpi:book,mpi:pacheco,mpi:ref}
	to link several workstations together as a parallel virtual
	machine. Figure~\ref{fig:listing} contains the pseudo code for
	when there is a difference between the unknown node number and
	the processor number.

\begin{figure}
{ \baselineskip=12pt \footnotesize

\underline{\hspace{5.5in}} \\
{\bf input:} \\
   \hspace*{1em} N --- number of unknown nodes \\
   \hspace*{1em} P --- number of processors \\
   \hspace*{1em} a[i], c[i], f[i] --- normalized matrix parameters for 
	unknown node i. i=1,2 ... N \\
   \hspace*{1em} x[0], x[N+1] --- boundary node value from boundary 
	condition \\
{\bf output:} \\
   \hspace*{1em}find x[1], x[2], ..., x[N] \\
\underline{\hspace{5.5in}} \\
{\bf step 1:} \\
   \hspace*{1em}create P tasks(processors) and get their task\_id[i], 
	i=1, 2, ..., P \\
   \hspace*{1em} find the level of unknown nodes $ln$ and level of 
		tasks $lp$ \\
	\hspace*{2em}$l_{n} = \log_{2}(N+1)$ \\
	\hspace*{2em}$l_{P} = \log_{2}(P+1)$ \\
   \hspace*{1em} assign a, c, f and x[0], x[N+1] to all tasks \\
	\hspace*{2em} for $i=1$ to $i=P$, step=1 do \\
	\hspace*{3em} for $j=(i-1)*2^{ln-lp}+1$ to $(i+1)*2^{ln-lp}-1$, 
		step=1 do \\
	\hspace*{4em} assign a[j], c[j] and f[j] to task\_id[i] \\
	\hspace*{3em} send x[0], x[N+1] to task\_id[i]. \\
{\bf step 2:} \\
  \hspace*{1em} computing a, c, and f parallelly in each task\_id[p], 
		p=1, ..., P without data communication \\
	\hspace*{2em}count = 1 \\
	\hspace*{2em}while $count<ln-lp$ do \\
	\hspace*{3em}for $i=(p-1)*2^{ln-lp}+2^{count}$ to 
		$(p-1)*2^{ln-lp}+2^{ln-lp+1}$, $step=2^{count}$ do \\
	\hspace*{4em} compute a, c, and f according to Equation (\ref{eq:12}) \\
	\hspace*{3em} count = count + 1 \\
	\hspace*{2em}end do \\
	\\
  \hspace*{1em} send a, c and f to task\_id[p-1](if $p>1$) 
	and task\_id[p+1]($p<P$) \\
{\bf step 3:} \\
  \hspace*{1em} computing a, c and f parallelly in each task\_id[p], 
		p=1, ... ,P with data communication \\
	\hspace*{2em} count = 1 \\
	\hspace*{2em} while $count < lp$ do \\
	\hspace*{3em} if $p\%2^{count}=0$ do \\
	\hspace*{4em} receive a, c, and f from $p-2^{count-1}$ 
		and $p+2^{count-1}$ \\
	\hspace*{4em} compute a, c, and f according to Equation (\ref{eq:12})\\
	\hspace*{4em} if $p\%2^{count+1} != 0$ send a, c and f to 
		task\_id[$p-2^{count}$] and task\_id[$p+2^{count}$] \\
	\hspace*{3em} end do \\
	\hspace*{3em} count = count + 1 \\
	\hspace*{2em} end while\\
{\bf step 4:}\\
  \hspace*{1em} use Equation (\ref{eq:12}) to solve unknown nodes \\
\underline{\hspace{5.5in}} \\
}

\caption{Pseudo Code of the Middle Node Algorithm} \label{fig:listing}
\end{figure}


\section{Performance}

	\hspace{1em}
	Speedup is an important parameter to consider when evaluating 
	the performance
	of a parallel algorithm. It is normally defined as the ratio of
	the executing time for one processor to the executing time for
	$P$ processors. The executing time is the sum of computation time,
	data communication time, and idle time. The computation time of an
	algorithm ($T_{cal}$) is the time spent performing computation
	rather than communicating or idling. The communication time
	of an algorithm ($T_{com}$) is the time that its tasks spend
	sending and receiving messages. In the idealized multicomputer
	architecture, the cost of sending a message between two
	tasks located on different processors can be represented by
	two parameters: the message startup time, which is the time
	required to initiate the communication, and the transfer time
	per word. For speedup analysis in this paper, we consider only 
	computation time and communication time. The
	operations of addition and multiplication are also assumed to
	have the same operation time. In addition, because there is no 
	big data set that needs to be communicated 
	in the Middle Node Algorithm, the startup time
	is the main part for $T_{com}$. We can roughly take the value
	of $T_{com}$ as some constant times $T_{cal}$ \cite{Foster}.

	For the Middle Node Algorithm, six operations are needed to calculate
	$a_{i}^{(j)}$ and  $c_{i}^{(j)}$, nine operations
	are needed to calculate  $f_{i}^{(j)}$, and four operations
	are needed to calculate $x_{i}$. If the number of unknown nodes 
	$N$ is equal to $2^L-1$, $a_{i}^{(j)}$,  $c_{i}^{(j)}$, and
	$f_{i}^{(j)}$ will be calculated  $2N-\log_{2}(N+1)$ times
	and  $a_{i}^{(j)}$  has to be calculated $N$ times. Therefore,
	the total operations needed in this algorithm is equal to
	$46N-21\log_{2}(N+1)$, and the total calculation time is equal
	to $[46N-21\log_{2}(N+1)]T_{cal}$, where  $T_{cal}$ is the time
	needed for a unique addition or multiplication operation.

	For the parallel algorithm, $P$ processors are used on $N$ unknown
	nodes, where $P=2^{Lp}-1$. If $P$ is less than $N$, there are
	$4(n+1)/(P+1)-\log_{2}((N+1)/(P+1))-3$ times to calculate
	$a_{i}^{(j)}$,  $c_{i}^{(j)}$, and $f_{i}^{(j)}$ in each
	processor. Also there are  $(N+1)/2^{Lp}-1$ times to calculate
	$x_{i}$. Those operations are executed independently--that is,
	without data communication. Therefore, the total calculation
	time is $(92(N+1)/(P+1)-2\log_{2}[(N+1)/(P+1)])T_{cal}$.

	In the top $L_{p}-1$ layer, processors calculate 
	$a_{i}^{(j)}$,	$c_{i}^{(j)}$, $f_{i}^{(j)}$, and  $x_{i}$
	with data communication. Each processor calculates only one
	node on one level. The total time for calculation is equal to
	$25(\log_{2}(P+1)-1)T_{cal}$. For each node, the processor needs
	two sets of $a_{i}^{(j)}$,  $c_{i}^{(j)}$, and $f_{i}^{(j)}$,
	from both the left and the right neighbor processors and sends 
	$x_{i}$ to
	those neighbors. If the receiving procedure and sending procedure
	are assumed to take the same time, the total time spent on data
	communication is $3(\log_{2}(P+1)-1)T_{com}$. According to the
	definition of speedup, we have
	\[
	speedup = \frac{[46N - 21\log_{2}(N+1)]T_{cal}}
	{(92\frac{N+1}{P+1}-21 \log_{2}(\frac{N+1}{P+1})+25 \log_{2}
(\frac{P+1}{2}))T_{cal} + 3(\log_{2}(\frac{P+1}{2}))T_{com}}
	\]

	Figure~\ref{fig:fig4} is the speedup graph with total unknown
	node size equal to 2047 and three different constants chosen to set
	communication speeds. It can be seen from the figure that the 
	Middle Node Algorithm
	has a very good performance for the multiprocessor machine
	with intra-processor communication. 

	\begin{figure}
        \epsfxsize=5in
	%\vspace{3in}
        \centerline{\epsffile{fig-4c.ps}}
        \caption{Speedup Curves} \label{fig:fig4}
        \end{figure}

\section{Conclusion}

	\hspace{1em}
	When compared with the SOR method, our algorithm with the
	ADI method is always stable even when the matrix {\bf \it A} in
	Equation~(\ref{eq:4}) is not predominate. The ADI method can be
	used for all kinds of thermal and fluid problems while SOR can
	work only for a Poison equation. Similar to SOR, the ADI method
	is also a numerical method and needs a number of iterations to
	converge to an exact solution. Because the ADI method finds an
	exact linear equation solution in one dimension, it can save
	iteration times to find a convergent solution in the whole
	domain. Therefore, a good parallel ADI algorithm can be used in
	many engineering and scientific computation aspects.

	Thomas's algorithm has been widely used in other ADI method
	algorithms to solve the tridiagonal linear equation on serial
	machines. The Middle Node Algorithm presented in this paper
	is slower than Thomas's algorithm when only one processor is
	used. Both algorithms have same order notation, ${\cal O}(n)$,
	in general case, i.e. $a_i$, $b_i$ and $c_i$ have different
	values for $i=1,..., n$. But in many cases, they have identical
	value except for the two equations at boundary nodes, $i.e.$ $i=1$
	and $n$. In this case, the calculation operations to find $a_i$,
	$b_i$, and $c_i$ have to be taken only once on each level, and
	the Middle Node Algorithm will be much faster than Thomas's
	algorithm. Our algorithm is the best case with order
	${\cal O}(\log(n))$ when we limit the input to isotropic problems
	with unique space step.

	For parallel processing, the Middle Node Algorithm has the
	same speedup as the DAC algorithm when the
	total number of processor is less than 63.  In this range, the
	effective processing speed will increase at the rate of about
	$33$\%. There are two factors that affect the speedup. One is the
	communication startup time. At the current time, even for IBM SP2,
	the startup time is almost 40 times of the calculation speed,
	which is why this number was chosen for Figure~\ref{fig:fig4}. For
	the Middle Node Algorithm, the speedup is cut mainly by the
	second factor, which is the 
	idle processor. The executing processor number is cut in half as
	the level goes up. To improve the performance of the algorithm,
	it is good to choose the number of unknown nodes to be much larger
	than the total number of processors. If the problem is of the
	best case type, the speedup will increase significantly. In addition,
	the scalability will also improve, because the ratio of
	the data communication to the data calculation is much smaller.

	The evaluation of parallel methods is a complicated 
	project~\cite{jaja:book,kggk:book}. Both
	speedup and scalability are two of the main parameters taken
	into account during the evaluation. Unfortunately, they are
	only the parallelization performance description of a specific
	algorithm. Now people usually take the speed of the same algorithm
	with one processor as the reference for speedup calculation.
	This leads to some problems in the final results.  For example,
	the serial SOR method could be 64 times fast than the serial
	Gauss-Siedel method for some large-scale linear problem. A good
	parallel Gauss-Siedel algorithm with 64 processors can get a speedup
	of 64.	It is still difficult to say whether or not it is a 
	good parallel
	algorithm. Therefore, a good parallel algorithm for SOR must
	have a good order and good convergence speed, in addition to 
	speedup and scalability.

	In conclusion, we have developed a new parallel algorithm that
	matches or outperforms the other methods in our comparison group.
	We have shown that communication costs are the major part of
	its performance impact.  This leads us to believe that our
	algorithm will be more significant in the future with faster
	communication networks.

\baselineskip=12pt
\bibliography{paper,/staff/fredh/papers/bibdir/parallel}
\end{document}