\documentstyle[twocolumn,epsf]{article}
\pagestyle{empty}
\setlength{\textwidth}{6.9in}
\setlength{\textheight}{8.75in}
\setlength{\columnsep}{2.5pc}
\setlength{\topmargin}{-0.3in}
\setlength{\oddsidemargin}{-.3in}
\setlength{\parindent}{1pc}
\makeatletter
\def\@normalsize{\@setsize\normalsize{12pt}\xpt\@xpt \abovedisplayskip 11pt
plus2pt minus5pt\belowdisplayskip \abovedisplayskip \abovedisplayshortskip \z@
plus3pt\belowdisplayshortskip 6pt plus3pt minus3pt\let\@listi\@listI}
%the following line was changed
\def\subsize{\@setsize\subsize{12pt}\xipt\@xipt}
\def\section{\@startsection{section}{1}{\z@}{24pt plus 2 pt
minus 2 pt} {12pt plus 2pt minus 2pt}{\large\bf}}
\def\subsection{\@startsection {subsection}{2}{\z@}{12pt
plus 2pt minus 2pt}{12pt plus 2pt minus 2pt}{\subsize\bf}}
\makeatother

\input{psfig}


\begin{document} 
\bibliographystyle{plain}

\date{}
\title{\LARGE \bf Parallel Inversion of Polynomial Matrices}
\author{
\begin{tabular}[t]{ccc}
Alina Solovyova-Vincent	& Frederick C. Harris, Jr.	& M. Sami Fadali \\
Computer Science & Computer Science & Electrical Engineering \\
University of Nevada, Reno & University of Nevada, Reno	& University of Nevada, Reno\\
Reno, NV 89557 		& Reno, NV 89557 	& Reno, NV 89557 \\
alina@cs.unr.edu	& fredh@cs.unr.edu	& fadali@ee.unr.edu
\end{tabular}
}
\maketitle
\thispagestyle{empty} 
\subsection*{\centering Abstract} 
\vspace*{-3mm} 
%\hspace{1em}
	This paper presents an overview of different methods proposed
	in the last several decades for computing the inverse of a
	polynomial matrix, concentrating on Bus\l owicz's algorithm.
	A detailed description of Bus\l owicz's algorithm and its
	sequential implementation is followed by the presentation of a new
	parallel algorithm, based on Bus\l owicz's.  The distributed and 
	shared memory versions of this parallel algorithm are discussed,
	and the resulting computation times are analyzed and compared.

	\noindent
     {\bf keywords:} parallel algorithm, polynomial matrix inversion

\section{Introduction}

	The problem of inverting polynomial matrices (or,
	more generally, rational matrices) has been under investigation
	for over half of a century. This research is well
	motivated because the computation of such inverses is needed in
	many fields. For instance, in multivariable control systems,
	a system is often described by a matrix of rational transfer
	functions. The problem of finding the inverse of a rational
	matrix arises in analysis and design using the inverse Nyquist
	array method~\cite{Mac,Rosenbrock-74}, in parameterization
	design of linear decoupling controllers~\cite{Lin2,Pavel},
	in robust stability analysis~\cite{Fadali}, and in design using the
	QFT method~\cite{Horowitz,Mac}. The inversion of polynomial
	matrices is also required in various fields of control system
	synthesis~\cite{Kail,Wol}. Furthermore,
	the inversion of rational matrices is required in the analysis
	and synthesis of passive and active RLC networks for
	inversion of admittance or impedance matrices~\cite{Keon} 
	and in the analysis
	of power systems using the method of diakoptics~\cite{Bram}. When
	a rational matrix is expressed as a ratio of a numerator
	polynomial matrix and a denominator scalar polynomial, the
	computation of the inverse essentially reduces to the computation
	of the inverse of a polynomial matrix~\cite{Lin}. Thus, in many
	cases, the problem of finding the inverse of a rational matrix
	can be solved by inverting the corresponding polynomial matrix.

	The rest of this paper is outlined as follows: Section 2
	introduces definitions and notations and provides the overview
	of other existing inversion methods along with their advantages
	and disadvantages. It also introduces Bus\l owicz's algorithm and
	outlines reasons for selecting this algorithm as the basis for
	a parallel implementation. Section 3 describes the details
	of the sequential implementation of the algorithm as well as
	the changes necessary to parallelize it.  Discussions of the 
	shared memory and
	distributed memory  parallel implementations complete this section.
	Section 4 presents and analyzes the results of the sequential
	and parallel versions of the program. Conclusions and directions
	for future work are provided in Section 5.


\section{Notation, Literature Review, and  Bus\l owicz's Algorithm}

%	\hspace{1em}
%	This section presents the problem of inverting a polynomial
%	matrix. Section 2.1 introduces the notation and defines some of
%	the terms commonly used in this area of research.  Section 2.2
%	provides an overview of different methods and algorithms
%	appearing in the literature for inversion of rational and
%	polynomial matrices. Section 2.3 focuses on Bus\l owicz's
%	algorithm~\cite{Bus} and gives its detailed description.
%	We concentrate on this algorithm because it is the basis for
%	our parallel implementation discussed in Section 3.2.

   \subsection{Introduction of Notation}

	A polynomial matrix is a matrix which has polynomials in all of
	its entries.  Consider a polynomial matrix $H(s)$ of degree $n$

	\[
	H(s)=H_{n}s^{n}+H_{n-1}s^{n-1}+H_{n-2}s^{n-2}+...+H_{0}, 
	\]
	where $H_{i}$ are $r \times r$ constant square matrices, $i=0,...,n$. 
	An example of such a matrix is

%

	\[
	H(s)=\left[ 
	\begin{array}{ll}
	s+2 & s^{3}+3s^{2}+s \\ 
	s^{3} & s^{2}+1
	\end{array}
	\right] .
	\]
	In this case, the degree of the polynomial matrix is $n=3,$ and
	the size of the matrix $H_{i}$ is $r$ $=$ $2$. 
	For this example,

	\[
	H_{0}=\left[ 
	\begin{array}{ll}
	2 & 0 \\ 
	0 & 1
	\end{array}
	\right] , H_{1}=\left[ 
	\begin{array}{ll}
	1 & 1 \\ 
	0 & 0
	\end{array} ,
	\right]
	\]
	\[
	 H_{2}=\left[ 
	\begin{array}{ll}
	0 & 3 \\ 
	0 & 1
	\end{array}
	\right] , H_{3}=\left[ 
	\begin{array}{ll}
	0 & 1 \\ 
	1 & 0
	\end{array}
	\right] .
	\]
%
%
%	\[
%	H(s)=\left[ 
%	\begin{array}{ll}
%	s^{3}+2s+1 & s^{3}+s^{2} \\ 
%	s & s^{2}+s
%	\end{array}
%	\right] . 
%	\]
%	In this case, the degree of the polynomial matrix is $n=3,$ and
%	the size of the matrix $H_{i}$ is $r$ $=$ $2$. 
%	\newpage
%	For this example,
%
%	\[
%	H_{0}=\left[ 
%	\begin{array}{ll}
%	1 & 0 \\ 
%	0 & 0
%	\end{array}
%	\right] ,H_{1}=\left[ 
%	\begin{array}{ll}
%	2 & 0 \\ 
%	1 & 1
%	\end{array}
%	\right] ,H_{2}=\left[ 
%	\begin{array}{ll}
%	0 & 1 \\ 
%	0 & 1
%	\end{array}
%	\right] ,H_{3}=\left[ 
%	\begin{array}{ll}
%	1 & 1 \\ 
%	0 & 0
%	\end{array}
%	\right] . 
%	\]
	$H(s)$ is considered column proper if its highest degree
	coefficient matrix $H_{n}$ is non-singular~\cite{Wol}. 
	$H(s)$ is row proper
	if its transpose, $H^{T}(s)$, is column proper.

	The notation used to denote the inverse of a matrix is
	$H^{-1}(s).$ Only unimodular matrices ($i.e.$, polynomial matrices
	with a non-zero determinant that is independent of $s$) have
	inverses that are themselves polynomial matrices~\cite{Mun}.

	Rational matrices are matrices whose entries are rational
	functions in $s$, which are non-singular at $s=0$. A rational
	function can be expressed as $Hd^{-1}$, where $H$ is a polynomial
	matrix and $d$ is a scalar polynomial. Thus the problem of
	inverting a rational matrix can be reduced to inverting a
	polynomial matrix.

   \subsection{Literature Review} \label{sec:litreview}

%	\hspace{1em} Over the last several decades there have been
%	several methods proposed for the inversion of polynomial or
%	rational matrices.  This subsection provides an overview of those
%	methods and considers some of the advantages and disadvantages
%	of each approach.

	We begin with a review of 
	the special case of inverting the resolvent matrix
	$\left[ sI_{r}-H\right]$, 
	where $I$ is the unit matrix and
	$H$ is an $r\times r$ matrix of constants.  A process for
	finding
	$\left[ sI_{r}-H\right]^{-1}$
	is well documented and is known as Leverrier's
	algorithm~\cite{LEVERRIER}. Leverrier's algorithm as well as
	multiple extensions of this method ($i.e.$,~Leverrier-Faddeev
	algorithm~\cite{FADDEEV}, Souriau-Frame-Faddeev
	algorithm~\cite{ROSENBROCK}, {\it etc.}) serve as a basis for
	several matrix inversion techniques that follow.

	A number of different approaches for the inversion of polynomial
	matrices have been proposed over the past years. The assumptions
	made by different authors vary, and the results do not always
	have the same form. One of the first papers on this topic
	by Kosut~\cite{Kos} gives a direct algorithm based on a
	generalization of the Leverrier method. His method contains many
	polynomial operations and is not very general.

	Munro and Zakian~\cite{Mun} used the approach suggested by
	Kosut for the inversion of rational polynomial matrices by the
	Souriau-Frame-Faddeev algorithm. They considered two distinct
	methods in their paper: one based on the Gaussian elimination
	algorithm and the other one based on the Faddeev algorithm. Both
	methods involve performing direct computation of the adjoint
	matrix obtained by polynomial operations. However, their methods
	have downfalls in that such operations are lengthy, require a
	``large degree of involved bookkeeping''~\cite{Mun}, and are
	known to cause numerical problems.  In addition, operations in
	the field of rational functions utilized in both methods are
	not suitable for computer programming.

	Downs~\cite{D3,D4}  presented another approach, based
	on exact Gaussian elimination for matrices with integer
	coefficients. His method still contained many polynomial
	operations. Almost at the same time, Emre $et $ $al.$~\cite{Emre1}
	proposed a method of inversion of rational matrices based on
	Cramer's rule. The primary motivation for introducing this new
	method was to avoid polynomial arithmetic and to establish an
	algorithm systematically dealing with constant matrices. This
	approach required only simple arithmetic. Their method originally
	required restrictive assumptions that the polynomial matrix
	$H(s)$ is non-singular at $s=0$ and that the determinant is known
	at the outset. The problem of polynomial cancellation was not
	taken into account by Emre $et$ $al$. Downs was the one to point
	out the many restrictions and problems of their approach. In a
	series of papers that followed~\cite{D5,Emre3,Emre2}, most of
	these problems were resolved. Another point worth mentioning is
	that the inversion presented by Emre $et$ $al.$ was carried out
	by computing the determinants recursively.  This inversion method
	requires that the determinants of $(n+1)r$ constant matrices be
	evaluated in order to compute the determinant of a polynomial
	matrix $H(s)$ of order $r$ whose degree is $n$. Computation time
	for this method is large for large $r$ and $n$.

	Inouye~\cite{Ino} approached the problem of inverting polynomial
	matrices by generalizing Fadeev's recursive formula. His method
	is an extension of the Souriau-Frame-Faddeev algorithm. It does
	not require prerequisite determinants and requires operations
	with only constant matrices. It simultaneously determines the
	adjoint matrices and the coefficients of the determinants.
	The author showed that his algorithm is ``faster than the
	existing ones.'' One of the disadvantages of his method is that
	it works only for row- or column-proper polynomial matrices. It
	also gives the inverse in the minimal degree form only if the
	polynomial matrix to be inverted is not a special form, but
	it cannot ensure that the denominator and inversion numerator
	matrix obtained will be irreducible for a general case.

	Num~\cite{Num}, and much later Schuter and Hippe~\cite{Sch},
	proposed finding the inverse by generalizing known polynomial
	interpolation approaches. Both techniques require a careful
	choice of base points in order to avoid ill-conditioned
	equations. Both methods require complex computations. Another
	problem with interpolation methods is that only upper bounds
	for the degrees of the determinant and the adjoint are usually
	available. The interpolation thus involves redundant equations
	and polynomials with unnecessarily high degrees.

	In 1980 Bus\l owicz~\cite{Bus} published a paper with a method
	that is similar to the method proposed by Inouye~\cite{Ino} but
	more general in that it works for any non-singular polynomial
	matrix (as opposed to only row- or column-proper matrices). 
	Bus\l owicz's
	recursive algorithm computes the inverse by Cramer's rule,
	explicitly calculating the adjoint matrix and the determinant
	starting from the coefficient matrices. It requires operations
	with only constant matrices. The drawback of Bus\l owicz's
	algorithm is that the irreducible form cannot be ensured in
	general.

	Still another approach was developed independently by
	Zhang~\cite{Zhang} and Chang $et$ $al$.~\cite{Chang}. They both
	used a division algorithm for polynomial matrices to compute
	the inverse in irreducible form; however, their algorithms had
	increased computational complexity.

	Fragulis $et$ $al$.~\cite{Fra} proposed an algorithm that is
	a generalization of the Leverrier-type algorithm. The inverse
	is calculated using the recursive formula. Their approach does
	not seem to be significantly different from the one proposed by
	Bus\l owicz and does not provide any clear advantage over it.

	The method of  finding the inverse of a polynomial
	matrix based on state space realizations is used by Lin and
	Hsieh~\cite{Lin}. They compute neither the determinant nor the
	adjoint matrix. This method does not yield the exact solution,
	but experiments show that the algorithm gives accurate results
	for rational matrices that arise in the analysis and design
	of linear multivariable control systems.

   \subsection{Bus\l owicz's Algorithm}

%	\hspace{1em}This subsection explains the reasons for choosing
%	the algorithm developed by Bus\l owicz~\cite{Bus} for finding the
%	inverse of a polynomial matrix. It also describes the algorithm
%	itself.

   \subsubsection{Advantages of Bus\l owicz's Algorithm}

	\hspace{1pc}
	As mentioned in Section~\ref{sec:litreview},
	Bus\l owicz based his approach for finding the inverse of a
	polynomial matrix on a generalization of Fadeev's recursive
	formula. A similar method was proposed by Inouye~\cite{Ino} in
	1979, and it was the fastest and most general method at that
	time. 
	%One major drawback of the latter algorithm was that fact
	%that it was suitable only for row- or column-proper polynomial
	%matrices. 
	Bus\l owicz's algorithm is even more general, does
	not require knowledge of the determinant at the outset,     and
	works for any non-singular polynomial matrix. The only operations
	required are those on constant matrices.


	There were several reasons for choosing this method for
	implementation.  First, methods proposed before Bus\l owicz
	published his paper were obviously less general. Second, we
	wanted to implement an exact method, thus eliminating available
	algorithms that use approximations or interpolations such as
	\cite{Num}, \cite{Sch} and \cite{Lin}. Two other newer methods
	\cite{Chang,Zhang}, provide only slight improvement of the
	results in that they yield the inverse in the already irreducible
	form. However, both authors agree that their algorithms require
	additional ``complex computations''. Finally, Bus\l owicz
	claimed that his algorithm was suitable for computer programming
	\cite{Bus}.  We agreed with this assessment and also saw a
	potential for great speedup in the parallel implementation.


   \subsubsection{The Algorithm}

	\hspace{1pc}
	One of the general ways to compute the inverse of
	a matrix $H(s)$ is to evaluate the expression given by

	\begin{equation}
	H^{-1}(s)=\frac{{\rm adj}\, H(s)}{\det H(s)},  \label{H^-1}
	\end{equation}
	where ${\rm adj}\,H(s)$ denotes the adjacent matrix $H(s)$,
	which is found as

	\begin{equation}
	{\rm adj}\,H(s)=\sum\limits_{k=0}^{n(r-1)}Q_{k}s^{k},\ Q_{k}\in
	R^{r\times r}  \label{adj}
	\end{equation}
	and
	\begin{equation}
	\det H(s)=\sum\limits_{k=0}^{nr}a_{k}s^{k},\ a\in R.
	\label{detH}
	\end{equation}
	The problem of finding the inverse of a polynomial matrix comes
	down to finding an efficient method for calculating matrices
	$Q_{k}$, $ k=0,1,...,n(r-1)$, and the coefficients $a_{k},$
	$k=0,1,...,rn$, from the given matrices $H_{i}$, $i=0,1,...,n.$


	Bus\l owicz showed in his paper that the matrices $Q_{k}$ of
	${\rm adj}\,H(s)$ can be computed as
	\begin{equation}
	Q_{k}=(-1)^{r+1}R_{r-1,k}, \ k=0,1,...,n(r-1),  \label{Q}
	\end{equation}
	and the coefficients $a_{k}$ of $\det H(s)$ can be found using
	the formula
	\begin{equation}
	a_{k}=\frac{(-1)^{r+1}}{r} {\rm tr}G_{r,k}, \ k=0,1,...,nr
	\label{a}
	\end{equation}
	where tr denotes the trace of a matrix.

	The matrices $R_{r-1,k}$ and $G_{r,k}$ appearing in the above
	expressions are computed from the following iterative formulae:
	{\small
	\begin{eqnarray}
	G_{i,k} &=&H_{0}R_{i-1,k}+H_{1}R_{i-1,k-1}+...+H_{n}R_{i-1,k-n}, 
		\label{G} 
	\end{eqnarray}
	}
	\begin{equation}
	a_{i,k}=-\frac{1}{i} {\rm tr}G_{i,k},\ i=1,2,...,r, \ {\rm and}
		\label{a(i,k)}
	\end{equation}
	\begin{equation}
	R_{i,k}=G_{i,k}+I_{r}a_{i,k}, \ i=1,2,...,r-1 \ {\rm and }\ 
	k=0,1,...,in,  \label{R}
	\end{equation}

	\noindent
	where 
	\begin{equation}
	R_{0,k}=\left\{ 
		\begin{array}{ll}
		I_{r}		& {\rm for} \ k=0 \\ 
		R_{0,k}= 0 & {\rm for } \ k\neq 0
		\end{array}
	\right.
	\end{equation}

	\noindent
	and
	\begin{equation}
	R_{i,k}=0 \ {\rm for} \ j<0 \ {\rm or} \ k<0 \ {\rm or} \ k>jn.
	\label{Rconditions}
	\end{equation}

	\noindent
	In addition,
	\begin{equation}
	R_{r,k}=G_{r,k}+I_{r}a_{r,k}=0, \ k=0,1,...,rn.
	\end{equation}

	The algorithm proposed by Bus\l owicz for inversion of the
	polynomial matrices then consists of the following steps:
		\begin{enumerate}
		\item Using formulae (\ref{G})-(\ref{Rconditions}), calculate
		$G_{i,k},a_{i,k}$ and $R_{i,k}$ for $i=1,2,...,r-1$
		and $k=0,1,...,in$, and calculate from formula~(\ref{Q})
		the matrices $Q_{k} \ for \ k=0,1,...,n(r-1)$.

		\item Using formulae (\ref{Q}) and (\ref{adj}), calculate
		the matrix $adjH(s).$

		\item Calculate the matrices $G_{r,k}$ for $k=0,1,...,rn$
		from the formulae (\ref {G})-(\ref{R}) and the coefficients
		$a_{i,k}$ of the polynomial $\det H(s)$ from formula
		(\ref{a}).

		\item From formula (\ref{detH}) calculate the polynomial
		$\det H(s).$

		\item Calculate the matrix $H^{-1}(s)$ from formula 
		(\ref{H^-1}).

		\item The computations could be checked using the
		following equation:
		\[
		G_{r,k}+(-1)^{r}a_{k}I_{r}=0, \ k=0,1,...,rn.
		\]
		Note: In the case where the polynomial matrix has no
		inverse, the coefficients $a_{k}$, $k=0,1,...,rn$,
		calculated by formula (\ref{a}) will be equal to zero.

		\end{enumerate}

\section{Sequential and Parallel Algorithms}

   \subsection{Sequential Algorithm}

	%Figures \ref{fig:sequential} and \ref{fig:check} show 
	%important parts of our implementation of Bus\l owicz's 
	%sequential algorithm. 
	Before the sequential code is discussed 
	in detail, we introduce the variables that appear
	in the program.

		\vspace{-6pt}
		%\begin{itemize}
		\begin{list}{$\bullet$}{\parsep=0pt \itemsep=0pt}
		\item Each $H_{i}$ is an $r\times r$ matrix. $n+1$ of
		them form the matrix $H(s)$, the inverse of which is to
		be computed (see the first formula in 2.1). Each $H_{i}$
		is represented by a 3-dimensional array $H[i][x][y]$,
		where $i=0,1,...,n$, $x=0,1,...,r-1$ and $ y=0,1,...,r-1$.

		\item Each $G_{i,k}$ is an $r\times r$ matrix (see
		formula (6) in 2.3.2), that is represented by a
		4-dimensional array $G[i][k][x][y]$, where $i=0,2,...,r$,
		$k=0,1,...,rn$, $ x=0,1,...,r-1$ and $y=0,1,...,r-1.$

		\item Each $R_{i,k}$ is an $r\times r$ matrix
		(see formulae (8)-(11)), that is represented by a
		4-dimensional array $R[i][k][x][y]$, where $i=0,2,...,r$,
		$k=0,1,...,rn$, $ x=0,1,...,r-1$ and $y=0,1,...,r-1.$
		%i=1,..r-1; k=0,..in

		\item Each $a_{i,k}$ is a coefficient (see formula (7)), 
		that is represented by a 2-dimensional array
		$a[i][k]$, where $i=0,1,...,r$ and $k=0,1,...,rn$.

		\item Each $a_{k}$ is a coefficient of det$H(s)$ (see
		formula (3)), that is represented by a 1-dimensional
		array $alpha[k]$, where $k=0,1,...,rn$.

		\item Each $Q_{k}$ is one of $r\times r$ matrices
		that compose adj$H(s)$ (see formula (\ref{adj})). They are
		represented by a 3-dimensional array $Q[k][x][y]$,
		where $i=0,1,...,n$, $x=0,1,...,r-1$ and $ y=0,1,...,r-1$.

		\item $Ident[x][y]$ is an $r\times r$ unit matrix.
		\end{list}
		%\end{itemize}
		\vspace{-6pt}

        %        \begin{figure}
        %        \input{s3-sequential}
        %        \caption{\label{fig:sequential} The sequential algorithm.}
        %        \end{figure}  


%	Figure \ref{fig:check} presents the final steps of the algorithm
%	along with the condition that checks for the correctness of the
%	computations (formula (11)). At this point, checking
%	the value of $a_{k}$ can be used to verify if a given polynomial
%	matrix is nonsingular (nonzero determinant). The
%	coefficients $a_{k}$ for $k=0,1,....,rn$ will be equal to zero
%	if the matrix is singular.

	Several changes had to be made to the algorithm outlined in
	Bus\l owicz's paper.  First of all, step 1 of the algorithm (see
	Section 2.3.2) specifies the calculation of $G_{i,k},$ $a_{i,k}$
	and $R_{i,k}$ for $i=1,2,...,r-1$ and $k=0,1,...,in$. However,
	our algorithm separates this step into two steps because
	different approaches are required for calculating the variables
	for $i=1$ and $i>1$. Thus, the algorithm first calculates $G_{i,k},$
	$a_{i,k}$ and $R_{i,k}$ for $i=1$ and $k=0,1,...in$ and then
	continues with the rest of the calculations for $i>1$. Second,
	as mentioned above, steps 1 and 3 of the algorithm are combined.
	Computations of $Q_{k}$ are delayed until everything else in
	steps 1 and 3 is calculated. Hence, step 2 is also performed
	later in the program. % (see Figure~\ref{fig:check}).

%                \begin{figure}[h]              
%                \input{s3-check}
%                \caption{\label{fig:check} Calculate and check the
%			coefficients of the inverse matrix.}
%                \end{figure}    
	
   \subsection{Parallel Algorithm}

	Armed with a working sequential version of
	Bus\l owicz's algorithm, we began analyzing program dependencies
	in order to decide on parallelization techniques. This section
	presents the details of the parallel implementation of 
	Bus\l owicz's algorithm and outlines the changes made 
	and challenges encountered in the process of
	parallelizing the sequential version of the program. 

                \begin{figure}
                \input{s3-parallel}
                \caption{\label{fig:parallel} The parallel algorithm.}
                \end{figure}  

	The parallel algorithm is given in Figure~\ref{fig:parallel}. Two
	new variables appear in this parallel code segment: {\it NUMPROC}
	is the number of processors used for calculations, and {\it
	p} is the distinct number associated with each processor {\it
	p=}0{\it ,...,NUMPROC--}1. Because a SPMD (single program multiple data)
	programming structure was used, each processor executed the code
	shown on its portion of data. This algorithm was implemented
	for both distributed memory and shared memory machines.

	Implementing the program in a shared memory environment allowed
	the creation of variables that could be accessed directly
	by every process.     In the shared memory environment, the
	shared memory segments are created using the {\it shmget()}
	system calls. Because there is a limit on the number of shared
	memory segments that can be created, 2-, 3- and 4-dimensional
	matrices are represented as 1-dimensional arrays. Shared memory
	segments are attached to the data segments of the calling
	process before performing calculations using {\it shmat()}
	and then are detached after the computations are completed.  In a
	distributed memory environment, variables computed by one process
	that are required by another have to be passed explicitly by
	the program.  MPI 
	was chosen to provide the functionality required for programming
	in a distributed memory environment.

	Because most parallelism occurs in the loops, a first attempt to
	parallelize any code typically requires looking for independent loops
	that can be split across multiple processors. Independent loops
	can be executed in any order without affecting the semantics
	of the program. There are several $for$ loops in the sequential
	program, but, unfortunately, not all are independent. Clearly,
	the large outer $i$-loop (line 15 in Figure~\ref{fig:parallel})
	is not independent.  Calculations in the $i^{th}$ iteration depend
	on the results of the previous $(i-1)^{st}$ iteration because the
	$i^{th}$ iteration involves operations on $R_{i-1,k}$ (line 31).
	However, the $k$-loops 
	are independent and can
	be parallelized (lines 1 and 17 in Figure~\ref{fig:parallel}).
	This parallelization was accomplished by performing striped 
	partitioning of the matrices across the processors.

	The presence of the dependent loops in the program created another
	challenge: synchronization of the processes and data. Looking at lines
	15 and 17, one can notice that
	there are two nested loops, with an independent loop inside 
	the dependent one. To make matters worse, the number of inner
	iterations ($ll$-loop on line 22 varies.
	The dependence on $k$ can be seen in lines 19-22
	Thus there are so-called partially
	parallel loops, $i.e.$, loops whose parallelization requires
	synchronization to ensure that iterations are executed in the
	correct order and produce the correct output.  Specifically, no
	process can go on with execution of the $i^{th}$ iteration
	until every other process had completed its $(i-1)^{st}$
	iteration. 
	In a shared memory environment, this synchronization is 
	accomplished by placing
	a barrier before starting the next iteration of the $i$-loop
	(line 42 in Figure~\ref{fig:parallel}). 
	Another barrier
	is placed on line 14 (Figure~\ref{fig:parallel}) to synchronize
	the processes, making sure that $R_{1,k}$ is calculated for all
	values of $k$ before continuing with calculations for $i>1.$
	%
	In a distributed memory environment, synchronization is as important,
	but the data calculated by the processes must also be explicitly
	exchanged so it can be used in the next iteration by other
	processors.  This explicit exchange using MPI communication calls
	implicitly  accomplishes  the process synchronization required.
	The rest of the program is left unchanged from the sequential
	version.  

\section{Results}

	The distributed memory code using MPI was tested on three
	different platforms.  The first was a network of SGI O2
	workstations.  These machines have  180MHz MIPS R5000 processors
	with 320MB ram.  The second platform was a network of Pentium IV
	workstations, each with a 1.8 GHz processor and 256MB ram.  Both of
	these first two platforms have a standard 100 megabit network. The
	third platform is a cluster of Pentium IV Xeon processors, each
	at 2.2 GHz with 2GB of ram.  The communication network is
	Myrinet 2000.

	The shared memory code was tested on two different platforms.
	The first platform was an SGI Power Challenge 10000. This machine
	is a shared memory multiprocessor, consisting of 8 MIPS R10000
	and 1 GB of ram. The second platform was an SGI Origin 2000.
	This machine is a shared memory multiprocessor, consisting of 16
	MIPS R12000 300MHz processors and 2 GB of ram.

	The complexity of the implemented sequential algorithm is
	${\cal O}(n^{2}r^5)$.  Thus the run times increase rapidly as the
	problem size increases. The problem size can be increased either
	by scaling the degree of the polynomial matrix $n$, the size of
	the matrix $r$, or both. We considered only real-life cases in the
	field of control theory, where neither the size of the
	matrix nor the degree of the polynomial typically exceeds $25$.  
	The figures
	in this section illustrate the computation times of a sequential
	program under various conditions as well as computation times
	obtained on the distributed and shared memory platforms  
	with various numbers of processors.
	For comparison of the platforms, the
	sequential run times for the largest problem size are provided
	in Table~\ref{tab:table-seq}.

	\begin{table}
	\begin{center}
	\begin{tabular}{|c|crc|}\hline
	Platform & \multicolumn{3}{c|}{Sequential Time  (sec)}  \\ \hline
	SGI O2 NOW & \hspace{0.3in} &2645.30& \\ \hline
	P IV  NOW & &29.99& \\ \hline
	P IV Cluster & &18.75& \\ \hline
	SGI Power Challenge & &913.99& \\ \hline
	SGI Origin 2000 & &552.95& \\ \hline
	\end{tabular}
	\end{center}
	\caption{Sequential run times for different 
	platforms ($n=25$, $r=25$).}
	\label{tab:table-seq}
	\end{table}

	\subsection{Distributed Memory Implementation}

	The results obtained on the distributed memory platforms were
	not as expected.  On the first two platforms, the algorithm provided
	some speedup on two processors for all problem sizes.  However,
	when more processors were added, speedup was obtained only on the
	larger problem sizes, and the efficiency decreased drastically.
	On the third platform we obtained speedup across all processors,
	but the efficiency was poor for more than two processors.
	The efficiency of the parallel algorithm on the third distributed
	memory platform is shown in Table~\ref{tab:cortex-eff}.

	\begin{table}[h]
	\begin{center}
	\begin{tabular}{|c|c|c|c|} \hline
	Processors  & 2 & 4 & 8 \\ \hline
	Efficiency &  89.6\% & 68.8\% & 49.7\% \\ \hline
	\end{tabular}
	\end{center}
	\caption{Efficiency, P IV cluster ($n=25$, $r=25$).}
	\label{tab:cortex-eff}
	\end{table}

%\newpage
	\subsection{Shared Memory Implementation}

	The results obtained on both shared memory platforms were
	outstanding.  Figure~\ref{fig:fig1} represents the average
	computation times (in seconds) on the first shared memory platform
	for the case when the degree of the polynomial matrix was fixed
	($n=10$) and the matrix size was varied from $r=2$ to $r=25$.
	As we saw earlier, the algorithm is ${\cal O}(n^2r^5)$. 
	Figure~\ref{fig:8procmesh} provides
	a computation-time surface showing how the changes of problem
	size ($n$ and $r$) on 8 processors cause the computation time
	to increase drastically.

        \begin{figure}[h]
        \epsfxsize=4in
        \centerline{\epsffile{figs/nconst.ps}}
        \caption{Run times, SGI Power Challenge ($n=10$, $r$ is varied).} \label{fig:fig1}
        \end{figure}

        \begin{figure}
        \epsfxsize=4in
        \centerline{\epsffile{figs/8procmesh.ps}}
        \caption{Run times, SGI Power Challenge, 8 processors, varied $n$ and $r$.} 
		\label{fig:8procmesh}
        \end{figure}

	Table~\ref{tab:table1} presents the average
	efficiency of our algorithm for $n=25$ and $r=25$ on both shared
	memory platforms.

	\begin{table}[h]
	\begin{center}
	\begin{tabular}{|c|c|c|c|c|c|c|c|c|}\hline
	Processors & 2 & 3 & 4 \\ \hline
	SGI Power Challenge & 99.65\% & 98.4\% & 98.2\\ \hline
	SGI Origin 2000 & 99.9\% & 101.0\% & 98.7\% \\ \hline
	\end{tabular}

	\begin{tabular}{|c|c|c|c|c|c|c|c|c|}\hline
	5 & 6 & 7 & 8 & 16\\ \hline
	98.2\% & 97.9\% & 97.9\% & 95.8\% & n/a \\ \hline
	100.5\% & 99.0\% & 98.7\% & 98.2\% & 93.8\% \\ \hline
	\end{tabular}
	\end{center}
	\caption{Efficiency, shared memory platforms ($n=25$, $r=25$).}
	\label{tab:table1}
	\end{table}


\section{Conclusions and Future Work}

%	\hspace{1em} 
%	The results obtained reflect the fact that this problem is highly
%	data intensive.  Therefore, the need to transfer large amounts
%	of data between processors after each iteration overemphasized
%	the weaknesses of a distributed memory environment, making the
%	speedup minimal. At the same time the fast shared memory
%	implementation on the SGI machines  provided excellent performance
%	since the hardware handled the communication and exchange of data.

	The results obtained reflect the major difference between shared
	and distributed memory environments. Excellent performance in
	the shared memory environment shows that an efficient parallel
	algorithm can be  designed for the highly data intensive problem
	of polynomial matrix inversion. These excellent results are due
	to the fact that communication and exchange of data were handled
	in the hardware in the fast shared memory implementation on the
	SGIs. In the case of a distributed memory environment, however,
	the dependencies of the original algorithm required transfer of
	large amounts of data between processors after each iteration,
	thus emphasizing the weaknesses of that environment and leading
	to a minimal speedup.

	In the distributed memory environment, three platforms were used.
	%The first had slow CPUs and a slow network interface.  The second
	%had fast CPUs and the same slow network interface.  The third
	%platform had the fastest CPUs and a much faster network interface.
	Even with faster CPUs and faster networks, the efficiency did
	not improve much.  These poor results were due to the fact that
	the communication costs far outweighed the performance gain of
	multiple processors.  Because the problem sizes were limited to
	real-life applications, data sets of much larger sizes were not
	considered. 
	%However, if they were, performance would be expected
	%to increase.  So when the need for such problems arises, this
	%distributed memory implementation may be more appropriate.

	In the shared memory environment, near linear speedup was achieved
	on both platforms as can be seen in Table~\ref{tab:table1}.
	This speedup means that the algorithm can take full advantage of
	the distributed computing power in the shared memory environment
	as the size of the problem increases.  This great speedup can
	be attributed to the high degree of parallelism we were able to
	extract from the original algorithm as well as to the elimination
	of the need for the program to perform the communication explicitly.
	The efficiency over 100\% on the Origin 2000 can be attributed
	to the architecture design where the processors are 2 CPUs to
	a card and the cache is 8 MB per CPU.  This design allows the
	cache for each CPU to be used by either processor on the card,
	and allows one CPU to use the cache of both when the other CPU
	is idle.  The NUMA memory architecture, which is tightly coupled
	to the CPU cards, also contributes to the behavior.

	We have presented a parallel algorithm for computing
	the inverses of polynomial matrices.	We have performed an
	exhaustive search of all available algorithms for polynomial
	matrix inversion and  based our parallel algorithm on the method
	proposed in~\cite{Bus}.  We have implemented the sequential
	version as well as two parallel versions.  Based on the shared
	memory implementation results, we conclude that this new
	parallel algorithm is very efficient but should not be used on
	a distributed memory environment for small problem sizes.

	We see this work continuing in a variety of different ways. 
	First, there is an algorithm for inverting multivariable polynomial
	matrices~\cite{ozguler-80} that has never been parallelized.
	Second, we anticipate evaluating the distributed memory
	implementation in order to minimize message passing,
	thus improving performance.  Third, larger problem sizes may
	also be considered in order to determine when  the computation
	time overtakes the communication overhead.


%\baselineskip=12pt
\bibliography{paper}
\end{document}