\chapter{Process Topologies}
\label{sec:topol}
\label{chap:topol}
%Version of 4/27/95
\section{Introduction}
This chapter discusses the \MPI/ topology mechanism. A topology is an extra,
optional attribute that one can give to an intra-communicator; topologies
cannot be added to inter-communicators. A topology can provide a convenient
naming mechanism for the processes of a group (within a communicator), and
additionally, may assist the runtime system in mapping the processes onto
hardware.
As stated in Chapter~\ref{chap:context},
a process group in \MPI/ is a collection of {\tt n} processes. Each process in
the group is assigned a rank between {\tt 0} and {\tt n-1}. In many parallel
applications a linear ranking of processes does not adequately reflect the logical
communication pattern of the processes (which is usually determined by the
underlying problem geometry and the numerical algorithm used). Often the
processes are arranged in topological patterns such as two- or
three-dimensional grids. More generally, the logical process arrangement is
described by a graph. In this chapter we will refer to this logical process
arrangement as the ``virtual topology.''
A clear distinction must be made between the virtual process topology and the
topology of the underlying, physical hardware. The virtual topology can be
exploited by the system in the assignment of processes to physical processors,
if this helps to improve the communication performance on a given machine. How
this mapping is done, however, is outside the scope of \MPI/. The description
of the virtual topology, on the other hand, depends only on the application,
and is machine-independent. The functions that are proposed in this chapter
deal only with machine-independent mapping.
\begin{rationale}
Though physical mapping is not discussed, the
existence of the virtual topology information may be used as advice by
the runtime system.
There are well-known techniques for mapping grid/torus structures to
hardware to\-po\-logies such as hypercubes or grids. For more
complicated graph structures good heuristics often yield nearly optimal
results \cite{suprenum}. On the other hand,
if there is no way for the user to specify the logical process
arrangement as a ``virtual topology,'' a random mapping is most
likely to result. On some machines, this will lead to
unnecessary contention in the interconnection network.
Some details about predicted and measured
performance improvements that result from good process-to-processor
mapping on modern wormhole-routing architectures can be found in
\cite{wormhole1,wormhole2}.
Besides possible performance benefits, the virtual topology can
function as a convenient process-naming structure, with significant
benefits for program readability and notational power in
message-passing programming.
\end{rationale}
%
\section{Virtual Topologies}
%
The communication pattern of a set of processes can be represented by a
graph. The nodes stand for the processes, and the edges connect processes that
communicate with each other. \MPI/ provides message-passing between any pair
of processes in a group. There is no requirement for opening a channel
explicitly. Therefore, a ``missing link'' in the user-defined process graph
does not prevent the corresponding processes from exchanging messages. It
means rather that this connection is neglected in the virtual topology. This
strategy implies
that the topology gives no convenient way of naming this pathway of
communication. Another possible consequence is that an automatic mapping tool
(if one exists for the runtime environment) will not take account of this edge
when mapping. Edges in the communication graph are not weighted, so that
processes are either simply connected or not connected at all.
\begin{rationale}
Experience with similar
techniques in PARMACS \cite{parmacs1,parmacs2}
shows that this information is usually sufficient for a good mapping.
Additionally, a more precise specification is more difficult
for the user to set up, and it would make the interface
functions substantially more complicated.
\end{rationale}
Specifying the virtual
topology in terms of a graph is sufficient for all applications. However, in
many applications the graph structure is regular, and the detailed set-up
of the graph would be inconvenient for the user and might be less
efficient at
run time. A large fraction of all parallel applications use process topologies
like rings, two- or higher-dimensional grids, or tori. These structures are
completely defined by the number of dimensions and the numbers of processes in
each coordinate direction. Also, the mapping of grids and tori is generally
an easier problem than that of general graphs. Thus, it is desirable to
address these cases explicitly.
Process coordinates in a cartesian structure begin their numbering at $0$.
Row-major numbering is always used for the processes in a
cartesian structure. This means that, for example, the relation
between group rank and coordinates for four processes in
a $(2 \times 2)$ grid is as follows.\\[2.0ex]
\hspace*{\parindent}
\begin{tabular}{ll}
coord (0,0): & rank 0 \\
coord (0,1): & rank 1 \\
coord (1,0): & rank 2 \\
coord (1,1): & rank 3 \\
\end{tabular}
%
\section{Embedding in \MPI/}
%
The support for virtual topologies as defined in this chapter is
consistent with other parts of \MPI/, and, whenever possible,
makes use of functions that are defined elsewhere.
Topology information is associated with communicators. It is added
to communicators using the caching mechanism described in
Chapter~\ref{chap:context}.
%The
%topology functions complement and interoperate with the collective
%communication functions described in Chapter~\ref{chap:coll}.
% How? (S Otto)
%
%
% The following is confusing and sounds contradictory to some statements
% made previously. Since it isn't that important...we take it out
% (S Otto)
%
%The current design of the process topology functions
%assumes that
%\MPI/ is free to choose any process in a group for any
%location in the topology. There are no special processes that
%have to be aligned with topology substructures. This assumption fits
%best to a ``local SPMD model'' in which all processes in the same group
%execute the same program.
%
%
\section{Overview of the Functions}
%
The functions \mpifunc{MPI\_GRAPH\_CREATE} and \mpifunc{MPI\_CART\_CREATE} are used
to create general (graph) virtual topologies and cartesian topologies, respectively.
These topology creation functions are collective. As with other collective
calls, the program must be written to work correctly, whether the call
synchronizes or not.
The topology creation functions take as input an existing communicator
\mpiarg{comm\_old},
which defines the set of processes on which the topology is to be
mapped. A new communicator \mpiarg{comm\_topol} is created that carries
the topological structure as cached information
(see Chapter~\ref{chap:context}). In analogy to function
\mpifunc{MPI\_COMM\_CREATE}, no cached information propagates from
\mpiarg{comm\_old} to \mpiarg{comm\_topol}.
\func{MPI\_CART\_CREATE} can be used to describe cartesian structures of
arbitrary dimension. For each coordinate direction one specifies whether the
process structure is periodic or not.
Note that an $n$-dimensional hypercube is
an $n$-dimensional torus with 2 processes per coordinate direction. Thus,
special support for hypercube structures is not necessary. The local
auxiliary function \mpifunc{MPI\_DIMS\_CREATE} can be used to compute a balanced
distribution of processes among a given number of dimensions.
\begin{rationale}Similar functions are contained in
EXPRESS \cite{express} and PARMACS. \end{rationale}
The function \mpifunc{MPI\_TOPO\_TEST} can be used to inquire about the
topology associated with a communicator. The topological information can be
extracted from the communicator using the functions
\mpifunc{MPI\_GRAPHDIMS\_GET} and \mpifunc{MPI\_GRAPH\_GET}, for general
graphs, and \mpifunc{MPI\_CARTDIM\_GET} and \mpifunc{MPI\_CART\_GET}, for
cartesian topologies. Several additional functions are provided to manipulate
cartesian topologies: the functions \mpifunc{MPI\_CART\_RANK} and
\mpifunc{MPI\_CART\_COORDS} translate cartesian coordinates into a group rank,
and vice-versa; the function \mpifunc{MPI\_CART\_SUB} can be used to extract a
cartesian subspace (analogous to \mpifunc{MPI\_COMM\_SPLIT}). The function
\mpifunc{MPI\_CART\_SHIFT} provides the information needed to communicate with
neighbors in a cartesian dimension. The two functions
\linebreak
\mpifunc{MPI\_GRAPH\_NEIGHBORS\_COUNT} and \mpifunc{MPI\_GRAPH\_NEIGHBORS} can
be used to extract the neighbors of a node in a graph. The function
\mpifunc{MPI\_CART\_SUB} is collective over the input communicator's group;
all other functions are local.
Two additional functions, \mpifunc{MPI\_GRAPH\_MAP} and
\mpifunc{MPI\_CART\_MAP} are presented in the last section. In general these
functions are not called by the user directly. However, together with the
communicator manipulation functions presented in Chapter~\ref{chap:context},
they are sufficient to implement all other topology functions.
Section \ref{subsec:topol-lowlevel} outlines such an implementation.
%
\section{Topology Constructors}
\label{subsec:topol-construct}
\subsection{Cartesian Constructor}
\begin{funcdef}{MPI\_CART\_CREATE(comm\_old, ndims, dims, periods, reorder, comm\_cart)}
\funcarg{\IN}{ comm\_old}{ input communicator (handle)}
\funcarg{\IN}{ ndims}{ number of dimensions of cartesian grid (integer)}
\funcarg{\IN}{ dims}{ integer array of size \mpiarg{ndims} specifying the number of processes in each dimension}
\funcarg{\IN}{ periods}{ logical array of size \mpiarg{ndims} specifying whether the grid is periodic (\const{true}) or not (\const{false}) in each dimension}
\funcarg{\IN}{ reorder}{ ranking may be reordered (\const{true}) or not (\const{false}) (logical)}
\funcarg{\OUT}{ comm\_cart}{ communicator with new cartesian topology (handle)}
\end{funcdef}
\mpibind{MPI\_Cart\_create(MPI\_Comm~comm\_old, int~ndims, int~*dims, int~*periods, int~reorder, MPI\_Comm~*comm\_cart)}
\mpifbind{MPI\_CART\_CREATE(COMM\_OLD, NDIMS, DIMS, PERIODS, REORDER, COMM\_CART, IERROR)\fargs INTEGER COMM\_OLD, NDIMS, DIMS(*), COMM\_CART, IERROR \\ LOGICAL PERIODS(*), REORDER}
\mpifunc{MPI\_CART\_CREATE} returns a handle to a new communicator to which the
cartesian topology information is attached. If \mpiarg{reorder = false} then
the rank of each process in the new group is identical to its rank in the old
group. Otherwise, the function may reorder the processes (possibly so as to
choose a good embedding of the virtual topology onto the physical machine).
If the total size of the cartesian grid is smaller than the size of the group
of \mpiarg{comm\_old}, then some processes are returned \const{MPI\_COMM\_NULL}, in
analogy to \mpifunc{MPI\_COMM\_SPLIT}. The call is erroneous if it specifies
a grid that is larger than the group size.
%\begin{users}
%The option of not reordering the process ranks can be useful in the
%case of an overlay of different topologies. The user can then fully
%control which location in the first topology coincides with which
%location in the second one. The price that has to be paid
%in this case is that the process arrangement can only be optimized for
%one topology.\end{users}
%
\subsection{Cartesian Convenience Function: \func{MPI\_DIMS\_CREATE}}
For cartesian topologies, the function \func{MPI\_DIMS\_CREATE} helps the user
select a balanced distribution of processes per coordinate direction,
depending on the number of processes in the group to be balanced and optional
constraints that can be specified by the user. One use is to partition all
the processes (the size of \const{MPI\_COMM\_WORLD}'s group) into an
$n$-dimensional topology.
\begin{funcdef}{MPI\_DIMS\_CREATE(nnodes, ndims, dims)}
\funcarg{\IN}{ nnodes}{ number of nodes in a grid (integer)}
\funcarg{\IN}{ ndims}{ number of cartesian dimensions (integer)}
\funcarg{\INOUT}{ dims}{ integer array of size {\tt ndims} specifying the number of nodes in each dimension }
\end{funcdef}
\mpibind{MPI\_Dims\_create(int~nnodes, int~ndims, int~*dims)}
\mpifbind{MPI\_DIMS\_CREATE(NNODES, NDIMS, DIMS, IERROR)\fargs INTEGER NNODES, NDIMS, DIMS(*), IERROR}
The entries in the array \mpiarg{dims} are set to describe a cartesian grid
with \mpiarg{ndims} dimensions and a total of \mpiarg{nnodes} nodes. The
dimensions are set to be as close to each other as possible, using an
appropriate divisibility algorithm. The caller may
further constrain the operation of this routine by specifying elements
of array {\tt dims}. If {\tt dims[i]} is set to a positive number, the
routine will not modify the number of nodes in dimension {\tt i};
only those entries where {\tt dims[i] = 0} are modified by the call.
Negative input values of {\tt dims[i]} are erroneous.
An error will occur if {\tt nnodes} is not a multiple of
$\displaystyle \prod_{i, dims[i]\neq 0} dims[i]$.
The entries of {\tt dims} that are set by the call will be in
non-increasing order. Array {\tt dims} is suitable for use as input to
routine \func{MPI\_CART\_CREATE}. \mpifunc{MPI\_DIMS\_CREATE} is local.
\vspace*{1.0ex}
\begin{example} {\rm
\label{topol-exA}
\begin{tabular}{|l|l|l|}
\hline
{\tt dims} & function call & {\tt dims} \\
before call & & on return \\
\hline
(0,0) & \mpifunc{MPI\_DIMS\_CREATE(6, 2, dims)} & (3,2) \\
(0,0) & \mpifunc{MPI\_DIMS\_CREATE(7, 2, dims)} & (7,1) \\
(0,3,0) & \mpifunc{MPI\_DIMS\_CREATE(6, 3, dims)} & (2,3,1) \\
(0,3,0) & \mpifunc{MPI\_DIMS\_CREATE(7, 3, dims)} & erroneous call \\
\hline
\end{tabular}
\vspace*{4.0ex}
} \end{example}
\subsection{General (Graph) Constructor}
\begin{funcdef}{MPI\_GRAPH\_CREATE(comm\_old, nnodes, index, edges, reorder, comm\_graph)}
\snir
\funcarg{\IN}{ comm\_old}{ input communicator (handle)}
\rins
\funcarg{\IN}{ nnodes}{ number of nodes in graph (integer)}
\funcarg{\IN}{ index} {array of integers describing node degrees (see below)}
\funcarg{\IN}{ edges} {array of integers describing graph edges (see below)}
\funcarg{\IN}{ reorder}{ ranking may be reordered (\const{true}) or not (\const{false}) (logical)}
\funcarg{\OUT}{ comm\_graph}{ communicator with graph topology added (handle)}
\end{funcdef}
\mpibind{MPI\_Graph\_create(MPI\_Comm~comm\_old, int~nnodes, int~*index, int~*edges, int~reorder, MPI\_Comm~*comm\_graph)}
\mpifbind{MPI\_GRAPH\_CREATE(COMM\_OLD, NNODES, INDEX, EDGES, REORDER, COMM\_GRAPH, IERROR)\fargs INTEGER COMM\_OLD, NNODES, INDEX(*), EDGES(*), COMM\_GRAPH, IERROR \\ LOGICAL REORDER}
\func{MPI\_GRAPH\_CREATE} returns a handle to a new communicator to which the
graph topology information is attached. If \mpiarg{reorder = false} then the
rank of each process in the new group is identical to its rank in the old
group. Otherwise, the function may reorder the processes. If the size,
\mpiarg{nnodes}, of the graph is smaller than the size of the group of
\mpiarg{comm\_old}, then some processes are returned \const{MPI\_COMM\_NULL}, in
analogy to \mpifunc{MPI\_CART\_CREATE} and \mpifunc{MPI\_COMM\_SPLIT}. The call
is erroneous if it specifies a graph that is larger than the group size of the
input communicator.
The three parameters \mpiarg{nnodes, index} and \mpiarg{edges} define the graph
structure.
\mpiarg{nnodes} is the number of nodes of the graph. The nodes are numbered
from {\tt 0} to {\tt nnodes-1}.
The {\tt i}th entry of array \mpiarg{index} stores the total number of
neighbors of the first {\tt i} graph nodes. The lists of neighbors of
nodes {\tt 0, 1, \ldots, nnodes-1} are stored in consecutive locations in array
\mpiarg{edges}. The array \mpiarg{edges} is a flattened representation
of the edge lists.
The total number of entries in \mpiarg{index} is \mpiarg{nnodes} and
the total number of entries in \mpiarg{edges} is equal to the number of
graph edges.
The definitions of the arguments {\tt nnodes}, {\tt index}, and
{\tt edges} are illustrated with the following simple example.
\begin{example} {\rm
\label{topol-exB}
Assume there are four processes 0, 1, 2, 3 with the following
adjacency matrix: \\[2.0ex]
%
\hspace*{\parindent}
\begin{tabular}{|c|l|}
\hline
process & neighbors \\
\hline
0 & 1, 3 \\
1 & 0 \\
2 & 3 \\
3 & 0, 2 \\
\hline
\end{tabular}
\vspace*{2.0ex}
Then, the input arguments are: \\[2.0ex]
%
\hspace*{\parindent}
\begin{tabular}{ll}
nnodes = & 4 \\
index = & 2, 3, 4, 6 \\
edges = & 1, 3, 0, 3, 0, 2
\end{tabular}
\vspace*{2.0ex}
Thus, in C, {\tt index[0]} is the degree of node zero, and {\tt index[i] -
index[i-1]} is the degree of node {\tt i, i=1, \ldots, nnodes-1};
the list of neighbors of node zero is stored in {\tt edges[j]}, for
$\tt 0 \leq j \leq index[0]-1$ and the list of neighbors of node {\tt i},
$\tt i > 0$,
is stored in {\tt edges[j]}, $\tt index[i-1] \leq j \leq index[i]-1$.
In Fortran, {\tt index(1)} is the degree of node zero, and {\tt index(i+1) -
index(i)} is the degree of node {\tt i, i=1, \ldots, nnodes-1};
the list of neighbors of node zero is stored in {\tt edges(j)}, for
$\tt 1 \leq j \leq index(1)$ and the list of neighbors of node
{\tt i}, $\tt i > 0$,
is stored in {\tt edges(j)}, $\tt index(i)+1 \leq j \leq index(i+1)$.
} \end{example}
%The matrix is symmetric, i.~e., if node
%{\em i} is a neighbor of node {\em j}, than node {\em j} is also a
%neighbor of node {\em i}. This property can be used for an internal
%checking of the input arguments.
%ANY REASON TO ASSUME UNDIRECTED GRAPHS (and to list each edge twice)?
%
%
%\discuss{
%We opted to have nodes numbered from {\tt 0} to {\tt nnodes-1}, both in C and
%Fortran, so that the same {\tt index} and {\tt edges} lists represent a graph in
%C or in Fortran. However, since arrays start at zero in C and one in Fortran,
%this means that the correspondence between array indices and node numbers is
%different in C and Fortran.
%}
%
\begin{implementors}
The following topology information is likely to be stored with a communicator:
\begin{itemize}
\item Type of topology (cartesian/graph),
\item For a cartesian topology:
\begin{enumerate}
\item \verb+ndims+ (number of dimensions),
\item \verb+dims+ (numbers of processes per coordinate direction),
\item \verb+periods+ (periodicity information),
\item \verb+own_position+ (own position in grid, could also be computed
from rank and dims)
\end{enumerate}
\item For a graph topology:
\begin{enumerate}
\item \verb+index+,
\item \verb+edges+,
\end{enumerate}
which are the vectors defining the graph structure.
\end{itemize}
For a graph structure the number of nodes is equal to the number of processes
in the group. Therefore, the number of nodes does not have to be stored explicitly. An
additional zero entry at the start of array \mpiarg{index} simplifies
access to the topology information.
\end{implementors}
\subsection{Topology inquiry functions}
\label{subsec:topol-inquiry}
If a topology has been defined with one of the above functions, then the topology
information can be looked up using inquiry functions. They are all local
calls.
\begin{funcdef}{MPI\_TOPO\_TEST(comm, status)}
\funcarg{\IN}{ comm}{ communicator (handle)}
\funcarg{\OUT}{ status}{ topology type of communicator {\tt comm} (choice)}
\end{funcdef}
\mpibind{MPI\_Topo\_test(MPI\_Comm~comm, int~*status)}
\mpifbind{MPI\_TOPO\_TEST(COMM, STATUS, IERROR)\fargs INTEGER COMM, STATUS, IERROR}
The function \func{MPI\_TOPO\_TEST} returns the type of topology that
is assigned to a communicator.
The output value {\tt status} is one of the following:
\begin{constlist}
\constitem {MPI\_GRAPH}{ graph topology}
\constitem {MPI\_CART}{ cartesian topology}
\constitem {MPI\_UNDEFINED}{ no topology}
\end{constlist}
\begin{funcdef}{MPI\_GRAPHDIMS\_GET(comm, nnodes, nedges)}
\funcarg{\IN}{ comm}{ communicator for group with graph structure (handle)}
\funcarg{\OUT}{ nnodes}{ number of nodes in graph (integer) (same as number of processes in the group)}
\funcarg{\OUT}{ nedges}{ number of edges in graph (integer)}
\end{funcdef}
\mpibind{MPI\_Graphdims\_get(MPI\_Comm~comm, int~*nnodes, int~*nedges)}
\mpifbind{MPI\_GRAPHDIMS\_GET(COMM, NNODES, NEDGES, IERROR)\fargs INTEGER COMM, NNODES, NEDGES, IERROR}
Functions \func{MPI\_GRAPHDIMS\_GET} and
\func{MPI\_GRAPH\_GET} retrieve the graph-topology information
that was associated with a communicator by
\func{MPI\_GRAPH\_CREATE}.
The information provided by \func{MPI\_GRAPHDIMS\_GET} can be used
to dimension the
vectors {\tt index} and {\tt edges} correctly for the following call
to \func{MPI\_GRAPH\_GET}.
\begin{funcdef}{MPI\_GRAPH\_GET(comm, maxindex, maxedges, index, edges)}
\funcarg{\IN}{ comm}{ communicator with graph structure (handle)}
\funcarg{\IN}{ maxindex}{ length of vector \mpiarg{index} in the calling program \\ (integer)}
\funcarg{\IN}{ maxedges}{ length of vector \mpiarg{edges} in the calling program \\ (integer)}
\funcarg{\OUT}{ index}{ array of integers containing the graph structure (for details see the definition of \func{MPI\_GRAPH\_CREATE})}
\funcarg{\OUT}{ edges}{ array of integers containing the graph structure }
\end{funcdef}
\mpibind{MPI\_Graph\_get(MPI\_Comm~comm, int~maxindex, int~maxedges, int~*index, int~*edges)}
\mpifbind{MPI\_GRAPH\_GET(COMM, MAXINDEX, MAXEDGES, INDEX, EDGES, IERROR)\fargs INTEGER COMM, MAXINDEX, MAXEDGES, INDEX(*), EDGES(*), IERROR}
%
%
%
\begin{funcdef}{MPI\_CARTDIM\_GET(comm, ndims)}
\funcarg{\IN}{ comm}{ communicator with cartesian structure (handle)}
\funcarg{\OUT}{ ndims}{ number of dimensions of the cartesian structure (integer)}
\end{funcdef}
\mpibind{MPI\_Cartdim\_get(MPI\_Comm~comm, int~*ndims)}
\mpifbind{MPI\_CARTDIM\_GET(COMM, NDIMS, IERROR)\fargs INTEGER COMM, NDIMS, IERROR}
The functions \func{MPI\_CARTDIM\_GET} and
\func{MPI\_CART\_GET} return the cartesian topology information that was
associated with a communicator by
\func{MPI\_CART\_CREATE}.
\begin{funcdef}{MPI\_CART\_GET(comm, maxdims, dims, periods, coords)}
\funcarg{\IN}{ comm}{ communicator with cartesian structure (handle)}
\funcarg{\IN}{ maxdims}{ length of vectors {\tt dims, periods}, and {\tt coords} in the calling program (integer)}
\funcarg{\OUT}{ dims}{ number of processes for each cartesian dimension (array of integer)}
\funcarg{\OUT}{ periods}{ periodicity (\const{true}/\const{false}) for each cartesian dimension (array of logical)}
\funcarg{\OUT}{ coords}{ coordinates of calling process in cartesian structure (array of integer)}
\end{funcdef}
\mpibind{MPI\_Cart\_get(MPI\_Comm~comm, int~maxdims, int~*dims, int~*periods, int~*coords)}
\mpifbind{MPI\_CART\_GET(COMM, MAXDIMS, DIMS, PERIODS, COORDS, IERROR)\fargs INTEGER COMM, MAXDIMS, DIMS(*), COORDS(*), IERROR \\ LOGICAL PERIODS(*)}
%\discuss{
%The inquiries for general graphs and for cartesian topology information is thus
%split up into two calls.
%The reason is that otherwise the user might not know a safe
%value for the dimension of vectors that will store the graph or cartesian grid
%description.
%}
\begin{funcdef}{MPI\_CART\_RANK(comm, coords, rank)}
\funcarg{\IN}{ comm}{ communicator with cartesian structure (handle)}
\funcarg{\IN}{ coords}{ integer array (of size {\tt ndims}) specifying the cartesian coordinates of a process }
\funcarg{\OUT}{ rank}{ rank of specified process (integer)}
\end{funcdef}
\mpibind{MPI\_Cart\_rank(MPI\_Comm~comm, int~*coords, int~*rank)}
\mpifbind{MPI\_CART\_RANK(COMM, COORDS, RANK, IERROR)\fargs INTEGER COMM, COORDS(*), RANK, IERROR}
For a process group with cartesian structure, the function
\func{MPI\_CART\_RANK} translates the logical process coordinates to process
ranks as they are used by the point-to-point routines.
For dimension {\tt i} with {\tt periods(i) = true}, if the coordinate,
{\tt coords(i)}, is out of range, that is, {\tt coords(i) $<$ 0} or
{\tt coords(i) $\geq$ dims(i)}, it is shifted back to the interval
\linebreak
{\tt 0 $\leq$ coords(i) $<$ dims(i)} automatically. Out-of-range
coordinates are erroneous for non-periodic dimensions.
\begin{funcdef}{MPI\_CART\_COORDS(comm, rank, maxdims, coords)}
\funcarg{\IN}{ comm}{ communicator with cartesian structure (handle)}
\funcarg{\IN}{ rank}{ rank of a process within group of \mpiarg{comm} (integer)}
\funcarg{\IN}{ maxdims}{ length of vector {\tt coord} in the calling program (integer)}
\funcarg{\OUT}{ coords}{ integer array (of size {\tt ndims}) containing the cartesian coordinates of specified process (integer)}
\end{funcdef}
\mpibind{MPI\_Cart\_coords(MPI\_Comm~comm, int~rank, int~maxdims, int~*coords)}
\mpifbind{MPI\_CART\_COORDS(COMM, RANK, MAXDIMS, COORDS, IERROR)\fargs INTEGER COMM, RANK, MAXDIMS, COORDS(*), IERROR}
The inverse mapping, rank-to-coordinates translation, is provided by
\func{MPI\_CART-} \linebreak \func{\_COORDS}. %mansplit
\begin{funcdef}{MPI\_GRAPH\_NEIGHBORS\_COUNT(comm, rank, nneighbors)}
\funcarg{\IN}{ comm}{ communicator with graph topology (handle)}
\funcarg{\IN}{ rank}{ rank of process in group of \mpiarg{comm} (integer)}
\funcarg{\OUT}{ nneighbors}{ number of neighbors of specified process (integer)}
\end{funcdef}
\mpibind{MPI\_Graph\_neighbors\_count(MPI\_Comm~comm, int~rank, int~*nneighbors)}
\mpifbind{MPI\_GRAPH\_NEIGHBORS\_COUNT(COMM, RANK, NNEIGHBORS, IERROR)\fargs INTEGER COMM, RANK, NNEIGHBORS, IERROR}
\func{MPI\_GRAPH\_NEIGHBORS\_COUNT} and \func{MPI\_GRAPH\_NEIGHBORS} provide
adjacency information for a general graph topology.
\begin{funcdef}{MPI\_GRAPH\_NEIGHBORS(comm, rank, maxneighbors, neighbors)}
\funcarg{\IN}{ comm}{ communicator with graph topology (handle)}
\funcarg{\IN}{ rank}{ rank of process in group of \mpiarg{comm} (integer)}
\funcarg{\IN}{ maxneighbors}{ size of array \mpiarg{neighbors} (integer)}
\funcarg{\OUT}{ neighbors}{ ranks of processes that are neighbors to specified process (array of integer)}
\end{funcdef}
\mpibind{MPI\_Graph\_neighbors(MPI\_Comm~comm, int~rank, int~maxneighbors, int~*neighbors)}
\mpifbind{MPI\_GRAPH\_NEIGHBORS(COMM, RANK, MAXNEIGHBORS, NEIGHBORS, IERROR)\fargs INTEGER COMM, RANK, MAXNEIGHBORS, NEIGHBORS(*), IERROR}
\begin{example} { \rm
\label{topol-exC}
Suppose that {\tt comm} is a communicator with a
shuffle-exchange topology. The group has $2^n$ members.
Each process is labeled by $a_1 , \ldots, a_n$ with $a_i \in
\{0,1\}$, and has three neighbors:
exchange($a_1 , \ldots, a_n ) = a_1 ,\ldots, a_{n-1}, \bar{a}_n$
($\bar{a} =
1-a$), shuffle($a_1 , \ldots, a_n )= a_2 , \ldots,
a_{n}, a_1$, and unshuffle($a_1 , \ldots, a_n ) = a_n , a_1 , \ldots , a_{n-1}$.
The graph adjacency list is illustrated below for $n=3$.
\\[3.0ex]
\begin{centering}
\begin{tabular}{|cc|ccc|}
\hline
\multicolumn{2}{|c|}{\bf node}&{\bf exchange}&{\bf shuffle}&{\bf unshuffle}\\
& & neighbors(1) & neighbors(2) & neighbors(3) \\
\hline
0 & (000) & 1 & 0 & 0\\
1 & (001) & 0 & 2 & 4\\
2 & (010) & 3 & 4 & 1\\
3 & (011) & 2 & 6 & 5\\
4 & (100) & 5 & 1 & 2\\
5 & (101) & 4 & 3 & 6\\
6 & (110) & 7 & 5 & 3\\
7 & (111) & 6 & 7 & 7\\
\hline
\end{tabular}
\end{centering}
\vspace{.5cm}
Suppose that the communicator {\tt comm} has this topology associated with it.
The following code fragment cycles through the three types of neighbors
and performs an appropriate permutation for each.
\begin{verbatim}
C assume: each process has stored a real number A.
C extract neighborhood information
CALL MPI_COMM_RANK(comm, myrank, ierr)
CALL MPI_GRAPH_NEIGHBORS(comm, myrank, 3, neighbors, ierr)
C perform exchange permutation
CALL MPI_SENDRECV_REPLACE(A, 1, MPI_REAL, neighbors(1), 0,
+ neighbors(1), 0, comm, status, ierr)
C perform shuffle permutation
CALL MPI_SENDRECV_REPLACE(A, 1, MPI_REAL, neighbors(2), 0,
+ neighbors(3), 0, comm, status, ierr)
C perform unshuffle permutation
CALL MPI_SENDRECV_REPLACE(A, 1, MPI_REAL, neighbors(3), 0,
+ neighbors(2), 0, comm, status, ierr)
\end{verbatim}
} \end{example}
\subsection{Cartesian Shift Coordinates}
\label{subsec:topol-shift}
If the process topology is a cartesian structure, an \func{MPI\_SENDRECV}
operation is likely to be used along a coordinate direction to perform a shift
of data. As input, \func{MPI\_SENDRECV} takes the rank of a source process
for the receive, and the rank of a destination process for the send. If the
function \func{MPI\_CART\_SHIFT} is called for a cartesian process group, it
provides the calling process with the above identifiers, which then can be
passed to \func{MPI\_SENDRECV}. The user specifies the coordinate direction
and the size of the step (positive or negative). The function is local.
\begin{funcdef}{MPI\_CART\_SHIFT(comm, direction, disp, rank\_source, rank\_dest)}
\funcarg{\IN}{ comm}{ communicator with cartesian structure (handle)}
\funcarg{\IN}{ direction}{ coordinate dimension of shift (integer)}
\funcarg{\IN}{ disp}{ displacement ($> 0$: upwards shift, $< 0$: downwards shift) (integer)}
\funcarg{\OUT}{ rank\_source}{ rank of source process (integer)}
\funcarg{\OUT}{ rank\_dest}{ rank of destination process (integer)}
\end{funcdef}
\mpibind{MPI\_Cart\_shift(MPI\_Comm~comm, int~direction, int~disp, int~*rank\_source, int~*rank\_dest)}
\mpifbind{MPI\_CART\_SHIFT(COMM, DIRECTION, DISP, RANK\_SOURCE, RANK\_DEST, IERROR)\fargs INTEGER COMM, DIRECTION, DISP, RANK\_SOURCE, RANK\_DEST, IERROR}
\snir
The \mpiarg{direction} argument indicates the dimension of the shift, i.e.,
the coordinate whose value is modified by the shift. The coordinates
are numbered from 0 to {\tt ndims-1}, where {\tt ndims} is the number
of dimensions.
\rins
Depending on the periodicity of the cartesian group in the specified
coordinate direction, \func{MPI\_CART\_SHIFT} provides the identifiers for a
circular or an end-off shift. In the case of an end-off shift,
the value \const{MPI\_PROC\_NULL} may be returned in \mpiarg{rank\_source} or
\mpiarg{rank\_dest},
indicating that the source or the destination for the shift is out of range.
\begin{example} { \rm
\label{topol-exD}
The communicator, \mpiarg{comm}, has a two-dimensional, periodic, cartesian
topology associated with it. A two-dimensional array of {\tt REAL}s is stored
one element per process, in variable {\tt A}. One wishes to skew this array,
by shifting column {\tt i} (vertically, i.e., along the column) by
{\tt i} steps.
\begin{verbatim}
....
C find process rank
      CALL MPI_COMM_RANK(comm, rank, ierr)
C find cartesian coordinates
CALL MPI_CART_COORDS(comm, rank, maxdims, coords, ierr)
C compute shift source and destination
CALL MPI_CART_SHIFT(comm, 0, coords(2), source, dest, ierr)
C skew array
CALL MPI_SENDRECV_REPLACE(A, 1, MPI_REAL, dest, 0, source, 0, comm,
+ status, ierr)
\end{verbatim}
} \end{example}
\snir
\begin{users}
In Fortran, the dimension indicated by \const{DIRECTION
= i} has
\const{DIMS(i+1)} nodes, where
\const{DIMS} is the array that
was used to create the grid. In C, the dimension
indicated by \const{direction = i} is the dimension specified by
\const{dims[i]}.
\end{users}
\rins
\subsection{Partitioning of Cartesian structures}
\label{subsec:topol-part}
\begin{funcdef}{MPI\_CART\_SUB(comm, remain\_dims, newcomm)}
\funcarg{\IN}{ comm}{ communicator with cartesian structure (handle)}
\funcarg{\IN}{ remain\_dims}{ the {\tt i}th entry of \mpiarg{remain\_dims} specifies whether the \linebreak {\tt i}th dimension is kept in the subgrid ({\tt true}) or is drop\-ped ({\tt false}) (logical vector)}
\funcarg{\OUT}{ newcomm}{ communicator containing the subgrid that includes the calling process (handle)}
\end{funcdef}
\mpibind{MPI\_Cart\_sub(MPI\_Comm~comm, int~*remain\_dims, MPI\_Comm~*newcomm)}
\mpifbind{MPI\_CART\_SUB(COMM, REMAIN\_DIMS, NEWCOMM, IERROR)\fargs INTEGER COMM, NEWCOMM, IERROR \\ LOGICAL REMAIN\_DIMS(*)}
If a cartesian topology has been created with \func{MPI\_CART\_CREATE}, the
function \linebreak \func{MPI\_CART\_SUB} can be used to partition the
communicator group into subgroups that form lower-dimensional cartesian
subgrids, and to build for each subgroup a communicator with the associated
subgrid cartesian topology. (This function is closely related to
\mpifunc{MPI\_COMM\_SPLIT}.)
\begin{example} {\rm
\label{topol-exE}
Assume that \func{MPI\_CART\_CREATE}\verb+(..., comm)+ has defined a
$(2 \times 3 \times 4)$ grid. Let {\tt remain\_dims = (true, false, true)}.
Then a call to
\begin{verbatim}
MPI_CART_SUB(comm, remain_dims, comm_new)
\end{verbatim}
will create three communicators, each with eight processes
in a $2 \times 4$ cartesian
topology. If {\tt remain\_dims = (false, false, true)} then the call to
\func{MPI\_CART\_SUB(comm, remain\_dims, comm\_new)}
will create six non-overlapping
communicators, each with four processes,
in a one-dimensional cartesian topology.
} \end{example}
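The partition in this example can be checked by enumeration. The sketch below (plain Python, not the \MPI/ call; the helper name is illustrative) groups the ranks of a row-major cartesian grid by their dropped coordinates, which is exactly how \func{MPI\_CART\_SUB} partitions the group.

```python
from itertools import product

def cart_sub_groups(dims, remain_dims):
    """Group the ranks of a row-major cartesian grid by the coordinates
    in the dropped dimensions, modeling the partition of MPI_CART_SUB."""
    groups = {}
    for rank, coords in enumerate(product(*(range(d) for d in dims))):
        dropped = tuple(c for c, keep in zip(coords, remain_dims) if not keep)
        groups.setdefault(dropped, []).append(rank)
    return list(groups.values())
```

With {\tt dims = (2,3,4)} and {\tt remain\_dims = (true, false, true)} this yields three groups of eight ranks; with {\tt remain\_dims = (false, false, true)} it yields six groups of four.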
\subsection{Low-level topology functions}
\label{subsec:topol-lowlevel}
%
The two additional functions introduced in this section can be used to
implement all other topology functions. In general they will not be
called by the user directly, unless he or she is creating additional
virtual topology capabilities beyond those provided by \MPI/.
\begin{funcdef}{MPI\_CART\_MAP(comm, ndims, dims, periods, newrank)}
\funcarg{\IN}{ comm}{ input communicator (handle)}
\funcarg{\IN}{ ndims}{ number of dimensions of cartesian structure (integer)}
\funcarg{\IN}{ dims}{ integer array of size {\tt ndims} specifying the number of processes in each coordinate direction }
\funcarg{\IN }{ periods}{ logical array of size {\tt ndims} specifying the periodicity specification in each coordinate direction}
\funcarg{\OUT}{ newrank}{ reordered rank of the calling process; \const{MPI\_UNDEFINED} if calling process does not belong to grid (integer)}
\end{funcdef}
\mpibind{MPI\_Cart\_map(MPI\_Comm~comm, int~ndims, int~*dims, int~*periods, int~*newrank)}
\mpifbind{MPI\_CART\_MAP(COMM, NDIMS, DIMS, PERIODS, NEWRANK, IERROR)\fargs INTEGER COMM, NDIMS, DIMS(*), NEWRANK, IERROR \\ LOGICAL PERIODS(*)}
\func{MPI\_CART\_MAP}
computes an ``optimal'' placement for the calling process on the
physical machine. A possible implementation of this function is to always
return the rank of the calling process, that is, not to perform any reordering.
\begin{implementors}
The function \mpifunc{MPI\_CART\_CREATE(comm, ndims, dims,
periods, reorder, comm\_cart)}, with {\tt reorder = true} can be implemented by
calling
\mpifunc{MPI\_CART\_MAP(comm, ndims, dims, periods, newrank)}, then calling
\linebreak
\mpifunc{MPI\_COMM\_SPLIT(comm, color, key, comm\_cart)}, with
{\tt color = 0} if {\tt newrank $\neq$
\linebreak
MPI\_UNDEFINED}, {\tt color = MPI\_UNDEFINED} otherwise,
and {\tt key = newrank}.
The function \func{MPI\_CART\_SUB(comm, remain\_dims, comm\_new)} can be
implemented by a call to \func{MPI\_COMM\_SPLIT(comm, color, key, comm\_new)},
using a single number encoding of the lost dimensions as {\tt color} and a
single number encoding of the preserved dimensions as {\tt key}.
All other cartesian topology functions can be implemented locally, using
the topology information that is cached with the communicator.
\end{implementors}
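The single-number encodings of the lost and preserved dimensions mentioned in the advice above can be sketched as follows (a hypothetical helper in plain Python; it assumes row-major coordinates, and is one possible encoding, not a mandated one).

```python
def sub_color_key(coords, dims, remain_dims):
    """Encode the dropped coordinates as `color` and the kept ones as
    `key`.  MPI_COMM_SPLIT then groups processes that share all dropped
    coordinates, and orders each subgroup in row-major subgrid order."""
    color = key = 0
    for c, d, keep in zip(coords, dims, remain_dims):
        if keep:
            key = key * d + c      # preserved dimension contributes to key
        else:
            color = color * d + c  # dropped dimension contributes to color
    return color, key
```

On the $2 \times 3 \times 4$ grid with {\tt remain\_dims = (true, false, true)}, the process at coordinates {\tt (1,2,3)} obtains {\tt color = 2} (its dropped middle coordinate) and {\tt key = 7} (its row-major position within the $2 \times 4$ subgrid).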
The corresponding new function for general graph structures is as follows.
\begin{funcdef}{MPI\_GRAPH\_MAP(comm, nnodes, index, edges, newrank)}
\funcarg{\IN}{ comm}{ input communicator (handle)}
\funcarg{\IN}{ nnodes}{ number of graph nodes (integer)}
\funcarg{\IN}{ index}{integer array specifying the graph structure, see \linebreak \mpifunc{MPI\_GRAPH\_CREATE}}
\funcarg{\IN}{ edges}{integer array specifying the graph structure}
\funcarg{\OUT}{ newrank}{ reordered rank of the calling process; \const{MPI\_UNDEFINED} if the calling process does not belong to graph (integer)}
\end{funcdef}
\mpibind{MPI\_Graph\_map(MPI\_Comm~comm, int~nnodes, int~*index, int~*edges, int~*newrank)}
\mpifbind{MPI\_GRAPH\_MAP(COMM, NNODES, INDEX, EDGES, NEWRANK, IERROR)\fargs INTEGER COMM, NNODES, INDEX(*), EDGES(*), NEWRANK, IERROR}
\begin{implementors}
The function \mpifunc{MPI\_GRAPH\_CREATE(comm, nnodes, index, edges,
reorder, comm\_graph)},
with {\tt reorder = true} can be implemented by calling
\mpifunc{MPI\_GRAPH\_MAP(comm, nnodes, index, edges, newrank)},
then calling
\linebreak
\mpifunc{MPI\_COMM\_SPLIT(comm, color, key, comm\_graph)}, with {\tt color = 0}
if {\tt newrank $\neq$
\linebreak
MPI\_UNDEFINED}, {\tt color = MPI\_UNDEFINED}
otherwise, and {\tt key = newrank}.
All other graph topology functions can be implemented locally, using the
topology information that is cached with the communicator.
\end{implementors}
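The {\tt index}/{\tt edges} representation referred to in the argument descriptions above can be illustrated with a small model (plain Python; the example graph and the helper name are illustrative, not taken from the text). In this encoding, {\tt index(i)} holds the cumulative number of neighbors of nodes {\tt 0} through {\tt i}, and {\tt edges} lists the neighbor ranks of all nodes consecutively.

```python
def graph_neighbors(index, edges, node):
    """Extract the neighbor list of `node` from the cumulative-degree
    index array and the flattened edges array used to describe a
    general graph topology."""
    start = index[node - 1] if node > 0 else 0
    return edges[start:index[node]]
```

For a four-node graph in which node 0 is adjacent to nodes 1 and 3, node 1 to node 0, node 2 to node 3, and node 3 to nodes 0 and 2, one has {\tt index = (2,3,4,6)} and {\tt edges = (1,3,0,3,0,2)}.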
%\discuss{
%An alternative implementation strategy is to use functions
%\mpifunc{MPI\_CART\_MAP} and \mpifunc{MPI\_GRAPH\_MAP} that return a group
%structure (i.e., the new ranks of all processes), next use
%\mpifunc{MPI\_COMM\_CREATE} to generate the new communicator.
%}
\section{An Application Example}
\label{topol-applic-example}
%
\begin{example} { \rm
\label{topol-exF}
The example in figure~\ref{poisson} shows how the grid definition and
inquiry functions can be used in an application program. A partial
differential equation, for instance the Poisson equation, is to be
solved on a rectangular domain.
First, the processes organize themselves in a two-dimensional
structure. Each process then inquires about the ranks of its
neighbors in the four directions (up, down, right, left).
The numerical problem is solved by an iterative method, the details
of which are hidden in the subroutine {\tt relax}.
In each relaxation step each process computes new values for the solution grid
function at all points owned by the process. Then the values at inter-process
boundaries have to be exchanged with neighboring processes. For example, the
exchange subroutine might contain a call like
\func{MPI\_SEND(...,neigh\_rank(1),...)} to send updated values to the
left-hand neighbor {\tt (i-1,j)}. } \end{example}
\begin{figure}
{\tt
\begin{tabbing}
=====\===\======\kill
%
\>integer ndims, num\_neigh \\
\>logical reorder \\
\>parameter (ndims=2, num\_neigh=4, reorder=.true.) \\
\>integer comm, comm\_cart, dims(ndims), neigh\_def(ndims), ierr \\
\>integer neigh\_rank(num\_neigh), own\_position(ndims), i, j \\
\>logical periods(ndims) \\
\>real$\ast$8 u(0:101,0:101), f(0:101,0:101) \\
\>data dims / ndims $\ast$ 0 / \\
\>comm = MPI\_COMM\_WORLD \\
C \hspace{5mm} Set process grid size and periodicity \\
\>call MPI\_DIMS\_CREATE(comm, ndims, dims,ierr) \\
\>periods(1) = .TRUE. \\
\>periods(2) = .TRUE. \\
C \hspace{5mm} Create a grid structure in WORLD group and inquire
about own position \\
\>call MPI\_CART\_CREATE (comm, ndims, dims, periods, reorder,
comm\_cart,ierr) \\
\>call MPI\_CART\_GET (comm\_cart, ndims, dims, periods, own\_position,ierr)
\\
C \hspace{5mm} Look up the ranks for the neighbors.
Own process coordinates are (i,j). \\
C \hspace{5mm} Neighbors are (i-1,j), (i+1,j), (i,j-1), (i,j+1) \\
\>i = own\_position(1) \\
\>j = own\_position(2) \\
\>neigh\_def(1) = i-1 \\
\>neigh\_def(2) = j \\
\>call MPI\_CART\_RANK (comm\_cart, neigh\_def, neigh\_rank(1),ierr) \\
\>neigh\_def(1) = i+1 \\
\>neigh\_def(2) = j \\
\>call MPI\_CART\_RANK (comm\_cart, neigh\_def, neigh\_rank(2),ierr) \\
\>neigh\_def(1) = i \\
\>neigh\_def(2) = j-1 \\
\>call MPI\_CART\_RANK (comm\_cart, neigh\_def, neigh\_rank(3),ierr) \\
\>neigh\_def(1) = i \\
\>neigh\_def(2) = j+1 \\
\>call MPI\_CART\_RANK (comm\_cart, neigh\_def, neigh\_rank(4),ierr) \\
C \hspace{5mm} Initialize the grid functions and start the
iteration \\
\>call init (u, f) \\
\>do 10 it=1,100 \\
\>\>call relax (u, f) \\
C \hspace{5mm} Exchange data with neighbor processes \\
\>\>call exchange (u, comm\_cart, neigh\_rank, num\_neigh) \\
10 \>continue \\
\>call output (u) \\
\>end \\
\end{tabbing}
}
\caption{Set-up of process structure for two-dimensional
parallel Poisson solver.}
\label{poisson}
\end{figure}
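The neighbor lookup performed with \func{MPI\_CART\_RANK} in the figure can be modeled directly: on a periodic two-dimensional grid with row-major ranks, out-of-range coordinates simply wrap around. The sketch below (plain Python, not \MPI/; the helper name is illustrative) computes the ranks of the four neighbors of the process at coordinates {\tt (i,j)}.

```python
def neighbor_ranks(i, j, dims):
    """Ranks of the neighbors (i-1,j), (i+1,j), (i,j-1), (i,j+1) on a
    periodic dims[0] x dims[1] grid with row-major ranks, modeling the
    four MPI_CART_RANK calls in the figure."""
    def rank(a, b):
        # periodic wraparound in both dimensions, then row-major rank
        return (a % dims[0]) * dims[1] + (b % dims[1])
    return [rank(i - 1, j), rank(i + 1, j), rank(i, j - 1), rank(i, j + 1)]
```

For example, on a periodic $3 \times 4$ grid the process at coordinates {\tt (0,0)} has neighbor ranks 8, 4, 3, and 1.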
%
%-------------- end of LaTeX source of topology chapter ----------------