The default matrix representation within PETSc is the general sparse AIJ format (also called the Yale sparse matrix format or compressed sparse row format, CSR). This section discusses tips for efficiently using this matrix format for large-scale applications. Additional formats (such as block compressed row and block diagonal storage, which are generally much more efficient for problems with multiple degrees of freedom per node) are discussed below. Beginning users need not concern themselves initially with such details and may wish to proceed directly to Section Basic Matrix Operations . However, when an application code progresses to the point of tuning for efficiency and/or generating timing results, it is crucial to read this information.
In the PETSc AIJ matrix formats, we store the nonzero elements by rows, along with an array of corresponding column numbers and an array of pointers to the beginning of each row. Note that the diagonal matrix entries are stored with the rest of the nonzeros (not separately).
To create a sequential AIJ sparse matrix, A,
with m rows and n columns,
one uses the command
ierr = MatCreateSeqAIJ(PETSC_COMM_SELF,int m,int n,int nz,int *nnz,Mat *A);
where nz or nnz can be used to preallocate matrix memory, as discussed below. The user can set nz=0 and nnz=PETSC_NULL for PETSc to control all matrix memory allocation.
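For instance, the following sketch (not taken from a PETSc example; the matrix order n=100 and the tridiagonal values are assumptions chosen only for illustration) assembles a tridiagonal matrix, using the scalar nz argument to preallocate three nonzeros per row:

   Mat         A;
   int         ierr, i, n = 100, col[3];
   PetscScalar value[3] = {-1.0, 2.0, -1.0};

   ierr = MatCreateSeqAIJ(PETSC_COMM_SELF,n,n,3,PETSC_NULL,&A);CHKERRQ(ierr);
   for (i=1; i<n-1; i++) {        /* interior rows: three nonzeros each */
     col[0] = i-1; col[1] = i; col[2] = i+1;
     ierr = MatSetValues(A,1,&i,3,col,value,INSERT_VALUES);CHKERRQ(ierr);
   }
   i = 0;   col[0] = 0;   col[1] = 1;    /* first row: two nonzeros */
   ierr = MatSetValues(A,1,&i,2,col,&value[1],INSERT_VALUES);CHKERRQ(ierr);
   i = n-1; col[0] = n-2; col[1] = n-1;  /* last row: two nonzeros */
   ierr = MatSetValues(A,1,&i,2,col,value,INSERT_VALUES);CHKERRQ(ierr);
   ierr = MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
   ierr = MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);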
The sequential and parallel AIJ matrix storage formats by default employ I-nodes (identical nodes) when possible. We search for consecutive rows with the same nonzero structure, thereby reusing matrix information for increased efficiency. Related options database keys are -mat_aij_no_inode (do not use I-nodes) and -mat_aij_inode_limit <limit> (set I-node limit (max limit=5)). Note that problems with a single degree of freedom per grid node will automatically not use I-nodes.
By default the internal data representation for the AIJ formats employs zero-based indexing. For compatibility with standard Fortran storage, thus enabling use of external Fortran software packages such as SPARSKIT, the option -mat_aij_oneindex enables one-based indexing, where the stored row and column indices begin at one, not zero. All user calls to PETSc routines, regardless of this option, use zero-based indexing.
The dynamic process of allocating new memory and copying from the old storage to the new is intrinsically very expensive. Thus, to obtain good performance when assembling an AIJ matrix, it is crucial to preallocate the memory needed for the sparse matrix. The user has two choices for preallocating matrix memory via MatCreateSeqAIJ().
One can use the scalar nz to specify the expected number of nonzeros for each row. This is generally fine if the number of nonzeros per row is roughly the same throughout the matrix (or as a quick and easy first step for preallocation). If one underestimates the actual number of nonzeros in a given row, then during the assembly process PETSc will automatically allocate additional needed space. However, this extra memory allocation can slow the computation.
If different rows have very different numbers of nonzeros, one
should attempt to indicate (nearly) the exact number of elements
intended for the various rows with the optional array nnz of
length m, where m is the number of rows, for example
int nnz[m];
nnz[0] = <nonzeros in row 0>
nnz[1] = <nonzeros in row 1>
....
nnz[m-1] = <nonzeros in row m-1>
In this case, the assembly process will require no additional memory allocations if the nnz estimates are correct. If, however, the nnz estimates are incorrect, PETSc will automatically obtain the additional needed space, at a slight loss of efficiency.
Using the array nnz to preallocate memory is especially important for efficient matrix assembly if the number of nonzeros varies considerably among the rows. One can generally set nnz either by knowing in advance the problem structure (e.g., the stencil for finite difference problems on a structured grid) or by precomputing the information by using a segment of code similar to that for the regular matrix assembly. The overhead of determining the nnz array will be quite small compared with the overhead of the inherently expensive mallocs and moves of data that are needed for dynamic allocation during matrix assembly.
Thus, when assembling a sparse matrix with very different numbers of nonzeros in various rows, one could proceed as follows for finite difference methods: allocate the integer array nnz, loop over the grid counting the expected number of nonzeros for the row(s) associated with each grid point, create the sparse matrix via MatCreateSeqAIJ() (or an alternative), and finally loop over the grid again, generating the matrix entries and inserting them with MatSetValues().
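A minimal sketch of this two-pass approach, assuming a one-dimensional grid of n points and a standard three-point stencil (both assumptions made only for illustration):

   Mat  A;
   int  ierr, i, n = 100, *nnz;

   /* first pass: count the nonzeros expected in each row */
   ierr = PetscMalloc(n*sizeof(int),&nnz);CHKERRQ(ierr);
   for (i=0; i<n; i++) nnz[i] = (i == 0 || i == n-1) ? 2 : 3;

   /* create the matrix with the precomputed preallocation */
   ierr = MatCreateSeqAIJ(PETSC_COMM_SELF,n,n,0,nnz,&A);CHKERRQ(ierr);
   ierr = PetscFree(nnz);CHKERRQ(ierr);

   /* second pass: loop over the grid, inserting entries with MatSetValues(),
      then call MatAssemblyBegin() and MatAssemblyEnd() */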
For (vertex-based) finite element type calculations, an analogous procedure is as follows: allocate the integer array nnz, loop over the vertices computing the number of neighboring vertices (which determines the number of nonzeros for the corresponding matrix row(s)), create the sparse matrix, and then loop over the elements, generating matrix entries and inserting them with MatSetValues().
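A sketch of this procedure is given below, assuming one unknown per vertex; the vertex count and neighbor counts are assumed to come from the application's mesh data structure, and the routine name is hypothetical:

   /* Create a preallocated sequential AIJ matrix for a mesh with one unknown per
      vertex; nneighbors[i] is the number of vertices adjacent to vertex i. */
   int CreateVertexMatrix(int nvertices,const int nneighbors[],Mat *A)
   {
     int i, ierr, *nnz;

     ierr = PetscMalloc(nvertices*sizeof(int),&nnz);CHKERRQ(ierr);
     for (i=0; i<nvertices; i++) nnz[i] = 1 + nneighbors[i];  /* the vertex itself plus its neighbors */
     ierr = MatCreateSeqAIJ(PETSC_COMM_SELF,nvertices,nvertices,0,nnz,A);CHKERRQ(ierr);
     ierr = PetscFree(nnz);CHKERRQ(ierr);
     /* the caller then loops over the elements, adding each element's
        contributions with MatSetValues(...,ADD_VALUES) */
     return 0;
   }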
The -log_info option causes the routines
MatAssemblyBegin() and MatAssemblyEnd() to print
information about the success of the preallocation. Consider the
following example for the MATSEQAIJ matrix format:
MatAssemblyEnd_SeqAIJ:Matrix size 100 X 100; storage space: 2000 unneeded, 1000 used
MatAssemblyEnd_SeqAIJ:Number of mallocs during MatSetValues is 0
The first line indicates that the user preallocated 3000 spaces but only 1000 were used. The second line indicates that the user preallocated enough space so that PETSc did not have to internally allocate additional space (an expensive operation). In the next example the user did not preallocate sufficient space, as indicated by the fact that the number of mallocs is very large (bad for efficiency):
MatAssemblyEnd_SeqAIJ:Matrix size 1000 X 1000; storage space: 47 unneeded, 100000 used
MatAssemblyEnd_SeqAIJ:Number of mallocs during MatSetValues is 40000
Although at first glance such procedures for determining the matrix structure in advance may seem unusual, they are actually very efficient because they alleviate the need for dynamic construction of the matrix data structure, which can be very expensive.
Parallel sparse matrices with the AIJ
format can be created with the command
ierr = MatCreateMPIAIJ(MPI_Comm comm,int m,int n,int M,int N,int d_nz,int *d_nnz,int o_nz,int *o_nnz,Mat *A);
A is the newly created matrix, while the arguments m, n, M, and N indicate the number of local rows and columns and the number of global rows and columns, respectively. Either the local or global parameters can be replaced with PETSC_DECIDE, so that PETSc will determine them. The matrix is stored with a fixed number of rows on each processor, given by m, or determined by PETSc if m is PETSC_DECIDE.
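For example, the following sketch creates a parallel AIJ matrix and lets PETSc choose the local sizes; the global size N=1000 and the per-row estimates of 5 diagonal and 2 off-diagonal nonzeros are assumptions made only for illustration:

   Mat A;
   int ierr, N = 1000;

   ierr = MatCreateMPIAIJ(PETSC_COMM_WORLD,PETSC_DECIDE,PETSC_DECIDE,N,N,5,PETSC_NULL,2,PETSC_NULL,&A);CHKERRQ(ierr);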
If PETSC_DECIDE is not used for the arguments m and n, then the user must ensure that they are chosen to be compatible with the vectors. To do this, one first considers the matrix-vector product y = A x. The m that is used in the matrix creation routine MatCreateMPIAIJ() must match the local size used in the vector creation routine VecCreateMPI() for y. Likewise, the n used must match that used as the local size in VecCreateMPI() for x.
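A sketch of this compatibility requirement, assuming the local sizes are left to PETSc and then queried with MatGetLocalSize() (the global sizes and preallocation values are again assumptions for illustration):

   Mat A;
   Vec x, y;
   int ierr, m, n, M = 1000, N = 1000;

   ierr = MatCreateMPIAIJ(PETSC_COMM_WORLD,PETSC_DECIDE,PETSC_DECIDE,M,N,5,PETSC_NULL,2,PETSC_NULL,&A);CHKERRQ(ierr);
   ierr = MatGetLocalSize(A,&m,&n);CHKERRQ(ierr);               /* the local sizes PETSc chose */
   ierr = VecCreateMPI(PETSC_COMM_WORLD,n,N,&x);CHKERRQ(ierr);  /* local size of x matches n    */
   ierr = VecCreateMPI(PETSC_COMM_WORLD,m,M,&y);CHKERRQ(ierr);  /* local size of y matches m    */
   /* ... assemble A ... */
   ierr = MatMult(A,x,y);CHKERRQ(ierr);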
The user must set d_nz=0, o_nz=0, d_nnz=PETSC_NULL, and o_nnz=PETSC_NULL for PETSc to control dynamic allocation of matrix memory space. Analogous to nz and nnz for the routine MatCreateSeqAIJ(), these arguments optionally specify nonzero information for the diagonal ( d_nz and d_nnz) and off-diagonal ( o_nz and o_nnz) parts of the matrix. For a square global matrix, we define each processor's diagonal portion to be its local rows and the corresponding columns (a square submatrix); each processor's off-diagonal portion encompasses the remainder of the local matrix (a rectangular submatrix). The rank in the MPI communicator determines the absolute ordering of the blocks. That is, the process with rank 0 in the communicator given to MatCreateMPIAIJ contains the top rows of the matrix; the ith process in that communicator contains the ith block of the matrix.
As discussed above, preallocation of memory is critical for achieving good performance during matrix assembly, as this reduces the number of allocations and copies required. We present an example for three processors to indicate how this may be done for the MATMPIAIJ matrix format. Consider the 8 by 8 matrix, which is partitioned by default with three rows on the first processor, three on the second and two on the third:

            1  2  0  |  0  3  0  |  0  4
 Proc0      0  5  6  |  7  0  0  |  8  0
            9  0 10  | 11  0  0  | 12  0
          -------------------------------------
           13  0 14  | 15 16 17  |  0  0
 Proc1      0 18  0  | 19 20 21  |  0  0
            0  0  0  | 22 23  0  | 24  0
          -------------------------------------
 Proc2     25 26 27  |  0  0 28  | 29  0
           30  0  0  | 31 32 33  |  0 34
The ``diagonal'' submatrix, d, on the first processor is given by

    1  2  0
    0  5  6
    9  0 10
while the ``off-diagonal'' submatrix, o, is given by

    0  3  0  0  4
    7  0  0  8  0
   11  0  0 12  0
For the first processor one could set d_nz to 2 (since each row has 2 nonzeros) or, alternatively, set d_nnz to {2,2,2}. The o_nz could be set to 2 since each row of the o matrix has 2 nonzeros, or o_nnz could be set to {2,2,2}.
For the second processor the d submatrix is given by

   15 16 17
   19 20 21
   22 23  0
Thus, one could set d_nz to 3, since the maximum number of nonzeros in each row is 3, or alternatively one could set d_nnz to {3,3,2}, thereby indicating that the first two rows will have 3 nonzeros while the third has 2. The corresponding o submatrix for the second processor is

   13  0 14  0  0
    0 18  0  0  0
    0  0  0 24  0
so that one could set o_nz to 2 or o_nnz to {2,1,1}.
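A sketch of the corresponding creation call is given below; each process passes only the counts for its own local rows, exactly three processes are assumed, and the counts for the third processor are read off from the example matrix above:

   Mat A;
   int ierr, rank, m, *d_nnz, *o_nnz;
   int d0[3] = {2,2,2}, o0[3] = {2,2,2};   /* first processor:  rows 0-2 */
   int d1[3] = {3,3,2}, o1[3] = {2,1,1};   /* second processor: rows 3-5 */
   int d2[2] = {1,1},   o2[2] = {4,4};     /* third processor:  rows 6-7 */

   MPI_Comm_rank(PETSC_COMM_WORLD,&rank);
   if (rank == 0)      {m = 3; d_nnz = d0; o_nnz = o0;}
   else if (rank == 1) {m = 3; d_nnz = d1; o_nnz = o1;}
   else                {m = 2; d_nnz = d2; o_nnz = o2;}
   /* MatCreateMPIAIJ() is collective, so every process calls it */
   ierr = MatCreateMPIAIJ(PETSC_COMM_WORLD,m,m,8,8,0,d_nnz,0,o_nnz,&A);CHKERRQ(ierr);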
Note that the user never directly works with the d and o submatrices, except when preallocating storage space as indicated above. Also, the user need not preallocate exactly the correct amount of space; as long as a sufficiently close estimate is given, the high efficiency for matrix assembly will remain.
As described above, the option -log_info
will print information about the success of preallocation during
matrix assembly. For the MATMPIAIJ format, PETSc will also list
the number of elements owned by each processor that were generated
on a different processor. For example, the statements
[0]MatAssemblyBegin_MPIAIJ:Number of off processor values 10
[1]MatAssemblyBegin_MPIAIJ:Number of off processor values 7
[2]MatAssemblyBegin_MPIAIJ:Number of off processor values 5
indicate that very few values have been generated on different processors. On the other hand, the statements
[0]MatAssemblyBegin_MPIAIJ:Number of off processor values 100000
[1]MatAssemblyBegin_MPIAIJ:Number of off processor values 77777
indicate that many values have been generated on the ``wrong'' processors. This situation can be very inefficient, since the transfer of values to the ``correct'' processor is generally expensive. By using the command MatGetOwnershipRange() in application codes, the user should be able to generate most entries on the owning processor.
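For example, in the sketch below each process loops only over the rows it owns; as a trivial illustration only the diagonal entries are set, and A is assumed to have been created as above:

   int         i, ierr, rstart, rend;
   PetscScalar one = 1.0;

   ierr = MatGetOwnershipRange(A,&rstart,&rend);CHKERRQ(ierr);
   for (i=rstart; i<rend; i++) {
     /* each process inserts entries only in the rows it owns */
     ierr = MatSetValues(A,1,&i,1,&i,&one,INSERT_VALUES);CHKERRQ(ierr);
   }
   ierr = MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
   ierr = MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);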
Note: It is fine to generate some entries on the ``wrong'' processor. Often this can lead to cleaner, simpler, less buggy codes. One should never make code overly complicated in order to generate all values locally. Rather, one should organize the code in such a way that most values are generated locally.