PETSc: Misc: Threads and PETSc

With the advent of multicore machines as standard practice from laptops to supercomputers, the issue of a hybrid MPI-thread (MPI-OpenMP) "version" of PETSc is of interest. In this model, one MPI process per CPU (node) and several threads on that CPU (node) work together inside that MPI process. Many people throw around the term "hybrid MPI-OpenMP" as if it is a trivial change in the programming model. It is not -- major ramifications must be understood if such a hybrid approach is to be used successfully. Hybrid approaches can be developed in many ways that affect usability and performance.

The simplest model of PETSc with threads is to contain all the thread operations inside the Mat and Vec classes, leaving the user's programming model identical to what it is today. This model can be done in two ways by having Vec and Mat class implementations that use

OpenMP compiler directives to parallelize some of the methods
POSIX threads (pthread) calls to parallelize some of the methods.

In petsc-3.2 we have implemented approach 2 (pthreads). The corresponding classes are VECSEQPTHREAD and MATSEQAIJPTHREAD; these are very preliminary but can be explored. We found the code was never faster than pure MPI and cumbersome to use hence we have removed it.

A more complicated model of PETSc with threads, which would allow users to write threaded code that made PETSc calls, is not supported because PETSc is not currently thread-safe. Because the issues involved with toolkits and thread safety are complex, this short answer is almost meaningless. Thus, this page attempts to explain how threads and thread safety relate to PETSc. Note that we are discussing here only "software threads" as opposed to "hardware threads."

Threads are used in 2 main ways:

Loop-level compiler control. The C/C++/FORTRAN compiler manages the dispatching of threads at the beginning of a loop, where each thread is assigned non-overlapping selections from the loop. OpenMP, for example, defines standards (compiler directives) for indicating to compilers how they should "thread parallelize" loops.
User control. The programmer manages the dispatching of threads directly by assigning threads to tasks (e.g., a subroutine call). For example, consider POSIX threads (pthreads) or the user thread management in OpenMP.

Threads are merely streams of control and do not have any global data associated with them. Any global variables (e.g., common blocks in FORTRAN) are "shared" by all the threads; that is, any thread can access and change that data. In addition, any space allocated (e.g., in C with malloc or C++ with new) to which a thread has a reference can be read/changed by that thread. The only private data a thread has are the local variables in the subroutines that it has called (i.e., the stack for that thread) or local variables that one explicitly indicates are to be not shared in compiler directives.

In its simplest form, thread safety means that any memory (global or allocated) to which more than one thread has access, has some mechanism to ensure that the memory remains consistent when the various threads act upon it. This can be managed by simply associating a lock with each "memory" and making sure that each thread locks the memory before accessing it and unlocks when it has completed accessing the memory. In an object oriented library, rather than associating locks with individual data items, one can think about associating locks with objects; so that only a single thread can operate on an object at a time.

PETSc is not thread-safe for three reasons:

A few miscellaneous global variables - these may be fixed in later PETSc releases and are not a big deal.
A variety of global data structures that are used for profiling and error handling. These are not easily removed or modified to make thread-safe. The reason is that simply putting locks around all of these data accesses would have a major impact on performance, as the profiling data structures are being constantly updated by the running code. In some cases, for example updating the flop counters, locks could be avoided and something like atomic fetch and add operations could be used. It is possible to build such operations on all major systems, often using special two-step memory operations (e.g., load-link and store-conditional on MIPS and load-reservation on PowerXX). But this cannot be done in a portable, high-performance way, as such operations are not part of C, and there is no standardization on issuing asm code from within C source.
All the PETSc objects created during a simulation do not have locks associated with them. Again, the reason is performance; having always to lock an object before using it, and then unlocking it at the end would have a large impact on performance. Even with very inexpensive locks, there will still likely be a few "hot-spots" that kill performance. For example, if four threads share a matrix and are each calling MatSetValues(), this will be a bottleneck with those threads constantly fighting over the data structure inside the matrix object.

PETSc can be used in a limited way that is thread safe. So long as only one of the threads calls methods on objects, or calls profiled PETSc routines, PETSc will run fine. For example, one could use loop-level parallelism to evaluate a finite difference stencil on a grid. This is supported through the PETSc routines VecCreateShared(); see src/snes/examples/tutorials/ex5s.c. However, this is limited power.

Some concerns about a thread model for parallelism. A thread model for parallelism of numerical methods appears to be powerful for problems that can store their data in very simple (well controlled) data structures. For example, if field data is stored in a two-dimensional array, then each thread can be assigned a nonoverlapping slice of data to operate on. OpenMP makes managing much of this sort of thing reasonably straightforward.

When data must be stored in a more complicated opaque data structure (for example an unstructured grid or sparse matrix), it is more difficult to partition the data among the threads to prevent conflict and still achieve good performance. More difficult, but certainly not impossible. For these situations, perhaps it is more natural for each thread to maintain its own private data structure that is later merged into a common data structure. But to do this, one has to introduce a great deal of private state associated with the thread, i.e., it becomes more like a "light-weight process".

In conclusion, at least for the PETSc package, the concept of being thread-safe is not simple. It has major ramifications about its performance and how it would be used; it is not a simple matter of throwing a few locks around and then everything is honky-dory.

If you have any comments/brickbats on this summary, please direct them to [email protected]; we are interested in alternative viewpoints.