Applications and Algorithm Challenges for PetaFLOPS Computing

DRAFT - Feb. 1996

PetaFLOPS computing---computing at a rate of a million billion floating-point operations per second---is expected to be feasible within the next twenty years. Ensuring that what may be possible becomes real and usable presents major technical challenges. To provide a forum for discussion of those challenges, the workshop on ``Applications and Algorithm Challenges for PetaFLOPS Computing'' was held on August 14--23, 1995, in Bodega Bay, California.

The workshop was the second in a series of meetings that started in Pasadena, California, in 1994. The first workshop focused on the technologies needed for PetaFLOPS computing systems. Complementing and expanding on that initial effort, the 1995 workshop centered on innovative applications and algorithms for PetaFLOPS systems. Sponsored by five federal agencies---ARPA, DOE, NASA, NSA, and NSF---this workshop brought together a dynamic mix of established and emerging researchers from government, industry, academia, and national laboratories.

Goal

The primary goal of this workshop was to develop a firmer understanding of the types of applications petaFLOPS computing will make possible and of the barriers and technological research issues that must be addressed. A secondary goal was to provide a roadmap for algorithm and applications development over the next five, ten, and twenty years.

Objectives

The following objectives were identified to achieve these goals:

+ Develop Case Studies. Analyze applications case studies and develop performance models of applications areas that may be enabled by PetaFLOPS systems. Determine the algorithm and software requirements, and identify the challenges to be overcome.

+ Describe Open Problems in Algorithms. Develop specific descriptions of unsolved algorithmic problems. Devise possible plans of attack, with estimates of the importance of each problem.

+ Prepare an Applications Development Roadmap. Establish a multiyear program for research in algorithm and application technologies. The program should be divided into near-term (5-year), middle-term (10-year), and long-term (20-year) plans.

+ Recommend a Research Program. Prepare a detailed agenda of the technical skills and resources needed to solve the algorithmic issues in order to enable the various petaFLOPS applications identified. Prepare a rationale justifying the agenda, from both the political and the scientific viewpoint.

Applications Case Studies

As a first step toward establishing detailed case studies, the participants prepared an alphabetical list categorizing scientific and engineering disciplines that could benefit from access to a petaFLOPS computer.

Subsequently, for each discipline, specific requirements for such factors as memory, I/O, and parallelism were calculated and listed. The list is organized by runtime requirements at a sustained rate of 1 petaFLOPS.

Representative examples from the lists were then expanded into detailed case studies.

Algorithms

In some applications areas, algorithmic improvements have yielded gains comparable to the gains from faster hardware. Algorithms are extremely important for two other reasons:

  • Algorithms strongly constrain the choice of architectures and software environments.
  • Algorithms can partially compensate for architectural deficiencies.

Based on an analysis of the individual application case studies, we were able to prepare a list of algorithms that an effective petaFLOPS computer will need to handle:

  • block tridiagonal solvers - fusion
  • configuration interaction (CI) methods - computational chemistry
  • convolution algorithms - fusion
  • coupled clusters - computational chemistry
  • domain decomposition - climate modeling
  • fast Fourier transforms (3D, real-to-complex) - fusion
  • finite difference mesh - climate systems modeling
  • finite-element unstructured grid - chemically reacting flows
  • full configuration interaction methods (FCI) - computational chemistry
  • fully coupled implicit iterative solvers - chemically reacting flows
  • hierarchical methods - telesurgery
  • integer relation algorithms - automatic discovery of mathematical theorems
  • maximum likelihood - reconstruction of DNA of extinct species
  • multigrid algorithms - particle dynamics, material design, stress modeling, convective turbulence
  • partitioning algorithms - theorem proving
  • perturbation theory (PT) - computational chemistry
  • piecewise parabolic methods - astrophysics
  • search space parallelization - theorem proving
  • self-consistent field methods - computational chemistry
  • singular value image decomposition - the all-earth survey
  • spectral transform method - climate systems modeling
  • table lookup and pattern matching - expert systems

    Software

    While the principal focus of this workshop was on applications, a necessary adjunct to the discussions was software. The typical applications program currently runs at approximately 15% efficiency, and there is little indication that the situation will improve as problems and machines become larger. Thus, the question was raised of how algorithm designers and software developers can help ease the job of the applications programmer and (one hopes) make high-performance codes more efficient.

    Programming Languages

    One solution is to develop a better parallel programming language. Currently, no ``perfect'' parallel programming language exists. Several advances have been made, most notably Fortran M and pC++. Nevertheless, there is no standard paradigm. Low-level approaches provide inadequate support for applications, and high-level approaches offer too little control over performance. Moreover, important questions remain:

  • How can we better identify global state and determine which module last processed the data?
  • Should the programming model allow only remote access to stationary states?
  • Should debuggers analyze state transitions?
  • Should all data be tagged with state?

    Templates

    Another software approach voiced by the workshop participants was that of modularity and reuse of code. At least one group felt that there is a level of modularity at which the code can be easy to design and can be reasonably efficient. The suggestion was made, therefore, that the petaFLOPS initiative include research into common patterns, with the objective of determining the most appropriate level of code modularity.

    Building Blocks

    Related to the use of modules is the use of portable building blocks. Portable building blocks seem particularly necessary for large-scale matrix problems that arise in numerical linear algebra.

    Several packages exist for handling numerical linear algebra problems. These include EISPACK, a package of routines for solving eigenvalue problems released in 1976; LINPACK, a package of routines for solving linear systems and the singular value decomposition, released in 1979; and LAPACK, a package released in 1992 that uses the BLAS 1, 2, and 3 for vector, matrix-vector, and matrix-matrix operations, respectively.

    The limitations of LAPACK for parallel numerical computing are clear.

  • The LAPACK routines typically carry error bounds proportional to the product of the matrix order and machine epsilon. A very large matrix order thus translates to a loss of accuracy unless extended precision is employed.
  • A large matrix also requires substantial storage. Even a matrix of modest order (say, 1000) takes 8 Mbytes of storage if the matrix is stored with 64-bit numbers, and a typical LAPACK routine can require this amount or more in workspace for an order-1000 problem. (A short sketch of these estimates follows this list.)
  • Moving matrix blocks between processors is time consuming. Parallel algorithms must be designed to minimize the cost of data movement.
  • In some linear algebra algorithms, such as the bisection method for computing eigenvalues, processes are created at runtime. Keeping the processor load balanced requires an efficient dynamic scheduling mechanism.
  • The big question in parallel linear algebra is the level at which the parallelism should be introduced. Preliminary results indicate that significantly better performance can be obtained by carefully mapping all matrix elements and operations to the particular machine than can be obtained, for example, by using a parallel version of the BLAS. However, matching each code to a particular architecture means that it cannot be ported and that a new code must be developed for each new machine. Given the short lifetimes of modern multiprocessors, this is a highly undesirable situation.
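
    The storage and accuracy estimates cited in the list above can be checked with a few lines of arithmetic. The following sketch is illustrative only; the matrix order, the 64-bit word size, and the use of double-precision machine epsilon are assumptions taken from the example in the list.

      import sys

      def dense_storage_bytes(n, bytes_per_word=8):
          # A dense order-n matrix stored as n*n 64-bit (8-byte) words.
          return n * n * bytes_per_word

      def rough_error_bound(n, eps=sys.float_info.epsilon):
          # LAPACK-style error estimates grow roughly like n * machine epsilon.
          return n * eps

      n = 1000
      print(dense_storage_bytes(n) / 1e6, "Mbytes")   # 8.0 Mbytes
      print(rough_error_bound(n))                     # ~2.2e-13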

    One solution is to identify building blocks from which more portable yet efficient software can be built.

    Latency and Throughput Optimization

    Large-scale computing using, say, 1000 nodes at 300 GHz implies roughly 15 clock cycles per centimeter of signal travel. Latency thus becomes a serious problem. One software strategy for handling this latency problem is throughput optimization. By optimizing memory use (i.e., minimizing memory residence time), the programmer can dramatically reduce the impact of latency. Two approaches have been suggested: (1) fine-grained parallelism (one million active processes) and (2) context switching (job swapping while waiting for input/output). Since the most efficient data movement is block transfer, context switching within memory may be optimal.
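
    As a rough check of this figure (a sketch under the assumptions of a 300 GHz clock and a signal that propagates at about two-thirds the speed of light in a wire), the latency cost per centimeter can be estimated as follows:

      # Back-of-the-envelope latency arithmetic for a hypothetical 300 GHz machine.
      clock_hz = 300e9                # assumed clock rate
      signal_cm_per_s = 2e10          # ~2/3 the speed of light (assumption)

      cm_per_cycle = signal_cm_per_s / clock_hz
      print(f"{1.0 / cm_per_cycle:.0f} cycles of latency per centimeter")  # ~15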

    Parallel I/O

    Latency raises another issue: Moving data fast implies parallel I/O channels. Currently, I/O is handled by compilers. It need not be, however. A collective I/O interface could be developed to map data flow from n sources to m sinks, to support data redistribution, and to optimize data movement between levels of the cache hierarchy.
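
    No such interface was specified at the workshop; the following sketch merely illustrates what a collective call mapping n sources onto m sinks might look like. The function name and signature are hypothetical.

      # Hypothetical collective I/O interface: every process contributes a block
      # of data, and the library maps the n blocks onto m I/O sinks.
      def collective_write(blocks, sinks):
          """Write n blocks of bytes across m open binary files."""
          for i, block in enumerate(blocks):
              # Simple round-robin mapping; a real implementation would also
              # redistribute data and optimize movement through the cache
              # hierarchy and the I/O channels.
              sinks[i % len(sinks)].write(block)

      # Usage: collective_write(per_process_blocks, [open(p, "wb") for p in paths])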

    Architectures

    Several architectures for PetaFLOPS computer systems were proposed at the workshop. These ranged from enhancements of traditional configurations to radically new alternatives.

    Networks of Parallel Personal Computers

    One proposed scenario for high-performance PetaFLOPS computing is parallel personal computers with up to 1000 processors.

    Symmetric Multiprocessors

    An alternative to networks of personal computers is symmetric multiprocessors.

    Processing-in-Memory Architectures

    A third architecture proposed to meet the PetaFLOPS challenge is based on a processing-in-memory configuration, or PIM.

    Fine-Grained MIMD Systems

    The fourth proposed architecture for a PetaFLOPS computer is based on a single-chip fine-grained MIMD node.

    Performance Modeling

    Application performance was mentioned in most of the presentations given at the Petaflops summer workshop. Performance issues also were discussed in detail during a lengthy session on performance modeling. This section summarizes the various discussions.

    Performance Models: Benefits and Challenges

    Performance models are analytical equations that predict and evaluate the performance (usually in terms of execution time) of an application executing on a given machine or type of machine. Performance models can provide vital information about performance bottlenecks. The programmer may use this information to rewrite or transform an algorithm; the vendor may gain insight into ways of redesigning the hardware for future products.
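
    As a minimal illustration (not a model proposed at the workshop), a first-order model for a data-parallel kernel might express execution time as the sum of a computation term and a communication term; all coefficients below are assumptions that would have to be fitted to measured data.

      # Minimal first-order performance model: T(n, p) = compute + communicate.
      def predicted_time(n, p, t_flop=1e-9, t_latency=1e-5, t_word=1e-8):
          flops = 2.0 * n**3                  # e.g., a dense matrix-matrix product
          compute = (flops / p) * t_flop      # perfectly partitioned work
          communicate = p * t_latency + (n * n / p) * t_word  # startup + volume
          return compute + communicate

      print(predicted_time(n=4096, p=1024), "seconds (under the assumed coefficients)")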

    The major challenge in developing effective performance models is the derivation and validation of the analytical equations. The derivation of the analytical equations requires an in-depth understanding of the application, the processors used in the machine, the interconnection of the processors, and the granularity of parallelism extracted from the applications (e.g., data versus task). The problem here is that the person who is most knowledgeable about the application (the computational scientist) typically is not the same person who is most knowledgeable about the way the computer system operates (the computer scientist). Moreover, whereas the computational scientist focuses on the results of the application, the computer scientist focuses on core algorithms. These two groups must work more closely: the computer scientist must gain a better understanding of the complete application, not just the core algorithms; and the computational scientist (particularly one working with the so-called killer applications) must exploit the significant work that has been done in the area of core numerical algorithms.

    The validation of the analytical equations requires significant amounts of empirical data from experiments. This data is used to refine the coefficients in the equations. The empirical data is affected by various factors. For uniprocessors, these factors include the operating system, the organization of the memory hierarchy, and the branch prediction and penalty scheme. For multiprocessors, the factors include link contention and congestion, load imbalance, and synchronization effects due to the diameter of the machine.

    For Petaflops computers, performance modeling is made more difficult because of the use of more components, the complexity of the applications, and the varied programming models. Some of the industrial attendees emphasized that current performance models already are breaking down for parallel systems and that it is increasingly difficult and costly to provide performance information for customer applications. Hence, significant research is needed in the area of performance models.

    Performance Modeling Session

    The session dedicated to performance modeling consisted of two talks, one from academia and one from industry. Applications and computer scientists then held discussions for approximately three hours. The academic talk focused on the use of various metrics to compare the performance of different parallel algorithms. The industry talk focused on the synergism of core algorithms.

    Performance Metrics

    One metric used for performance modeling is the isoefficiency function, which provides insight into the scalability of a parallel application. The isoefficiency function relates the scaling of the problem size to the scaling of the number of processors necessary to maintain a fixed efficiency value. From the growth rate of this relationship, one can determine whether the parallel application is well suited to execution on a large number of processors.
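
    As a concrete (textbook, not workshop-specific) example, write the efficiency of a parallel algorithm as E = W / (W + T_o(W, p)), where W is the serial work and T_o is the total parallel overhead. Holding E fixed gives the isoefficiency relation W = (E / (1 - E)) T_o(W, p). The sketch below assumes an overhead of the form c p log2(p), as in a reduction-dominated code, and shows how fast the problem must grow to hold efficiency constant.

      import math

      # Isoefficiency sketch for an assumed overhead T_o(W, p) = c * p * log2(p).
      def required_work(p, efficiency=0.8, c=1.0):
          K = efficiency / (1.0 - efficiency)
          return K * c * p * math.log2(p)    # work needed to keep efficiency fixed

      for p in (64, 256, 1024):
          print(p, required_work(p))         # grows as p log p: modestly scalable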

    Deriving the isoefficiency function requires a speedup equation for the algorithm, and such an equation can be only a gross approximation to performance. For reliable performance evaluation, however, the coefficients are critical. Currently, most parallel performance models comprise only first-order parameters such as computation, communication, and idle times. The computation time generally reflects the number of operations necessary, with some compensation for memory latency. Further research is needed to develop performance models that include second- and third-order machine parameters. In particular, with the use of a large number of processors, network congestion may have a significant effect on the communication time, so it is desirable for performance models to include such analysis. The machine diameter may affect the synchronization time, and hence the idle time, so this parameter also should be included in accurate performance models.

    Finally, research is needed to develop tools for validating and refining the performance models. Such tools include hardware performance monitors, memory monitors, and simulators for the entire memory hierarchy.

    The Breakdown of Performance Modeling

    The industrial speaker emphasized the increasing difficulty of conducting adequate performance modeling. Indeed, the situation was characterized as being on the point of breakdown.

    The difficulty stems from the following situation. Industry is asked to measure (or predict) the performance of benchmark applications provided by the customer. The performance of the supplied customer benchmark is projected by industry based upon the performance of well-defined, standard, but isolated core algorithms. Such projections are subject to erroneous performance estimates, unaccounted-for large-system effects, and undetected system errors and bugs.

    The discussions following this talk focused on the need to develop a type of ``performance calculus'' that can accurately predict the performance of full applications from the synergism of core algorithms. Specifically, research is needed to identify the interrelationship of various core algorithms used in the same application. The applications requiring high-performance parallel systems are often very complex, entailing core algorithms from various engineering disciplines.

    Application Performance

    An almost universal theme in the applications presentations was the poor performance (relative to so-called peak performance) currently attainable on parallel machines. The discussions brought out various issues regarding methods for improving the performance.

    For example, in research on controlled magnetic fusion, an I/O traffic of 2.7 petabytes to disk in 8.1 hours of execution time is anticipated for magnetofluid calculations, with 1,200 grid points and 36,000 modes on a Petaflops machine. File sizes of 1.6 gigabytes will be written to disk every time step, for an aggregate I/O throughput of 93 gigabytes/sec. This I/O capacity needs to be accommodated in the design of a Petaflops machine.
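
    The quoted figures are mutually consistent, as a short calculation using only the numbers above shows:

      # Consistency check of the quoted fusion I/O requirements.
      total_bytes = 2.7e15            # 2.7 petabytes written to disk
      run_seconds = 8.1 * 3600        # 8.1 hours of execution
      step_bytes = 1.6e9              # 1.6 gigabytes per time step

      print(f"{total_bytes / run_seconds / 1e9:.0f} GB/s aggregate")   # ~93 GB/s
      print(f"{total_bytes / step_bytes:.2e} time steps")              # ~1.7e6 steps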

    As another example, in research on automatic test pattern generation (the process of generating a set of input patterns sufficient to detect all physical defects that may be present in a digital device after it is manufactured), most algorithms are latency limited. Scientists would like access to a machine that can pass medium-sized packets of data (approximately 1--10 kilobytes) with very low latency regardless of the number of processors.

    With specific regard to uniprocessor performance, questions were raised concerning the application and the processor architecture. In terms of the application, the following factors were identified as important:


    In terms of processor architecture, the following factors were identified:
    With specific regard to parallel performance, questions related to the application and machine were also discussed. In terms of the parallel application, the following factors were identified:
    In terms of the parallel machine, the following factors were identified:
    It was agreed that a team of computer scientists and application scientists should conduct in-depth case studies to determine the relative effect of these factors---independent of the parallel machine and the programming model.

    Findings

    Performance models can play a useful role in assessing and predicting the performance of an application. For effective use for Petaflops applications, however, researchers must carry out the following tasks:
      1. Develop performance models that incorporate second- and third-order machine parameters, in particular, analysis of network congestion and of synchronization time resulting from machine diameter.

      2. Design tools for validating and refining the performance models; these tools include hardware performance monitors and memory monitors for the entire memory hierarchy.

      3. Analyze the interrelationship of core algorithms taken from various disciplines and used in the same application.

      4. Devise a methodology or technique for using the detailed model of a core algorithm to predict the performance of a full application.


    Major Findings

    The PetaFLOPS Summer Workshop provided the most detailed examination to date of the opportunities for application of petaFLOPS-scale computing to important problems in science, engineering, information management, and commerce. It exposed the computational requirements and scaling properties of a number of programs and related these to the characteristics and capabilities of proposed architectures.

    This section captures the principal findings of the workshop with brief explanations.

      1. Many significant applications have been identified that either demand or would greatly benefit from computers capable of petaFLOPS performance.

      2. PetaFLOPS computer systems will be realizable through enhanced natural evolution of technology and systems architecture in 20 to 25 years using largely commercially available components. But processor architectures will require enhancements for broadest applicability at petaFLOPS scale, which include

      3. Latency will be a major factor in determining performance efficiency. Latency management in hardware and software will be a pacing element in achieving sustained petaFLOPS performance for a broad range of real-world problems.

      4. Memory capacity requirements vary dramatically across the array of important petaFLOPS applications.

      5. Limitations of parallelism in some petaFLOPS-scale applications will require system architectures to run at the highest possible clock rates within the constraints of economic viability. Fewer, higher-performance processors are preferred over a greater number of lightweight processors.

      6. PetaFLOPS architectures may be feasible in a decade, at least for domain-specific applications, if alternative approaches other than those based on commercial microprocessor technology are pursued.

      7. Proposed petaFLOPS architectures are insufficiently well defined and characterized to enable detailed applications studies to be conducted.

      8. PetaFLOPS application programming will require formalisms incorporating the necessary constructs for better application-driven management of resources to circumvent latencies, expose rich forms of parallelism, and balance workloads.


    Preliminary Recommendations

    The PetaFLOPS Applications Workshop identified important candidate applications for PetaFLOPS computing and validated the feasibility of implementing a petaFLOPS architecture in the twenty-year timeframe. Nevertheless, many questions remain concerning efficiency, programmability, cost, nearer-term opportunities, generality, and other issues related to the practical exploitation of petaFLOPS. Some recommendations central to advancing the state of petaFLOPS computing as it is currently conceived are provided in this section.

      1. Perform further applications studies to

      A preliminary list of limiting factors includes the following:

      + degree of parallelism,
      + nonparallel (scalar) regions,
      + the critical path length of the code,
      + the mix of instruction types (memory reference, data operation),
      + data movement and access patterns,
      + critical, low-latency areas for memory access,
      + critical, high-bandwidth requirements for memory access,
      + need for barrier synchronization,
      + data layout at run time and prediction of changes in data locality as computation proceeds,
      + overall scalability (presuming increases in memory size or CPU performance), and
      + need for 128-bit precision.

      2. Support research into specific advanced architecture techniques:

      3. Sponsor development exploration in advanced technologies:

      4. Conduct point design studies of petaFLOPS architectures that are more detailed than the current architecture sketches. These point designs should be accompanied by simulators and possibly prototypes that are capable of running key target applications. Such designs should relax conventional assumptions concerning system structure and constituent elements, including the use of microprocessors, program counter-driven flow control, classic memory hierarchy, and logic/byte balance. The designs should be carried out by teams with expertise in both architecture and applications. Four different system types should be considered:

      5. Develop performance models to determine suitability and predict performance across the array of architectures in support of identified petaFLOPS applications. The models should include ``applications performance models,'' which can help designers evaluate the need for so-called fat (high-performance CPUs, high memory) nodes, and ``machine models,'' which describe memory size, hierarchy, and management scheme; ability to multithread or overlap serial code execution; special instructions to reduce computation load; communications architecture and capabilities; and I/O architecture. The two types of model should be used together to determine the overall match of a particular architecture to a specific application.

      6. Fund advanced system software research in runtime systems and innovative compiler-driven resource management heuristics. Basic issues include

      7. Sponsor additional workshops in architecture, algorithms, and system software related to petaFLOPS computing to extend the basis of consideration and expand the research community. Issues to be addressed should include


      8. Establish a standing petaFLOPS oversight committee to monitor ongoing research, assess progress, and recommend new initiatives.


    Additional Information

    Additional information about the PetaFLOPS II workshop is available as follows:


