DRAFT - Feb. 1996
PetaFLOPS computing---computing at a rate of a million billion floating-point operations per second---is expected to be feasible within the next twenty years. Ensuring that what may be feasible becomes real and usable presents major technical challenges. To provide a forum for discussion of those challenges, the workshop on ``Applications and Algorithm Challenges for PetaFLOPS Computing'' was held on August 14--23, 1995, in Bodega Bay, California.
The workshop was the second in a series of meetings that started in Pasadena, California, in 1994. The first workshop focused on the technologies needed for PetaFLOPS computing systems. Complementing and expanding on that initial effort, the 1995 workshop centered on innovative applications and algorithms for PetaFLOPS systems. Sponsored by five federal agencies---ARPA, DOE, NASA, NSA, and NSF---this workshop brought together a dynamic mix of established and emerging researchers from government, industry, academia, and national laboratories.
The primary goal of this workshop was to develop a firmer understanding of the types of applications PetaFLOPS will make possible and what the barriers and technological research issues are. A secondary goal was to provide a roadmap for algorithm and applications development over the next five, ten, and twenty years.
The following objectives were identified to achieve these goals:
As a first step toward establishing detailed case studies, the participants prepared an alphabetical list categorizing scientific and engineering disciplines that could benefit from access to a petaFLOPS computer.
Subsequently, for each discipline, specific requirements for such factors as memory, I/O, and parallelism were calculated and listed. The list is organized by runtime requirements at a sustained rate of 1 petaFLOPS.
Representative examples from the lists were then expanded into detailed case studies.
In some application areas, algorithmic improvements have yielded gains comparable to the gains from faster hardware. Algorithms are extremely important for two other reasons:
Based on an analysis of the individual application case studies, we were able to prepare a list of algorithms that an effective petaFLOPS computer will need to handle.
While the principal focus of this workshop was on applications, a necessary adjunct to the discussions was software. The typical applications program currently runs at approximately 15% efficiency, and there is little indication that the situation will improve as problems and machines become larger. Thus, the issue was raised of how algorithm designers and software developers can help ease the job of the applications programmer and (one hopes) make high-performance codes more efficient.
One solution is to develop a better parallel programming language. Currently, no ``perfect'' parallel programming language exists. Several advances have been made, most notably Fortran M and pC++. Nevertheless, there is no standard paradigm. Low-level approaches provide inadequate support for applications, and high-level approaches provide too little control over performance. Moreover, important questions remain:
Another software approach voiced by the workshop participants was that of modularity and reuse of code. At least one group felt that there is a level of modularity at which the code can be easy to design and can be reasonably efficient. The suggestion was made, therefore, that the petaFLOPS initiative include research into common patterns, with the objective of determining the most appropriate level of code modularity.
Related to the use of modules is the use of portable building blocks. Portable building blocks seem particularly necessary for the large-scale matrix problems that arise in numerical linear algebra.
Several packages exist for handling numerical linear algebra problems. These include EISPACK, a package of routines for solving eigenvalue problems released in 1976; LINPACK, a package of routines for solving linear systems and the singular value decomposition, released in 1979; and LAPACK, a package released in 1992 that uses the BLAS 1, 2, and 3 for vector, matrix-vector, and matrix-matrix operations, respectively.
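The three BLAS levels cited above correspond to vector, matrix-vector, and matrix-matrix kernels. The sketch below shows only the operation each level performs and why Level 3 is the basis of LAPACK's blocked algorithms; these plain-Python loops are illustrative stand-ins, not the tuned BLAS routines themselves.

```python
# Illustrative (untuned) versions of the three BLAS levels.
# Real BLAS implementations are heavily optimized; these show only the
# operation each level performs and its arithmetic intensity.

def axpy(alpha, x, y):
    """Level 1 (vector): y <- alpha*x + y. O(n) flops on O(n) data."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]

def gemv(A, x):
    """Level 2 (matrix-vector): y <- A x. O(n^2) flops on O(n^2) data."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def gemm(A, B):
    """Level 3 (matrix-matrix): C <- A B. O(n^3) flops on O(n^2) data,
    which is why LAPACK builds on Level 3 for data reuse."""
    n, m, k = len(A), len(B[0]), len(B)
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]
```

The increasing ratio of operations to data moved, from Level 1 to Level 3, is what lets blocked Level-3 code amortize memory traffic across many flops.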
The limitations of LAPACK for parallel numerical computing are clear.
The big question in parallel linear algebra is the level at which the parallelism should be introduced. Preliminary results indicate that significantly better performance can be obtained by carefully mapping all matrix elements and operations to the particular machine than can be obtained, for example, by using a parallel version of the BLAS. However, matching each code to a particular architecture means that it cannot be ported and that a new code must be developed for each new machine. Given the short lifetimes of modern multiprocessors, this is a highly undesirable situation.
One solution is to identify building blocks from which more portable yet efficient software can be built.
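One concrete reading of such a building block is a blocked kernel whose loop structure is portable while a single machine-specific parameter (the block size) carries the tuning. A minimal sketch, with the block size as the assumed tuning knob:

```python
def blocked_matmul(A, B, block=64):
    """Blocked matrix multiply over square matrices: the loop structure is
    portable, while the block size would be tuned per machine (cache size,
    node memory) to recover most of the achievable performance."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                # Multiply one pair of sub-blocks, accumulating into C.
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, n)):
                        aik = A[i][k]
                        for j in range(jj, min(jj + block, n)):
                            C[i][j] += aik * B[k][j]
    return C
```

Retargeting such a kernel to a new machine then means re-tuning one parameter rather than rewriting the code.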
Large-scale computing using, say, 1000 nodes at 300 GHz implies 15 cycles/cm. Latency thus becomes a serious problem. One software strategy for handling this latency problem is throughput optimization. By optimizing memory use (i.e., minimizing memory residence time), the programmer can dramatically reduce the latency. Two approaches have been suggested: (1) fine-grained parallelism (one million active processes) and (2) context switching (job swapping while waiting for input/output). Since the most efficient data movement is block transfer, context switching within memory may be optimal.
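The 15 cycles/cm figure follows from signal propagation speed. Assuming signals travel at roughly two-thirds the speed of light in a wire (an assumption, but a standard one), a 300 GHz clock ticks about 15 times while a signal crosses one centimeter:

```python
# Back-of-envelope check of the 15 cycles/cm figure.
# Assumption: signals propagate at ~2/3 c in a wire, i.e., ~2.0e8 m/s.
SIGNAL_SPEED_M_PER_S = 2.0e8
CLOCK_HZ = 300e9  # 300 GHz

cm_per_cycle = SIGNAL_SPEED_M_PER_S / CLOCK_HZ * 100  # distance per tick
cycles_per_cm = 1.0 / cm_per_cycle

print(round(cycles_per_cm))  # 15 cycles elapse while a signal crosses 1 cm
```

At machine-room distances of meters, a single round trip therefore costs thousands of cycles, which is why latency hiding dominates the software discussion.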
Latency raises another issue: moving data fast implies parallel I/O channels. Currently, I/O is handled by compilers. It need not be, however. A collective I/O interface could be developed to map data flow from n sources to m sinks, to support data redistribution, and to optimize data movement across the cache hierarchy.
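The n-sources-to-m-sinks mapping can be sketched as a redistribution step: the runtime sees all the sources' data jointly and carves it into m contiguous pieces, so each sink receives one large block transfer rather than many small ones. The function below is a hypothetical illustration, not an existing library interface:

```python
def redistribute(source_blocks, m):
    """Hypothetical collective-I/O redistribution: given the per-source data
    blocks from n sources, form the global data view and carve it into m
    contiguous pieces, one large transfer per sink."""
    flat = [x for block in source_blocks for x in block]  # global data view
    per_sink = (len(flat) + m - 1) // m                   # ceiling division
    return [flat[i * per_sink:(i + 1) * per_sink] for i in range(m)]
```

A real interface would move descriptors rather than copying data, but the essential point is the same: the collective view enables block transfers that per-process I/O cannot.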
Several architectures for PetaFLOPS computer systems were proposed at the workshop. These ranged from enhancements of traditional configurations to radically new alternatives.
One proposed scenario for high-performance PetaFLOPS computing is parallel personal computers with up to 1000 processors.
An alternative to networks of personal computers is symmetric multiprocessors.
A third architecture proposed to meet the PetaFLOPS challenge is based on a processing-in-memory configuration, or PIM.
The fourth proposed architecture for a PetaFLOPS computer is based on a single-chip fine-grained MIMD node.
Application performance was mentioned in most of the presentations given at the PetaFLOPS Summer Workshop. Performance issues also were discussed in detail during a lengthy session on performance modeling. This section summarizes the various discussions.
The major challenge in developing effective performance models is the derivation and validation of the analytical equations. The derivation of the analytical equations requires an in-depth understanding of the application, the processors used in the machine, the interconnection of the processors, and the granularity of parallelism extracted from the applications (e.g., data versus task). The problem here is that the person who is most knowledgeable about the application (the computational scientist) typically is not the same person who is most knowledgeable about the way the computer system operates (the computer scientist). Moreover, whereas the computational scientist focuses on the results of the application, the computer scientist focuses on core algorithms. These two groups must work more closely: the computer scientist must gain a better understanding of the complete application, not just the core algorithms; and the computational scientist (particularly one working with the so-called killer applications) must exploit the significant work that has been done in the area of core numerical algorithms.
The validation of the analytical equations requires significant amounts of empirical data from experiments. This data is used to refine the coefficients in the equations. The empirical data is affected by various factors. For uniprocessors, these factors include the operating system, the organization of the memory hierarchy, and the branch prediction and penalty scheme. For multiprocessors, the factors include link contention and congestion, load imbalance, and synchronization effects due to the diameter of the machine.
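Refining coefficients against empirical data typically means a least-squares fit. As one common instance, communication time is often modeled as T(m) = alpha + beta*m (fixed latency plus per-byte cost), and the two coefficients are recovered from timing measurements. A minimal sketch with synthetic data (the latency and bandwidth values are illustrative, not from the workshop):

```python
def fit_linear(sizes, times):
    """Least-squares fit of T(m) = alpha + beta*m, the standard two-parameter
    latency/bandwidth model, to measured (message size, time) pairs."""
    n = len(sizes)
    mx = sum(sizes) / n
    my = sum(times) / n
    beta = (sum((x - mx) * (y - my) for x, y in zip(sizes, times))
            / sum((x - mx) ** 2 for x in sizes))
    alpha = my - beta * mx
    return alpha, beta

# Synthetic measurements: 10-microsecond latency, 100 MB/s bandwidth
# (assumed values for illustration).
sizes = [1000, 2000, 4000, 8000]                # bytes
times = [10e-6 + 1e-8 * s for s in sizes]       # seconds
alpha, beta = fit_linear(sizes, times)          # recovers ~1e-5 s, ~1e-8 s/byte
```

With noisy real measurements the same fit yields the refined coefficients the validation step requires.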
For Petaflops computers, performance modeling is made more difficult because of the use of more components, the complexity of the applications, and the varied programming models. Some of the industrial attendees emphasized that current performance models already are breaking down for parallel systems and that it is increasingly difficult and costly to provide performance information for customer applications. Hence, significant research is needed in the area of performance models.
One metric used for performance modeling is the isoefficiency function, which provides insight into the scalability of a parallel application. The isoefficiency function relates the scaling of the problem to the scaling of the number of processors necessary to maintain a fixed efficiency value. Depending on the complexity of this relationship, one can determine whether the parallel application is well suited to execution on a large number of processors.
The isoefficiency function is derived from the speedup equation, which can be only a gross approximation to performance. For reliable performance evaluation, however, the coefficients are critical. Currently, most parallel performance models comprise only first-order parameters such as computation, communication, and idle times. The computation time generally reflects the number of operations necessary, with some compensation for memory latency. Further research is needed to develop performance models that include second- and third-order machine parameters. In particular, with a large number of processors, network congestion may have a significant effect on the communication time; hence, it is desirable for performance models to include such analysis. The machine diameter may affect the synchronization time, and hence the idle time; this parameter also should be included in accurate performance models.
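A concrete instance of the isoefficiency function: assume the illustrative cost model T1 = n, Tp = n/p + log2(p) (roughly a parallel reduction). Then efficiency is E = n/(n + p*log2(p)), and holding E fixed requires the problem size to grow as n = E/(1-E) * p * log2(p), that is, faster than linearly in p:

```python
import math

def efficiency(n, p):
    """E = T1/(p*Tp) under the assumed model T1 = n, Tp = n/p + log2(p)."""
    return n / (n + p * math.log2(p))

def iso_n(E, p):
    """Isoefficiency: problem size needed to hold efficiency E on p
    processors. Solving E = n/(n + p*log2(p)) for n gives
    n = E/(1-E) * p * log2(p)."""
    return E / (1.0 - E) * p * math.log2(p)

# Problem size required to stay at 80% efficiency as p grows:
required = {p: iso_n(0.8, p) for p in (16, 256, 4096)}
```

The shape of iso_n in p is the scalability verdict: a near-linear growth is favorable, while a steep superlinear growth means the application saturates quickly.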
Finally, research is needed to develop tools for validating and refining the performance models. Such tools include hardware performance monitors, memory monitors, and simulators for the entire memory hierarchy.
The Breakdown of Performance Modeling
The industrial speaker emphasized the increasing difficulty of conducting adequate performance modeling. Indeed, the situation was characterized as being on the point of breakdown.
The difficulty stems from the following situation. Industry is asked to measure (or predict) the performance of benchmark applications provided by the customer. The performance of the supplied customer benchmark is projected by industry based on the performance of well-defined, standard, but isolated core algorithms. Such projections are subject to erroneous performance estimates, unaccounted-for large-system effects, and undetected system errors and bugs.
The discussions following this talk focused on the need to develop a type of ``performance calculus'' that can accurately predict the performance of full applications from the synergism of core algorithms. Specifically, research is needed to identify the interrelationship of the various core algorithms used in the same application. The applications requiring high-performance parallel systems are often very complex, entailing core algorithms from various engineering disciplines.
For example, in research on controlled magnetic fusion, an I/O traffic of 2.7 petabytes to disk in 8.1 hours of execution time is anticipated for magnetofluid calculations, with 1,200 grid points and 36,000 modes on a Petaflops machine. File sizes of 1.6 gigabytes will be written to disk every time step, for an aggregate I/O throughput of 93 gigabytes/sec. This I/O capacity needs to be accommodated in the design of a Petaflops machine.
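The aggregate throughput figure can be checked directly from the quoted volume and run time:

```python
# Check of the magnetofluid I/O figures quoted above:
# 2.7 petabytes written to disk over 8.1 hours of execution.
total_bytes = 2.7e15
seconds = 8.1 * 3600

throughput_gb_per_s = total_bytes / seconds / 1e9
print(round(throughput_gb_per_s))  # prints 93 -- gigabytes/sec, as stated
```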
As another example, in research on automatic test pattern generation (the process of generating a set of input patterns sufficient to detect all physical defects that may be present in a digital device after it is manufactured), most algorithms are latency limited. Scientists would like access to a machine that is able to pass medium-sized packets of data (approximately 1--10 kilobytes) with very low latencies regardless of the number of processors.
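``Latency limited'' can be made quantitative with the same alpha + m/beta transfer model used above: for 1--10 kilobyte packets, the fixed per-message cost dominates. The latency and bandwidth values below are assumed for illustration:

```python
# Why medium-sized packets are latency limited, under assumed values:
# 10-microsecond per-message latency, 1 GB/s link bandwidth.
LATENCY_S = 10e-6
BANDWIDTH_B_PER_S = 1e9

def transfer_time(message_bytes):
    """Transfer time = fixed latency + size/bandwidth."""
    return LATENCY_S + message_bytes / BANDWIDTH_B_PER_S

for size in (1_000, 10_000):
    frac = LATENCY_S / transfer_time(size)
    # At 1 KB, ~91% of the transfer time is fixed latency; at 10 KB, ~50%.
```

Under such a regime, adding bandwidth barely helps; only lower per-message latency does, which is exactly the machine property the scientists asked for.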
With specific regard to uniprocessor performance, questions were raised concerning the application and the processor architecture. In terms of the application, the following factors were identified as important:
2. Design tools for validating and refining the performance models; these tools include hardware performance monitors and memory monitors for the entire memory hierarchy.
3. Analyze the interrelationship of core algorithms taken from various disciplines and used in the same application.
4. Devise a methodology or technique for using the detailed model of a core algorithm to predict the performance of a full application.
The PetaFLOPS Summer Workshop provided the most detailed examination to date of the opportunities for application of petaFLOPS-scale computing to important problems in science, engineering, information management, and commerce. It exposed the computational requirements and scaling properties of a number of programs and related these to the characteristics and capabilities of proposed architectures.
This section captures the principal findings of the workshop with brief explanations.
2. PetaFLOPS computer systems will be realizable through enhanced natural evolution of technology and systems architecture in 20 to 25 years using largely commercially available components. But processor architectures will require enhancements for broadest applicability at petaFLOPS scale, which include
3. Latency will be a major factor in determining performance efficiency. Latency management in hardware and software will be a pacing element in achieving sustained petaFLOPS performance for a broad range of real-world problems.
4. Memory capacity requirements vary dramatically across the array of important petaFLOPS applications.
6. PetaFLOPS architectures may be feasible in a decade, at least for domain-specific applications, if approaches other than those based on commercial microprocessor technology are pursued.
7. Proposed petaFLOPS architectures are insufficiently well defined and characterized to enable detailed applications studies to be conducted.
8. PetaFLOPS application programming will require formalisms incorporating the necessary constructs for better application-driven management of resources to circumvent latencies, expose rich forms of parallelism, and balance workloads.
The PetaFLOPS Applications Workshop determined the availability of important candidate applications for PetaFLOPS computing and validated the feasibility of implementing a petaFLOPS architecture in the twenty-year timeframe. Nevertheless, many questions remain concerning efficiency, programmability, cost, nearer-term opportunities, generality, and other issues related to the practical exploitation of petaFLOPS. Some recommendations central to advancing the state of petaFLOPS computing as it is currently conceived are provided in this section.
A preliminary list of limiting factors includes the following:
+ degree of parallelism,
+ nonparallel (scalar) regions,
+ the critical path length of the code,
+ the mix of instruction types (memory reference, data operation),
+ data movement and access patterns,
+ critical, low-latency areas for memory access,
+ critical, high-bandwidth requirements for memory access,
+ need for barrier synchronization,
+ data layout at run time and prediction of changes in data locality as computation proceeds,
+ overall scalability (presuming increases in memory size or CPU performance), and
+ need for 128-bit precision.
2. Support research into specific advanced architecture techniques:
4. Conduct point design studies of petaFLOPS architectures that are more detailed than the current architecture sketches. These point designs should be accompanied by simulators and possibly prototypes that are capable of running key target applications. Such designs should relax conventional assumptions concerning system structure and constituent elements, including the use of microprocessors, program counter-driven flow control, classic memory hierarchy, and logic/byte balance. The designs should be carried out by teams with expertise in both architecture and applications. Four different system types should be considered:
5. Develop performance models to determine suitability and predict performance across the array of architectures in support of identified petaFLOPS applications. The models should include ``applications performance models,'' which can help designers evaluate the need for so-called fat (high-performance CPUs, high memory) nodes, and ``machine models,'' which describe memory size, hierarchy, and management scheme; ability to multithread or overlap serial code execution; special instructions to reduce computation load; communications architecture and capabilities; and I/O architecture. The two types of model should be used together to determine the overall match of a particular architecture to a specific application.
6. Fund advanced system software research in runtime systems and innovative compiler-driven resource management heuristics. Basic issues include
7. Sponsor additional workshops in architecture, algorithms, and system software related to petaFLOPS computing to extend the basis of consideration and expand the research community. Issues to be addressed should include
8. Establish a standing petaFLOPS oversight committee to monitor ongoing research, assess progress, and recommend new initiatives.
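The pairing of applications performance models and machine models called for in recommendation 5 can be made concrete with a roofline-style estimate: the application model supplies flop and byte counts, the machine model supplies peak rates, and the predicted time is bounded by whichever resource saturates first. All numbers below are hypothetical:

```python
def predict_time(app, machine):
    """Combine an application model (flops, bytes moved) with a machine
    model (peak flop rate, memory bandwidth): execution time is bounded
    below by whichever resource saturates first."""
    compute_time = app["flops"] / machine["peak_flops"]
    memory_time = app["bytes"] / machine["bandwidth"]
    return max(compute_time, memory_time)

app = {"flops": 1e18, "bytes": 4e15}                  # hypothetical application
fat_node = {"peak_flops": 1e15, "bandwidth": 1e13}    # hypothetical machines
thin_node = {"peak_flops": 1e15, "bandwidth": 1e12}

# Fat nodes: compute bound (~1000 s); thin nodes: memory bound (~4000 s).
```

Comparing the two predictions is precisely the fat-node-versus-thin-node question the applications performance model is meant to answer.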
Additional information about the PetaFLOPS II workshop is available as follows: