DRAFT - Feb. 1996
PetaFLOPS computing---computing at a rate of a million billion floating-point operations per second---is expected to be feasible within the next twenty years. Ensuring that what may be feasible becomes real and usable presents major technical challenges. To provide a forum for discussion of those challenges, the workshop on ``Applications and Algorithm Challenges for PetaFLOPS Computing'' was held on August 14--23, 1995, in Bodega Bay, California.
The workshop was the second in a series of meetings that started in Pasadena, California, in 1994. The first workshop focused on the technologies needed for PetaFLOPS computing systems. Complementing and expanding on that initial effort, the 1995 workshop centered on innovative applications and algorithms for PetaFLOPS systems. Sponsored by five federal agencies---ARPA, DOE, NASA, NSA, and NSF---this workshop brought together a dynamic mix of established and emerging researchers from government, industry, academia, and national laboratories.
The primary goal of this workshop was to develop a firmer understanding of the types of applications PetaFLOPS will make possible and what the barriers and technological research issues are. A secondary goal was to provide a roadmap for algorithm and applications development over the next five, ten, and twenty years.
The following objectives were identified to achieve these goals:
As a first step toward establishing detailed case studies, the participants prepared an alphabetical list categorizing scientific and engineering disciplines that could benefit from access to a petaFLOPS computer.
Subsequently, for each discipline, specific requirements for such factors as memory, I/O, and parallelism were calculated and listed. The list is organized by runtime requirements at a sustained rate of 1 petaFLOPS.
Representative examples from the lists were then expanded into detailed case studies.
In some application areas, algorithmic improvements have yielded gains comparable to the gains from faster hardware. Algorithms are extremely important for two other reasons:
Based on an analysis of the individual application case studies, we were able to prepare a list of algorithms that an effective petaFLOPS computer will need to handle.
While the principal focus of this workshop was on applications, a necessary adjunct to the discussions was software. The typical applications program currently runs at approximately 15% efficiency, and there is little indication that the situation will improve as problems and machines become larger. Thus, the issue was raised of how algorithm designers and software developers can help ease the job of the applications programmer and (one hopes) make high-performance codes more efficient.
One solution is to develop a better parallel programming language. Currently, no ``perfect'' parallel programming language exists. Several advances have been made, most notably Fortran M and pC++. Nevertheless, there is no standard paradigm. Low-level approaches provide inadequate support for applications, and high-level approaches provide too little control over performance. Moreover, important questions remain:
Another software approach voiced by the workshop participants was that of modularity and reuse of code. At least one group felt that there is a level of modularity at which the code can be easy to design and can be reasonably efficient. The suggestion was made, therefore, that the petaFLOPS initiative include research into common patterns, with the objective of determining the most appropriate level of code modularity.
Related to the use of modules is the use of portable building blocks. Portable building blocks seem particularly necessary for the large-scale matrix problems that arise in numerical linear algebra.
Several packages exist for handling numerical linear algebra problems. These include EISPACK, a package of routines for solving eigenvalue problems released in 1976; LINPACK, a package of routines for solving linear systems and the singular value decomposition, released in 1979; and LAPACK, a package released in 1992 that uses the BLAS 1, 2, and 3 for vector, matrix-vector, and matrix-matrix operations, respectively.
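The three BLAS levels cited above correspond to vector, matrix-vector, and matrix-matrix kernels. The sketch below shows only the operation each level performs and why Level 3 is the basis of LAPACK's blocked algorithms; these plain-Python loops are illustrative stand-ins, not the tuned BLAS routines themselves.

```python
# Illustrative (untuned) versions of the three BLAS levels.
# Real BLAS implementations are heavily optimized; these show only the
# operation each level performs and its arithmetic intensity.

def axpy(alpha, x, y):
    """Level 1 (vector): y <- alpha*x + y. O(n) flops on O(n) data."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]

def gemv(A, x):
    """Level 2 (matrix-vector): y <- A x. O(n^2) flops on O(n^2) data."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def gemm(A, B):
    """Level 3 (matrix-matrix): C <- A B. O(n^3) flops on O(n^2) data,
    which is why LAPACK builds on Level 3 for data reuse."""
    n, m, k = len(A), len(B[0]), len(B)
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]
```

The increasing ratio of operations to data moved, from Level 1 to Level 3, is what lets blocked Level-3 code amortize memory traffic across many flops.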
The limitations of LAPACK for parallel numerical computing are clear.
The big question in parallel linear algebra is the level at which the parallelism should be introduced. Preliminary results indicate that significantly better performance can be obtained by carefully mapping all matrix elements and operations to the particular machine than can be obtained, for example, by using a parallel version of the BLAS. However, matching each code to a particular architecture means that it cannot be ported and that a new code must be developed for each new machine. Given the short lifetimes of modern multiprocessors, this is a highly undesirable situation.
One solution is to identify building blocks from which more portable yet efficient software can be built.
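One concrete reading of such a building block is a blocked kernel whose loop structure is portable while a single machine-specific parameter (the block size) carries the tuning. A minimal sketch, with the block size as the assumed tuning knob:

```python
def blocked_matmul(A, B, block=64):
    """Blocked matrix multiply over square matrices: the loop structure is
    portable, while the block size would be tuned per machine (cache size,
    node memory) to recover most of the achievable performance."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                # Multiply one pair of sub-blocks, accumulating into C.
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, n)):
                        aik = A[i][k]
                        for j in range(jj, min(jj + block, n)):
                            C[i][j] += aik * B[k][j]
    return C
```

Retargeting such a kernel to a new machine then means re-tuning one parameter rather than rewriting the code.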
Large-scale computing using, say, 1000 nodes at 300 GHz implies 15 cycles/cm. Latency thus becomes a serious problem. One software strategy for handling this latency problem is throughput optimization. By optimizing memory use (i.e., minimizing memory residence time), the programmer can dramatically reduce the latency. Two approaches have been suggested: (1) fine-grained parallelism (one million active processes) and (2) context switching (job swapping while waiting for input/output). Since the most efficient data movement is block transfer, context switching within memory may be optimal.
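The 15 cycles/cm figure follows from signal propagation speed. Assuming signals travel at roughly two-thirds the speed of light in a wire (an assumption, but a standard one), a 300 GHz clock ticks about 15 times while a signal crosses one centimeter:

```python
# Back-of-envelope check of the 15 cycles/cm figure.
# Assumption: signals propagate at ~2/3 c in a wire, i.e., ~2.0e8 m/s.
SIGNAL_SPEED_M_PER_S = 2.0e8
CLOCK_HZ = 300e9  # 300 GHz

cm_per_cycle = SIGNAL_SPEED_M_PER_S / CLOCK_HZ * 100  # distance per tick
cycles_per_cm = 1.0 / cm_per_cycle

print(round(cycles_per_cm))  # 15 cycles elapse while a signal crosses 1 cm
```

At machine-room distances of meters, a single round trip therefore costs thousands of cycles, which is why latency hiding dominates the software discussion.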
Latency raises another issue: moving data fast implies parallel I/O channels. Currently, I/O is handled by compilers. It need not be, however. A collective I/O interface could be developed to map data flow from n sources to m sinks, to support data redistribution, and to optimize data movement across the cache hierarchy.
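The n-sources-to-m-sinks mapping can be sketched as a redistribution step: the runtime sees all the sources' data jointly and carves it into m contiguous pieces, so each sink receives one large block transfer rather than many small ones. The function below is a hypothetical illustration, not an existing library interface:

```python
def redistribute(source_blocks, m):
    """Hypothetical collective-I/O redistribution: given the per-source data
    blocks from n sources, form the global data view and carve it into m
    contiguous pieces, one large transfer per sink."""
    flat = [x for block in source_blocks for x in block]  # global data view
    per_sink = (len(flat) + m - 1) // m                   # ceiling division
    return [flat[i * per_sink:(i + 1) * per_sink] for i in range(m)]
```

A real interface would move descriptors rather than copying data, but the essential point is the same: the collective view enables block transfers that per-process I/O cannot.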
Several architectures for PetaFLOPS computer systems were proposed at the workshop. These ranged from enhancements of traditional configurations to radically new alternatives.
One proposed scenario for high-performance PetaFLOPS computing is parallel personal computers with up to 1000 processors.
An alternative to networks of personal computers is symmetric multiprocessors.
A third architecture proposed to meet the PetaFLOPS challenge is based on a processing-in-memory configuration, or PIM.
The fourth proposed architecture for a PetaFLOPS computer is based on a single-chip fine-grained MIMD node.
Application performance was mentioned in most of the presentations given at the PetaFLOPS Summer Workshop. Performance issues also were discussed in detail during a lengthy session on performance modeling. This section summarizes the various discussions.
The major challenge in developing effective performance models is the derivation and validation of the analytical equations. The derivation of the analytical equations requires an in-depth understanding of the application, the processors used in the machine, the interconnection of the processors, and the granularity of parallelism extracted from the applications (e.g., data versus task). The problem here is that the person who is most knowledgeable about the application (the computational scientist) typically is not the same person who is most knowledgeable about the way the computer system operates (the computer scientist). Moreover, whereas the computational scientist focuses on the results of the application, the computer scientist focuses on core algorithms. These two groups must work more closely: the computer scientist must gain a better understanding of the complete application, not just the core algorithms; and the computational scientist (particularly one working with the so-called killer applications) must exploit the significant work that has been done in the area of core numerical algorithms.
The validation of the analytical equations requires significant amounts of empirical data from experiments. This data is used to refine the coefficients in the equations. The empirical data is affected by various factors. For uniprocessors, these factors include the operating system, the organization of the memory hierarchy, and the branch prediction and penalty scheme. For multiprocessors, the factors include link contention and congestion, load imbalance, and synchronization effects due to the diameter of the machine.
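Refining coefficients against empirical data typically means a least-squares fit. As one common instance, communication time is often modeled as T(m) = alpha + beta*m (fixed latency plus per-byte cost), and the two coefficients are recovered from timing measurements. A minimal sketch with synthetic data (the latency and bandwidth values are illustrative, not from the workshop):

```python
def fit_linear(sizes, times):
    """Least-squares fit of T(m) = alpha + beta*m, the standard two-parameter
    latency/bandwidth model, to measured (message size, time) pairs."""
    n = len(sizes)
    mx = sum(sizes) / n
    my = sum(times) / n
    beta = (sum((x - mx) * (y - my) for x, y in zip(sizes, times))
            / sum((x - mx) ** 2 for x in sizes))
    alpha = my - beta * mx
    return alpha, beta

# Synthetic measurements: 10-microsecond latency, 100 MB/s bandwidth
# (assumed values for illustration).
sizes = [1000, 2000, 4000, 8000]                # bytes
times = [10e-6 + 1e-8 * s for s in sizes]       # seconds
alpha, beta = fit_linear(sizes, times)          # recovers ~1e-5 s, ~1e-8 s/byte
```

With noisy real measurements the same fit yields the refined coefficients the validation step requires.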
For Petaflops computers, performance modeling is made more difficult because of the use of more components, the complexity of the applications, and the varied programming models. Some of the industrial attendees emphasized that current performance models already are breaking down for parallel systems and that it is increasingly difficult and costly to provide performance information for customer applications. Hence, significant research is needed in the area of performance models.
One metric used for performance modeling is the isoefficiency function, which provides insight into the scalability of a parallel application. The isoefficiency function relates the scaling of the problem to the scaling of the number of processors necessary to maintain a fixed efficiency value. Depending on the complexity of this relationship, one can determine whether the parallel application is well suited to execution on a large number of processors.
The isoefficiency function is derived from the speedup equation, which can be only a gross approximation to performance. For reliable performance evaluation, however, the coefficients are critical. Currently, most parallel performance models comprise only first-order parameters such as computation, communication, and idle times. The computation time generally reflects the number of operations necessary, with some compensation for memory latency. Further research is needed to develop performance models that include second- and third-order machine parameters. In particular, with a large number of processors, network congestion may have a significant effect on the communication time; hence, it is desirable for performance models to include such analysis. The machine diameter may affect the synchronization time, and hence the idle time; this parameter also should be included in accurate performance models.
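A concrete instance of the isoefficiency function: assume the illustrative cost model T1 = n, Tp = n/p + log2(p) (roughly a parallel reduction). Then efficiency is E = n/(n + p*log2(p)), and holding E fixed requires the problem size to grow as n = E/(1-E) * p * log2(p), that is, faster than linearly in p:

```python
import math

def efficiency(n, p):
    """E = T1/(p*Tp) under the assumed model T1 = n, Tp = n/p + log2(p)."""
    return n / (n + p * math.log2(p))

def iso_n(E, p):
    """Isoefficiency: problem size needed to hold efficiency E on p
    processors. Solving E = n/(n + p*log2(p)) for n gives
    n = E/(1-E) * p * log2(p)."""
    return E / (1.0 - E) * p * math.log2(p)

# Problem size required to stay at 80% efficiency as p grows:
required = {p: iso_n(0.8, p) for p in (16, 256, 4096)}
```

The shape of iso_n in p is the scalability verdict: a near-linear growth is favorable, while a steep superlinear growth means the application saturates quickly.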
Finally, research is needed to develop tools for validating and refining the performance models. Such tools include hardware performance monitors, memory monitors, and simulators for the entire memory hierarchy.
The Breakdown of Performance Modeling
The industrial speaker emphasized the increasing difficulty of conducting adequate performance modeling. Indeed, the situation was characterized as being on the point of breakdown.
The difficulty stems from the following situation. Industry is asked to measure (or predict) the performance of benchmark applications provided by the customer. The performance of the supplied customer benchmark is projected by industry based on the performance of well-defined, standard, but isolated core algorithms. Such projections are subject to erroneous performance estimates, unaccounted-for large-system effects, and undetected system errors and bugs.
The discussions following this talk focused on the need to develop a type of ``performance calculus'' that can accurately predict the performance of full applications from the synergism of core algorithms. Specifically, research is needed to identify the interrelationship of the various core algorithms used in the same application. The applications requiring high-performance parallel systems are often very complex, entailing core algorithms from various engineering disciplines.
For example, in research on controlled magnetic fusion, an I/O traffic of 2.7 petabytes to disk in 8.1 hours of execution time is anticipated for magnetofluid calculations, with 1,200 grid points and 36,000 modes on a Petaflops machine. File sizes of 1.6 gigabytes will be written to disk every time step, for an aggregate I/O throughput of 93 gigabytes/sec. This I/O capacity needs to be accommodated in the design of a Petaflops machine.
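The aggregate throughput figure can be checked directly from the quoted volume and run time:

```python
# Check of the magnetofluid I/O figures quoted above:
# 2.7 petabytes written to disk over 8.1 hours of execution.
total_bytes = 2.7e15
seconds = 8.1 * 3600

throughput_gb_per_s = total_bytes / seconds / 1e9
print(round(throughput_gb_per_s))  # prints 93 -- gigabytes/sec, as stated
```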
As another example, in research on automatic test pattern generation (the process of generating a set of input patterns sufficient to detect all physical defects that may be present in a digital device after it is manufactured), most algorithms are latency limited. Scientists would like access to a machine that is able to pass medium-sized packets of data (approximately 1--10 kilobytes) with very low latencies regardless of the number of processors.
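``Latency limited'' can be made quantitative with the same alpha + m/beta transfer model used above: for 1--10 kilobyte packets, the fixed per-message cost dominates. The latency and bandwidth values below are assumed for illustration:

```python
# Why medium-sized packets are latency limited, under assumed values:
# 10-microsecond per-message latency, 1 GB/s link bandwidth.
LATENCY_S = 10e-6
BANDWIDTH_B_PER_S = 1e9

def transfer_time(message_bytes):
    """Transfer time = fixed latency + size/bandwidth."""
    return LATENCY_S + message_bytes / BANDWIDTH_B_PER_S

for size in (1_000, 10_000):
    frac = LATENCY_S / transfer_time(size)
    # At 1 KB, ~91% of the transfer time is fixed latency; at 10 KB, ~50%.
```

Under such a regime, adding bandwidth barely helps; only lower per-message latency does, which is exactly the machine property the scientists asked for.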
With specific regard to uniprocessor performance, questions were raised concerning the application and the processor architecture. In terms of the application, the following factors were identified as important:
2. Design tools for validating and refining the performance models; these tools include hardware performance monitors and memory monitors for the entire memory hierarchy.
3. Analyze the interrelationship of core algorithms taken from various disciplines and used in the same application.
4. Devise a methodology or technique for using the detailed model of a core algorithm to predict the performance of a full application.
The PetaFLOPS Summer Workshop provided the most detailed examination to date of the opportunities for application of petaFLOPS-scale computing to important problems in science, engineering, information management, and commerce. It exposed the computational requirements and scaling properties of a number of programs and related these to the characteristics and capabilities of proposed architectures.
This section captures the principal findings of the workshop with brief explanations.
2. PetaFLOPS computer systems will be realizable through enhanced natural evolution of technology and systems architecture in 20 to 25 years using largely commercially available components. But processor architectures will require enhancements for broadest applicability at petaFLOPS scale, which include
3. Latency will be a major factor in determining performance efficiency. Latency management in hardware and software will be a pacing element in achieving sustained petaFLOPS performance for a broad range of real-world problems.
4. Memory capacity requirements vary dramatically across the array of important petaFLOPS applications.
6. PetaFLOPS architectures may be feasible in a decade, at least for domain-specific applications, if approaches other than those based on commercial microprocessor technology are pursued.
7. Proposed petaFLOPS architectures are insufficiently well defined and characterized to enable detailed applications studies to be conducted.
8. PetaFLOPS application programming will require formalisms incorporating the necessary constructs for better application-driven management of resources to circumvent latencies, expose rich forms of parallelism, and balance workloads.
The PetaFLOPS Applications Workshop determined the availability of important candidate applications for PetaFLOPS computing and validated the feasibility of implementing a petaFLOPS architecture in the twenty-year timeframe. Nevertheless, many questions remain concerning efficiency, programmability, cost, nearer-term opportunities, generality, and other issues related to the practical exploitation of petaFLOPS. Some recommendations central to advancing the state of petaFLOPS computing as it is currently conceived are provided in this section.
A preliminary list of limiting factors includes the following:
+ degree of parallelism,
+ nonparallel (scalar) regions,
+ the critical path length of the code,
+ the mix of instruction types (memory reference, data operation),
+ data movement and access patterns,
+ critical, low-latency areas for memory access,
+ critical, high-bandwidth requirements for memory access,
+ need for barrier synchronization,
+ data layout at run time and prediction of changes in data locality as computation proceeds,
+ overall scalability (presuming increases in memory size or CPU performance), and
+ need for 128-bit precision.
2. Support research into specific advanced architecture techniques:
4. Conduct point design studies of petaFLOPS architectures that are more detailed than the current architecture sketches. These point designs should be accompanied by simulators and possibly prototypes that are capable of running key target applications. Such designs should relax conventional assumptions concerning system structure and constituent elements, including the use of microprocessors, program counter-driven flow control, classic memory hierarchy, and logic/byte balance. The designs should be carried out by teams with expertise in both architecture and applications. Four different system types should be considered:
5. Develop performance models to determine suitability and predict performance across the array of architectures in support of identified petaFLOPS applications. The models should include ``applications performance models,'' which can help designers evaluate the need for so-called fat (high-performance CPUs, high memory) nodes, and ``machine models,'' which describe memory size, hierarchy, and management scheme; ability to multithread or overlap serial code execution; special instructions to reduce computation load; communications architecture and capabilities; and I/O architecture. The two types of model should be used together to determine the overall match of a particular architecture to a specific application.
6. Fund advanced system software research in runtime systems and innovative compiler-driven resource management heuristics. Basic issues include
7. Sponsor additional workshops in architecture, algorithms, and system software related to petaFLOPS computing to extend the basis of consideration and expand the research community. Issues to be addressed should include
8. Establish a standing petaFLOPS oversight committee to monitor ongoing research, assess progress, and recommend new initiatives.
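The pairing of applications performance models and machine models called for in recommendation 5 can be made concrete with a roofline-style estimate: the application model supplies flop and byte counts, the machine model supplies peak rates, and the predicted time is bounded by whichever resource saturates first. All numbers below are hypothetical:

```python
def predict_time(app, machine):
    """Combine an application model (flops, bytes moved) with a machine
    model (peak flop rate, memory bandwidth): execution time is bounded
    below by whichever resource saturates first."""
    compute_time = app["flops"] / machine["peak_flops"]
    memory_time = app["bytes"] / machine["bandwidth"]
    return max(compute_time, memory_time)

app = {"flops": 1e18, "bytes": 4e15}                  # hypothetical application
fat_node = {"peak_flops": 1e15, "bandwidth": 1e13}    # hypothetical machines
thin_node = {"peak_flops": 1e15, "bandwidth": 1e12}

# Fat nodes: compute bound (~1000 s); thin nodes: memory bound (~4000 s).
```

Comparing the two predictions is precisely the fat-node-versus-thin-node question the applications performance model is meant to answer.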
Additional information about the PetaFLOPS II workshop is available as follows: