A Fortran-P Programming Paradigm for Clusters of Shared-Memory Multiprocessors

Paul Woodward, Matthew O'Keefe, S.E. Anderson, Aaron Sawdey,
David Porter, Terence Parr, and B. Kevin Edgar
University of Minnesota &
Army High Performance Computing Research Center, Minneapolis, MN

Research Objective: To determine an appropriate Fortran-77 programming model, with directives and enhanced semantics, that will allow a future version of the Fortran-P precompiler to automatically generate efficient parallel code for self-similar algorithms targeted at clusters of shared-memory multiprocessors.

Methodology: A variety of experiments were performed on the 20-processor Silicon Graphics (SGI) Onyx machine in the AHPCRC's Graphics and Visualization Laboratory (GVL) to determine effective programming techniques for cache-based memory architectures and for shared-memory multiprocessing in this environment. Access to clusters of such machines was arranged through the courtesy of Silicon Graphics, and ideas for effective structuring of parallel Fortran code were tested on this special architecture, in which machine-to-machine latencies are relatively large but each machine can attack a relatively large fraction of the computational task. Silicon Graphics also provided access to its latest Power Challenge machines, which have faster processors that favor vector programming styles and a different cache organization, for testing of multitasking strategies.

Accomplishments: Techniques were devised that greatly improve cache memory performance for the PPM codes. These techniques involve interleaving data in memory and blocking the interleaved data arrays. The transformations alter the appearance of the Fortran code substantially, but they can be automated so that they need not be visible to the user. A small set of useful multitasking directives was also identified that allows the number of barrier synchronizations to be reduced, dramatically increasing multitasking efficiency. For large problems, effective means of domain decomposition and message passing were found to implement PPM computations on a cluster of shared-memory machines. This experimentation with PPM code implementations on an 18-processor SGI Power Challenge machine resulted in a demonstration program featured in the SGI booth at the SIGGRAPH exhibit, where an interactively steered fluid dynamical computation sustained 98.5 Mflops per MIPS R-8000 processor and achieved a 15.8-times multitasking speedup on 16 of these processors.

Significance: The computer architecture represented by Silicon Graphics' Challenge Array and Power Challenge Array products is becoming increasingly important in the HPCC arena. Development of a Fortran-P programming paradigm, and ultimately a precompiler, for this new architecture should therefore be of considerable benefit to the HPCC community. Essentially all high-performance computer architectures that are now popular or planned for the next two or three years can be viewed as subcases of the SMP Array architecture targeted in this project. For example, the Cray C90 can be viewed as an SMP array with only a single array element; the Cray T3D can be viewed either in this same way or as an SMP array with only a single processor in each SMP; the IBM SP-2, like any of the workstation clusters now in use, can be viewed as an SMP array with one CPU per SMP; and the Convex Exemplar, just now appearing on the scene, can be viewed as an SMP array. It is therefore possible that the Fortran-P programming paradigm we are developing could be used to write not only portable code, but portable efficient code for self-similar algorithms.

Future Plans: Development of the programming model will be extended to permit efficient execution of smaller problems and to handle more irregular problems, with load balancing across the network. Silicon Graphics Power Challenge Array systems, as well as IBM SP-series systems, will be used for this future work. The programming model should also work well on the Cray T3D, and we will make every effort to see that this is indeed the case.