Microbenchmarking and Performance Prediction for Parallel Computers
Abstract: Earlier work on this project by Saavedra and Smith evaluated the performance of sequential computers. That work presented (a) measurements of machines at the level of source language primitive operations; (b) analysis of standard benchmarks; (c) prediction of run times based on separate measurements of the machines and the programs; (d) analysis of the effectiveness of compiler optimizations; and (e) measurements of the performance and design of cache memories.
In this paper, we extend the earlier work to parallel computers. We describe a portable benchmarking suite and a performance prediction methodology that together accurately predict the run times of Fortran 90 programs running on supercomputers. The benchmarking suite measures the optimization capabilities of a given Fortran 90 compiler, the execution rates of abstract Fortran 90 operations, and the processing characteristics of the underlying architecture as exposed by compiler-generated code. To predict the run time of an arbitrary program, we combine our benchmark results with dynamic execution measurements of that program, and augment the resulting prediction with simple factors that account for overhead due to architecture-specific effects, such as remote reference latencies. We measure two supercomputers: a dedicated 128-node TMC CM-5, a distributed-memory multiprocessor, and a 4-node partition of a Cray YMP-C90, a tightly integrated shared-memory multiprocessor. Our measurements show that the performance of the YMP-C90 far outstrips that of the CM-5, owing to the quality of the compilers available and the architectural characteristics of each machine. To validate our prediction methodology, we predict the run times of five interesting kernels on these machines; nearly all of the predicted run times are within 50 percent of the actual run times, much closer than might be expected.
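The prediction step described above can be illustrated with a minimal sketch in Python. It is not the implementation used in this work. It assumes that the benchmark results reduce to one time per abstract operation, that the dynamic execution measurements reduce to one count per operation, and that the architecture-specific overhead can be applied as a single multiplicative factor. The names predict_runtime, relative_error, op_times, op_counts, and overhead_factor are hypothetical, as are all numeric values.

```python
def predict_runtime(op_times, op_counts, overhead_factor=1.0):
    """Predict run time as sum(count * time per operation) * overhead.

    op_times: seconds per abstract operation (assumed benchmark output).
    op_counts: dynamic execution counts per operation (assumed profile).
    overhead_factor: assumed multiplicative correction for
        architecture-specific effects such as remote reference latency.
    """
    missing = set(op_counts) - set(op_times)
    if missing:
        raise KeyError(f"no benchmark time for: {sorted(missing)}")
    base = sum(op_counts[op] * op_times[op] for op in op_counts)
    return base * overhead_factor


def relative_error(predicted, actual):
    """Relative prediction error as a fraction of the actual run time."""
    return abs(predicted - actual) / actual


# Hypothetical values for illustration only; not measurements.
example_times = {"add": 2e-9, "mul": 4e-9, "load": 1e-8}
example_counts = {"add": 1_000_000, "mul": 500_000, "load": 2_000_000}
example_prediction = predict_runtime(
    example_times, example_counts, overhead_factor=1.25
)
```

A prediction counts as within 50 percent of the actual run time when relative_error(predicted, actual) is at most 0.5.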