Hiding Communication Costs in Bandwidth-Limited Parallel FFT Computation
Abstract: This paper presents a novel computation schedule for FFT-type computations on a bandwidth-limited parallel computer. Using P processors, we are able to process an n-input FFT graph in the optimal time of n log n / P by carefully interleaving interprocessor communication steps with local computation. Our algorithm is suitable for both shared-memory and distributed memory machines and is analyzed in a simplification of the LogP model suitable for studying bandwidth-limited parallel machines. Our parallel FFT algorithm incorporates several techniques that have long been used by parallel programmers to reduce communication costs and our analysis provides theoretical justification for the success of these techniques in the context of highly structured computations like FFTs. At another level, our algorithm can be viewed as an optimal simulation of large butterfly networks on arbitrary machines (as modeled under LogP). Thus, we argue that computations thought to be inherently suited to butterfly networks can be executed with no loss in efficiency on arbitrary bandwidth-limited networks, given sufficient slack.