Exploiting Fine Grain Parallelism in Prolog
Technical Report Identifier: CSD-90-588
Abstract: The potential usefulness of Prolog has motivated attempts to achieve higher performance than the fastest sequential Prolog systems currently available. Several researchers have attempted to do this by exploiting parallelism. The fundamental tradeoff is exposing more parallelism versus reducing the overhead incurred by parallel execution. The high execution overhead of the complex memory management and task management schemes used in current parallel Prolog systems leads to the following questions: (1) What forms of parallelism can be effectively exploited with memory management and task management schemes that have low overhead? (2) How much speedup can be obtained by exploiting these forms of parallelism? (3) What architectural support is required to exploit these forms of parallelism?
The goals of this research are to provide answers to these questions, to design a Prolog system that automatically exploits parallelism in Prolog with low overhead memory management and task management schemes, and to demonstrate by means of detailed simulations that such a Prolog system can indeed achieve a significant speedup over the fastest sequential Prolog systems.
We achieve these goals by first identifying the large sources of overhead in parallel Prolog execution: side-effects caused by parallel tasks, choicepoints created by parallel tasks, task creation, task scheduling, task suspension and context switching.
We then identify a form of parallelism, called flow parallelism, that can be exploited with low overhead because parallel execution is restricted to goals that do not cause side-effects and do not create choicepoints. We develop a master-slave model of parallel execution that eliminates task suspension and context switching. The model uses program partitioning and task scheduling techniques that do not require task suspension and context switching to prevent deadlock. We identify architectural techniques to support the parallel execution model and develop the Flow Parallel Prolog Machine (FPPM) architecture and implementation.
Finally, we evaluate the performance of FPPM and investigate the design tradeoffs using measurements on a detailed, register-transfer level simulator. FPPM achieves an average speedup of about a factor of 2 (as much as a factor of 5 for some programs) over the current highest performance sequential Prolog implementation, the VLSI-BAM. The speedups over other parallel Prolog systems are much larger.