Lazy Threads: Compiler and Runtime Structures for Fine-Grained Parallel Programming
Abstract: Many modern parallel languages support dynamic creation of threads or require multithreading in their implementations. The threads describe the logical parallelism in the program. For ease of expression and better resource utilization, the logical parallelism in a program often exceeds the physical parallelism of the machine, leading to applications with many fine-grained threads. In practice, however, most logical threads need not be independent threads; they can instead run as sequential calls, which are inherently cheaper. The challenge is that one cannot, in general, predict which logical threads can be implemented as sequential calls. In lazy multithreading systems, each logical thread begins execution sequentially (with the attendant efficient stack management and direct transfer of control and data). Only if a thread truly must execute in parallel does it get its own thread of control.
This dissertation presents new implementation techniques for lazy multithreading systems on conventional machines and compares a variety of design choices. We develop an abstract machine that makes explicit the design decisions for achieving lazy multithreading. We introduce new options on each of the four axes of the design space: the storage model, the thread representation, the disconnection method, and the queueing mechanism. Stacklets, our new storage model, allow parallel calls to maintain the invariants of sequential calls. Thread seeds, our new thread representation, allow threads to be stolen without requiring thread migration or shared memory. Lazy-disconnect, our new disconnection method, does not restrict the use of pointers. Implicit and lazy queueing, our two new queueing mechanisms, eliminate the need for explicit bookkeeping. Additionally, we develop a core set of compilation techniques and runtime primitives that form the basis for an efficient implementation of any design point.
We have evaluated the different approaches by incorporating them into a compiler for an explicitly parallel extension of Split-C. We show that there exist points in the design space (e.g., stacklets, thread seeds, lazy-disconnect, and lazy queueing) for which fine-grained parallelism can be supported efficiently even on distributed memory machines, giving programmers the freedom to specify the parallelism in a program without concern for excessive overhead.