Data-Parallel Language for Correct and Efficient Sparse Matrix Codes
Technical Report Identifier: EECS-2011-142
December 16, 2011
Abstract: Sparse matrix formats encode very large numerical matrices with relatively few nonzeros. They are typically implemented using imperative languages, with emphasis on low-level optimization. Such implementations are far removed from the conceptual organization of the sparse format, which obscures the representation invariants. They are further specialized by hand for particular applications, machines and workloads. Consequently, they are difficult to write, maintain, and verify.
In this dissertation we present LL, a small functional language suitable for implementing operations on sparse matrices. LL supports nested list and pair types, which naturally represent compressed matrices. It provides a few built-in operators and has a unique compositional dataflow model. As a result, LL programs are often direct analogs of high-level dataflow diagrams. This correspondence is useful when implementing sparse format construction and algebraic kernels such as sparse matrix-vector multiplication (SpMV). Despite its limited expressiveness, LL can specify interesting operations on real-world sparse formats.
Next, we present a full functional verifier for sparse codes written in LL. We use a higher-order logic theorem prover, deploying standard simplification and introduction techniques. We use novel parametric predicates for tracking relationships between mathematical objects and their concrete representation. A simple heuristic tactic is used for automating proofs of different sparse formats. A qualitative analysis shows that our rule base exhibits good reusability and that the system easily extends to hierarchical formats.
Finally, we describe a compiler for LL programs that generates efficient, parallel C code. We systematically map high-level nested datatypes onto compact and efficient low-level types. This facilitates a straightforward, syntax-directed translation of code. Local optimizations improve the performance of nested loops and eliminate redundant memory accesses. A coarse-grained heuristic loop parallelization scheme produces data-parallel code for execution on shared-memory multicore machines. Empirical evaluation shows that the sequential throughput of generated SpMV code is within 0.94--1.09 of a handwritten implementation. Parallel execution is 1.82--1.98 faster on two cores and 2.44--3.59 faster on four cores.