High-level design of nested data parallelism in GHC
The NDP part of the compiler is made up of two components: a parallel array library and code vectorisation, a transformation which eliminates nested parallelism. The library defines the type of parallel arrays, supporting operations and typeclasses and a loop fusion framework. Crucially, the actual representation of a parallel array is determined by the type of its elements. For instance, [:Int:]
is just an array of unboxed Int
s whereas [:(a,b):]
is, essentially, a pair of arrays ([:a:],[:b:])
. Associated data types and type synonyms allow us to implement this entirely in the library, without having to modify the compiler. In contrast to this, code vectorisation is implemented as a Core-to-Core transformation in GHC. In order to be able to deal with higher-order functions in parallel contexts, we also perform closure conversion as part of code vectorisation.
This is a rough sketch of how the various components fit into the compiler pipeline:
Program Library
======= =======
Haskell
parallel arrays <--------------+
nested parallelism |
| |
* desugaring |
| |
v |
Core |
parallel arrays <--------------+--------------- Parallel arrays ([:.:])
nested parallelism | |
| | |
* vectorisation | |
| | |
v | |
Core | |
parallel arrays <--------------+ |
flat parallelism |
| |
* inlining |
| |
v |
Core v
unboxed arrays <------------------------------ Parallel operations
(parallel operations) on unboxed arrays
| |
* inlining |
| |
v |
Core v
distributed types <----+------------------------ Distributed types
unboxed arrays | |
(sequential operations) | +---------------+---------------+
| | | |
* fusion | v |
| +----------- Unboxed arrays |
v | (sequential) |
Core | |
distributed types <----+ |
unboxed arrays |
(sequential operations) |
| |
* inlining |
| |
v |
Core v
gang operations <---------------------------------------------------- Gangs
ByteArrays
|
* optimisation
|
v
Core
gang operations
|
* code generation
|
v
Object code
RTS with gangs
Haskell with nested parallelism | This is vanilla Haskell with the parallel array data type [:.:], array comprehensions, and collective array operations. |
---|---|
Core with nested parallelism | The result of normal desugaring; in particular, array comprehensions are eliminated. |
Core with flat parallelism | code vectorisation replaces nested parallelism by operations on flat arrays. |
Core with parallel operations on unboxed arrays | All operations on [:.:] are inlined, leaving only parallel operations on simple unboxed arrays. |
Core with distributed types and sequential unboxed arrays | Parallel operations on unboxed arrays, which are implemented in terms of distributed types and sequential array operations, are inlined. Fusion rules are applied to the result. |
Core with gang operations | Operations on distributed types and unboxed arrays are inlined, producing code which only makes use of gang operations and the standard libraries. It is optimised as usual. |
See the paper Data Parallel Haskell: a status report for an in-depth illustration of this architecture with concrete code of a running example at each transformation stage.