|
- Type system generally **Simon PJ**

- Inliner changes (incl CONLIKE) **Simon PJ**

- Data parallel Haskell: **Manuel**

- Runtime system, esp. message-passing changes. **Simon M**

- The [Threadscope](http://research.microsoft.com/threadscope) tool for visualising parallel execution was released. The tool is ripe for improvement in many ways; if you're interested in helping, let us know!

- Parallel GC; removing "steps". **Simon M**

- Threadscope? **Simon M**
|
|
### Runtime system work (SimonM)
|
There has been a lot of restructuring in the RTS over the past few months, particularly in the area of parallel execution. The biggest change is to the way "blackholes" work: these arise when one thread is evaluating a lazy computation (a "thunk"), and another thread or threads demands the value of the same thunk. Previously, all threads waiting for the result of a thunk were kept in a single global queue, which was traversed regularly. This led to two performance problems. Firstly, traversing the queue is O(n) in the number of blocked threads, and we recently encountered some benchmarks in which this was the bottleneck. Secondly, there could be a delay between completing a computation and waking up the threads that were blocked waiting for it. Fortunately, we found a design that solves both of these problems, while adding very little overhead.
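
The situation being described can be sketched from the Haskell side (the RTS machinery itself is C; the thunk size and `expensive` function here are made-up for illustration). Two threads force the same shared thunk: whichever forces it first turns it into a blackhole, and the other blocks until the value is ready.

```haskell
import Control.Concurrent
import Control.Exception (evaluate)

-- A hypothetical expensive pure workload, just for the demo.
expensive :: Int -> Int
expensive n = sum [1 .. n]

main :: IO ()
main = do
  -- One shared thunk; the first thread to force it "owns" it, and the
  -- thunk becomes a blackhole until the value is available.
  let thunk = expensive 1000000
  done <- newEmptyMVar
  -- Two threads demand the same thunk; the loser blocks on the
  -- blackhole (with the new design, on a per-thunk wait list rather
  -- than one global queue).
  _ <- forkIO (evaluate thunk >>= putMVar done)
  _ <- forkIO (evaluate thunk >>= putMVar done)
  r1 <- takeMVar done
  r2 <- takeMVar done
  print (r1 == r2)   -- both threads observe the same value
```

Compile with `ghc -threaded` and run with `+RTS -N2` to get genuinely concurrent evaluation.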
|
We also fixed another pathological performance case: when a large number of threads blocked on an MVar became unreachable at the same time, reaping all of those threads was an O(n<sup>2</sup>) operation. A new representation for the queue of threads blocked on an MVar solved this problem.
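
A minimal sketch of the pathological shape, with an assumed thread count just for illustration: many threads all parked on a single MVar. If the MVar (and with it every blocked thread) becomes unreachable at once, the old queue representation made reaping them quadratic in the number of threads.

```haskell
import Control.Concurrent
import Control.Monad (replicateM_)

-- Assumed size for the demo; the real benchmarks that exposed the
-- O(n^2) behaviour involved large numbers of blocked threads.
nThreads :: Int
nThreads = 1000

main :: IO ()
main = do
  mv <- newEmptyMVar :: IO (MVar ())
  -- Park many threads on one MVar; each blocks in takeMVar.
  replicateM_ nThreads (forkIO (takeMVar mv))
  threadDelay 100000   -- give the threads time to block
  putStrLn ("parked " ++ show nThreads ++ " threads")
  -- In the problematic scenario the only reference to mv is then
  -- dropped, making every blocked thread unreachable at the next GC;
  -- here main simply exits.
```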
|
At the same time, we rearchitected large parts of the RTS, moving from algorithms involving shared data structures and locking to a message-passing style. As things get more complex in the parallel RTS, using message passing has let us simplify some of the invariants and move towards having less shared state between the CPUs, which will improve scaling in the long run.
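
The RTS change is internal C code, but the idea can be illustrated in miniature: instead of several workers locking one shared structure, each owner has an inbox, and other workers post it messages. The `Msg` type and `render` function below are invented for the sketch.

```haskell
import Control.Concurrent
import Control.Concurrent.Chan

-- Hypothetical messages that one worker might post to another's inbox.
data Msg = Wakeup Int | Shutdown deriving (Eq, Show)

-- Pure handler: what the owner does with one message.
render :: Msg -> String
render (Wakeup n) = "wakeup " ++ show n
render Shutdown   = "shutting down"

-- The owner drains its inbox; nobody else touches its state directly.
worker :: Chan Msg -> IO ()
worker inbox = do
  m <- readChan inbox
  putStrLn (render m)
  case m of
    Shutdown -> return ()
    _        -> worker inbox

main :: IO ()
main = do
  inbox <- newChan
  done  <- newEmptyMVar
  _ <- forkIO (worker inbox >> putMVar done ())
  -- Other threads communicate only by posting messages.
  mapM_ (writeChan inbox) [Wakeup 1, Wakeup 2, Shutdown]
  takeMVar done
```

Because only the owning thread reads its inbox, there is no shared mutable state to lock, which mirrors the invariant-simplification the paragraph describes.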
|
The GC has seen some work too: the goal here is to enable each processor ("capability" in the internal terminology) to collect its private heap independently of the other processors. It turns out that this is quite tricky to achieve in the context of the current architecture, but we have made some progress in this direction by privatising more of the global state and simplifying the GC data structures by removing the concept of "steps", while keeping the simple aging policy that steps gave us previously.
|
### Nightly builds
|