This MR implements GHC proposal 313: Delimited continuation primops by adding native support for delimited continuations to the GHC RTS.
All things considered, the patch is relatively small. It almost exclusively consists of changes to the RTS; the compiler itself is essentially unaffected. The primops come with fairly extensive Haddock documentation, which can be found in
primops.txt.pp. For an overview of the implementation strategy, as well as a discussion of several of the finer details, see the Notes in
The implementation currently prioritizes simplicity over performance. In particular, every continuation is always stored as a single, large chunk of stack. If these chunks grow particularly large, it could result in poor performance, as the current implementation does not attempt to cleverly squeeze stack frames into the existing stack: it all must fit at once. I have made this decision after trying a few different approaches to make things more efficient, all of which were significantly more complicated (see the collapsed section below for more details). In the absence of many reviewers intimately familiar with this part of the RTS, I’ve decided to defer those optimizations to future improvements, if they prove desired.
The MR includes a few simple test cases. There are some tricky interactions, like stack overflow scenarios that remain without thorough testing, as I am not entirely sure how to reliably provoke them. If anyone has any concrete suggestions, I am open to trying them out.
Old notes on multi-chunk restores (deferred)
Performance consideration: multi-chunk restores
CONTINUATION closure is always one giant chunk of stack, even if it originated from arbitrarily many stack chunks. Furthermore, when it is restored, it’s always restored in one, huge go: it ensures there’s enough space in the current stack chunk to fit the entire chunk of captured frames. Since this can be arbitrarily large, an entire stack chunk may not be enough, so restoring a continuation may overflow multiple times in succession!
This strategy is technically handled okay by the RTS, as if multiple stack overflows occur back-to-back, the scheduler will allocate increasingly large stack chunks until the overflows stop. However, this is really quite exceptionally wasteful behavior, as it may result in arbitrarily many intermediate stack chunks being allocated that go completely unused.
Originally, I tried to implement a cleverer strategy that captured these especially-large continuations in multiple chunks, each of which could be restored in succession; this is the strategy taken by
AP_STACK closures. Unfortunately, getting this right is extremely subtle, for two reasons:
It opens up the possibility of an async exception being delivered partway through a continuation restore. This is not a huge deal for
AP_STACKclosures, which are pure thunks and can’t manipulate the async exception masking state, anyway, but that is not true of continuations, so this must never happen.
If the continuation must be restored across multiple stack chunks, we must take care to ensure the split does not happen in the “middle” of a stack frame. That is, if a stack frame includes a payload, we must not put some of it in the lower stack chunk and some of it in the upper stack chunk, as that would prevent the GC from walking the stack.
Both of these issues are solvable, but doing so is tricky. Requirement 1 heavily encourages keeping each
CONTINUATION closure as a single, self-contained unit, as we need to be able to restore it in one go, but requirement 2 encourages breaking it up into multiple chunks (like
AP_STACK thunks are) so that we can use a fast
memcpy to restore the stack frames rather than have to walk them.
In theory, there is no obstacle to addressing both of those points. In practice, the fact that continuation restores are implemented in Cmm and that stack overflows must yield to the scheduler meant I found myself picking between an inefficient implementation (somewhat defeating the point of implementing multi-chunk restores) or a significantly more complicated one. Therefore, the current MR just does the hopelessly naïve thing, which is far simpler. I would appreciate feedback on whether or not this naïve approach is considered acceptable for an initial implementation (which can be optimized further later), or if it’s worth doing something cleverer up-front.