Luite Stegeman requested to merge luite/ghc:ghci-unboxed-tuples into master Oct 11, 2020

This is a proposed change to add support for unboxed tuples in GHCi, fixing #1257 (closed)

The implementation works (only tested on linux x86_64), but code, tests, and documentation aren't quite complete and polished yet.

I'm hoping to get some feedback to see if I should make any changes or to find potential issues to be aware of.

The problem

GHC has a native calling convention that uses (STG) registers for some function arguments and return values, with the remainder spilled to the STG stack. The GHCi bytecode interpreter only uses the stack. Running interpreted code often involves calling native functions or primops, requiring marshalling of arguments and return values.

All lifted values have the same calling convention, i.e. the same registers are used regardless of the underlying type. This makes conversion between interpreted and native code simple. Unboxed types are different, for example a Double# value would use a different register than a Word#. GHCi supports a limited set of unboxed types, each of which is specifically supported with a pair of functions that handles converting between the native calling convention and the interpreter. For example stg_ctoi_D1 / stg_ret_d for Double# values passed in register D1.

Unboxed tuples add an unlimited number of variations to the register/stack usage. A working solution must at least do the following:

convert the tuple between everything-on-stack (interpreter) and registers-and-stack (native)
describe the data in the tuple for the storage manager.

How it works

Tuple on the stack

The layout for the tuple in the interpreter is based on the native calling convention (even for tuples that stay inside the interpreter): All elements that are on the stack in the native calling convention stay in the same place. The remaining elements are ordered by the register in which they would be passed (which may be different than the order in the source code). Each element is padded to machine word size.

Implementation

Two new stack frames, stg_ctoi_t (native->interpreter) and stg_ret_t (interpreter->native) handle the conversion of tuples, both have type RET_BCO.

The implementation adds new bytecode instructions PUSH_ALTS_T and RETURN_T, and two pieces of data:

tuple_info: a word describing the stack/register usage of the tuple, used by stg_ctoi_t/stg_ret_t
tuple_BCO: a BCO that returns the tuple in the interpreter. mainly used for its bitmap, describing the tuple contents for the storage manager.

Returning a tuple

A RETURN_T instruction is used to return a tuple, after pushing all the elements to the stack, followed by the tuple_info and tuple_BCO:

<push all tuple elements>
PUSH_UBX tuple_info
PUSH_BCO tuple_BCO
RETURN_T

The only effect of RETURN_T is that the stg_ret_t_info header is pushed to the top of the stack.

After this, the stack (growing down) looks as follows:

<next stack frame>
<tuple elements>
tuple_info
tuple_BCO
stg_ret_t_info <- Sp

The bitmap in tuple_BCO describes the content of the tuple on the stack plus the tuple_info (non-pointer) slot.

The interpreter then checks whether the next frame on the stack is also interpreted. If not, it jumps to the scheduler to continue. stg_ret_t then moves the tuple to the correct registers and jumps to the (compiled) continuation.

Receiving a tuple

PUSH_ALTS_T is used to push a continuation that receives a tuple.

PUSH_ALTS_T cont_BCO tuple_info tuple_BCO
<function call here>

Upon executing the PUSH_ALTS_T instruction, the interpreter pushes the continuation, the tuple_BCO and tuple_info to the stack, followed by the stg_ctoi_t_info header:

<next stack frame>
<other stuff in cont_BCO stack frame>
tuple_BCO
tuple_info
cont_BCO
stg_ctoi_t_info <- Sp

The whole stack frame is described by the bitmap in cont_BCO, including the tuple_info (non-pointer) and tuple_BCO (pointer) slots.

When a tuple is returned to stg_ctoi_t, the stack looks as follows:

<next stack frame>
<other stuff in cont_BCO stack frame>
tuple_BCO
tuple_info
cont_BCO
stg_ctoi_t_info
spilled_tuple_elem_1
spilled_tuple_elem_2 <- Sp

stg_ctoi_t then goes to work, using tuple_info to determine the correct registers to save on the stack:

<other stuff on stack>
tuple_BCO
tuple_info
cont_BCO
stg_ctoi_t_info
spilled_tuple_elem_1
spilled_tuple_elem_2
saved_reg_tuple_elem_1
saved_reg_tuple_elem_2
saved_reg_tuple_elem_3
tuple_info
tuple_BCO
stg_ret_t <- Sp

stg_ctoi_t then jumps to the scheduler to make the switch to the interpreter. The interpreter executes tuple_BCO, which contains a RETURN_T instruction. The next frame (stg_ctoi_t) is of type RET_BCO, and the interpreter continues with the original continuation cont_BCO.

In reality, GHCs native calling convention makes stg_ctoi_t a little more complicated, and we need some helper functions to adjust Sp to a known location. See StgMiscClosures.cmm and Interpreter.c for more details.

Limitations

Unboxed sums still aren't supported. These would be converted to tuples by the Unarise pass. Perhaps we should change the bytecode generator to generate bytecode from STG after it has been unarised?
We need helper functions to adjust Sp based on the number of words spilled to the stack when returning a tuple in the native calling convention (see StgMiscClosures.cmm). That creates a limit on tuple sizes, that can only be raised by rebuilding GHC. Perhaps we can find some way around it, possibly by changing the calling convention itself (always having Sp pointing to the next frame, in our case stg_ctoi_t, when returning something).

Variations/alternatives

tuple_info and tuple_BCO are always used in pairs. we could move tuple_info into the BCO struct to save some stack space. The only requirement is that the Cmm implementation of stg_ctoi_t and stg_ret_t can access it.
Currently the stack slots of tuple_info and tuple_BCO are both part of the stack frame described by cont_BCO. We could make it a separate stack frame.

Edited Jan 25, 2021 by Luite Stegeman

WIP: Add support for unboxed tuples in the GHCi bytecode interpreter