WIP: Add support for unboxed tuples in the GHCi bytecode interpreter
This is a proposed change to add support for unboxed tuples in GHCi, fixing #1257 (closed)
The implementation works (only tested on linux x86_64), but code, tests, and documentation aren't quite complete and polished yet.
I'm hoping to get some feedback to see if I should make any changes or to find potential issues to be aware of.
The problem
GHC has a native calling convention that uses (STG) registers for some function arguments and return values, with the remainder spilled to the STG stack. The GHCi bytecode interpreter only uses the stack. Running interpreted code often involves calling native functions or primops, requiring marshalling of arguments and return values.
All lifted values have the same calling convention, i.e. the same registers are used regardless of the underlying type. This makes conversion between interpreted and native code simple. Unboxed types are different, for example a Double#
value would use a different register than a Word#
. GHCi supports a limited set of unboxed types, each of which is specifically supported with a pair of functions that handles converting between the native calling convention and the interpreter. For example stg_ctoi_D1
/ stg_ret_d
for Double#
values passed in register D1
.
Unboxed tuples add an unlimited number of variations to the register/stack usage. A working solution must at least do the following:
- convert the tuple between everything-on-stack (interpreter) and registers-and-stack (native)
- describe the data in the tuple for the storage manager.
How it works
Tuple on the stack
The layout for the tuple in the interpreter is based on the native calling convention (even for tuples that stay inside the interpreter): All elements that are on the stack in the native calling convention stay in the same place. The remaining elements are ordered by the register in which they would be passed (which may be different than the order in the source code). Each element is padded to machine word size.
Implementation
Two new stack frames, stg_ctoi_t
(native->interpreter) and stg_ret_t
(interpreter->native) handle the conversion of tuples, both have type RET_BCO
.
The implementation adds new bytecode instructions PUSH_ALTS_T
and RETURN_T
, and two pieces of data:
-
tuple_info
: a word describing the stack/register usage of the tuple, used bystg_ctoi_t
/stg_ret_t
-
tuple_BCO
: a BCO that returns the tuple in the interpreter. mainly used for its bitmap, describing the tuple contents for the storage manager.
Returning a tuple
A RETURN_T
instruction is used to return a tuple, after pushing all the elements to the stack, followed by the tuple_info
and tuple_BCO
:
<push all tuple elements>
PUSH_UBX tuple_info
PUSH_BCO tuple_BCO
RETURN_T
The only effect of RETURN_T
is that the stg_ret_t_info
header is pushed to the top of the stack.
After this, the stack (growing down) looks as follows:
<next stack frame>
<tuple elements>
tuple_info
tuple_BCO
stg_ret_t_info <- Sp
The bitmap in tuple_BCO
describes the content of the tuple on the stack plus the tuple_info
(non-pointer) slot.
The interpreter then checks whether the next frame on the stack is also interpreted. If not, it jumps to the scheduler to continue. stg_ret_t
then moves the tuple to the correct registers and jumps to the (compiled) continuation.
Receiving a tuple
PUSH_ALTS_T
is used to push a continuation that receives a tuple.
PUSH_ALTS_T cont_BCO tuple_info tuple_BCO
<function call here>
Upon executing the PUSH_ALTS_T
instruction, the interpreter pushes the continuation, the tuple_BCO
and tuple_info
to the stack, followed by the stg_ctoi_t_info
header:
<next stack frame>
<other stuff in cont_BCO stack frame>
tuple_BCO
tuple_info
cont_BCO
stg_ctoi_t_info <- Sp
The whole stack frame is described by the bitmap in cont_BCO
, including the tuple_info
(non-pointer) and tuple_BCO
(pointer) slots.
When a tuple is returned to stg_ctoi_t
, the stack looks as follows:
<next stack frame>
<other stuff in cont_BCO stack frame>
tuple_BCO
tuple_info
cont_BCO
stg_ctoi_t_info
spilled_tuple_elem_1
spilled_tuple_elem_2 <- Sp
stg_ctoi_t
then goes to work, using tuple_info
to determine the correct registers to save on the stack:
<other stuff on stack>
tuple_BCO
tuple_info
cont_BCO
stg_ctoi_t_info
spilled_tuple_elem_1
spilled_tuple_elem_2
saved_reg_tuple_elem_1
saved_reg_tuple_elem_2
saved_reg_tuple_elem_3
tuple_info
tuple_BCO
stg_ret_t <- Sp
stg_ctoi_t
then jumps to the scheduler to make the switch to the interpreter. The interpreter executes tuple_BCO
, which contains a RETURN_T
instruction. The next frame (stg_ctoi_t
) is of type RET_BCO
, and the interpreter continues with the original continuation cont_BCO
.
In reality, GHCs native calling convention makes stg_ctoi_t
a little more complicated, and we need some helper functions to adjust Sp
to a known location. See StgMiscClosures.cmm
and Interpreter.c
for more details.
Limitations
-
Unboxed sums still aren't supported. These would be converted to tuples by the
Unarise
pass. Perhaps we should change the bytecode generator to generate bytecode from STG after it has been unarised? -
We need helper functions to adjust
Sp
based on the number of words spilled to the stack when returning a tuple in the native calling convention (seeStgMiscClosures.cmm
). That creates a limit on tuple sizes, that can only be raised by rebuilding GHC. Perhaps we can find some way around it, possibly by changing the calling convention itself (always havingSp
pointing to the next frame, in our casestg_ctoi_t
, when returning something).
Variations/alternatives
-
tuple_info
andtuple_BCO
are always used in pairs. we could movetuple_info
into the BCO struct to save some stack space. The only requirement is that the Cmm implementation ofstg_ctoi_t
andstg_ret_t
can access it. - Currently the stack slots of
tuple_info
andtuple_BCO
are both part of the stack frame described bycont_BCO
. We could make it a separate stack frame.