Vector registers are not correctly saved/restored in certain situations
The following code is a cause for concern:
```haskell
realArgRegsCover :: Platform -> [GlobalReg]
realArgRegsCover platform
  | passFloatArgsInXmm platform
  = realVanillaRegs platform ++
    realLongRegs    platform ++
    realDoubleRegs  platform
```
This function computes which registers might need to be saved/restored in situations in which we lack liveness information.
Some observations:
- We only need to save/restore registers which are used for argument passing (other registers are only used locally in Cmm). This is the "args" part of `realArgRegsCover`. On X86_64, for floating-point values and SIMD vectors, we use Cmm registers `XMM1`, ..., `XMM6` for argument passing. Taking into account #25156 (closed), this corresponds to X86_64 registers `%xmm1`, ..., `%xmm6` (notably, we don't use `%xmm0`).
- We currently only save/restore the lower 64 bits of any xmm register.
This means that we can inadvertently drop the higher part of xmm/ymm/zmm registers used for argument passing, as we don't properly save and restore their entire contents.
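To make the failure mode concrete, here is a small self-contained Haskell model (hypothetical names, not GHC code): a "register" holds 128 bits as two `Word64` halves, but the save/restore path only spills the low half, so the upper half of any vector value is silently zeroed on the round trip.

```haskell
import Data.Word (Word64)

-- A toy model of an xmm register: low and high 64-bit halves.
data Xmm = Xmm { lowBits :: Word64, highBits :: Word64 }
  deriving (Eq, Show)

-- Mirrors the current behaviour: only the low 64 bits are spilled.
save64 :: Xmm -> Word64
save64 = lowBits

-- On restore, the high half is simply not recovered.
restore64 :: Word64 -> Xmm
restore64 lo = Xmm lo 0

main :: IO ()
main = do
  let before = Xmm 0xAAAA 0xBBBB          -- a 128-bit vector value
      after  = restore64 (save64 before)  -- round-trip through the frame
  print (before == after)                 -- False: the upper half was dropped
```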
`realArgRegsCover` is used in the following places:

- `GHC.StgToCmm.Foreign.{emitRestoreRegs,emitSaveRegs}`, which is used solely when returning to a stack underflow frame (see `stg_stack_underflow_frame` in `rts/StgMiscClosures`).
- In handwritten Cmm which uses `[*]`, which indicates that all registers should be considered live:
  - `rts/HeapStackCheck`: `__stg_gc_fun`
  - `rts/StgMiscClosures`: `{stg_orig_thunk_info_frame,stg_stack_underflow_frame,stg_restore_cccs}`
- In the bytecode interpreter, e.g. `PUSH_ARG_REGS`, `POP_ARG_REGS`:
  - `rts/StgMiscClosures`: `{stg_ctoi_t,stg_ret_t,stg_primcall}`
Ignoring the bytecode interpreter for now (as it has no support for SIMD vectors), all these places might cause us to drop the upper part of xmm/ymm/zmm registers on the floor.
I don't have a reproducer which demonstrates this bug yet, but it should not be particularly difficult to cook one up. This is a live bug with all backends that support SIMD vectors (currently only LLVM, but the X86 NCG in !12860 is also affected).
It's not clear to me how to fix this. The issue is that we can't, e.g., unconditionally spill the whole vector register, because it depends on which instructions are supported on the architecture on which the RTS is running.
Here are a few possibilities for fixing this:
Plan A: spill the whole register
With this plan, in the RTS we would first do a runtime check of the architecture the RTS is currently running on, and which instructions it supports. If the architecture supports 128-bit vectors, unconditionally load/restore 128-bit wide registers; if it supports 256-bit vectors, unconditionally load/restore 256-bit wide registers, etc.
This has a few issues:
- it is a bit wasteful, as most of the time we won't be using the full register,
- it's not clear how to achieve this with the LLVM backend; if we annotate the jump with ZMM registers, the LLVM backend will produce nonsense if AVX512F is not supported (with the NCG we can fall back the load/store instructions to YMM or XMM width if ZMM/YMM is not supported).
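Plan A's startup-time dispatch could be sketched as follows. This is a hypothetical Haskell illustration (none of these names exist in GHC): a capability probe run once at RTS startup picks the spill width, and all save/restore code then uses that width unconditionally.

```haskell
-- Vector width the save/restore code will use for every spill.
data VecWidth = V128 | V256 | V512 deriving (Eq, Ord, Show)

-- Stand-in for a real runtime capability probe (e.g. a CPUID check
-- for AVX2 / AVX-512F on x86-64).
detectVecWidth :: Bool -> Bool -> VecWidth
detectVecWidth hasAvx512f hasAvx2
  | hasAvx512f = V512   -- unconditionally spill full zmm registers
  | hasAvx2    = V256   -- spill ymm registers
  | otherwise  = V128   -- baseline: spill xmm registers

-- Bytes saved per vector argument register under Plan A.
spillBytes :: VecWidth -> Int
spillBytes V128 = 16
spillBytes V256 = 32
spillBytes V512 = 64

main :: IO ()
main = print (spillBytes (detectVecWidth False True))  -- AVX2-only machine: 32
```

The wastefulness noted above is visible here: an AVX-512 machine pays 64 bytes per register even when only a scalar double is live.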
Plan B: spill what is being used in the register
In plan B, we would somehow keep track of what things are live in registers. Considering the example of a stack underflow frame, when we jump to the stack underflow frame we would need to be able to retrieve from somewhere (in a register or on the stack) the information about how much to save/restore. Then we could branch on that information and jump to code that has been conditionally compiled with the right SIMD support (much like I am already doing in !12860 for `AutoApply.cmm`).
It's not currently clear to me where this information would be stored and who would be responsible for storing it there, but this would also incur an overhead. When placing a stack underflow frame, the RTS could inspect the function that we will jump to upon popping the stack underflow frame to see which vector registers it uses.
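A minimal sketch of Plan B's per-frame liveness record, in hypothetical Haskell (the frame layout and names are assumptions, not GHC's actual data structures): the frame stores, for each argument register, how much of it is live, and the restore code branches on that record.

```haskell
-- How much of a vector argument register is live at this frame.
data LiveWidth = Live64 | Live128 | Live256 deriving (Eq, Show)

-- Hypothetical per-frame record: one entry per live argument
-- register, indexed 1..6 for xmm1..xmm6.
type FrameInfo = [(Int, LiveWidth)]

-- Bytes the restore code must copy back for one register.
restoreBytes :: LiveWidth -> Int
restoreBytes Live64  = 8
restoreBytes Live128 = 16
restoreBytes Live256 = 32

-- Total bytes the restore path would copy for this frame; a real
-- implementation would instead branch to width-specialised code.
totalRestore :: FrameInfo -> Int
totalRestore info = sum [restoreBytes w | (_, w) <- info]

main :: IO ()
main = print (totalRestore [(1, Live256), (2, Live64)])  -- 32 + 8 = 40
```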
Plan C: don't touch XMM registers
The load/store code is only used in certain pieces of hand-written Cmm in the RTS. So we could require that any intervening code does not use any of the XMM registers used for argument passing.
For example, we save and restore registers around the call to the C function `threadStackUnderflow` in the Cmm code for `stg_stack_underflow_frame`. If we can somehow enforce that `threadStackUnderflow` does not touch `xmm1`, ..., `xmm6`, then we can avoid saving and restoring any xmm registers altogether.
This is a very risky option, as it will silently lead to incorrect runtime results if the intervening code does modify any of those registers.
Plan D: only spill 64 bits, and don't allow SIMD registers to be used for argument passing
If SIMD vectors must always be passed on the stack, then we only need to save/restore the lower 64 bits of xmm/ymm/zmm registers.
However, I think this would be throwing out the baby with the bathwater.