Meta-issue for SIMD support in the native code generator
This ticket tracks further items of work relating to SIMD support in the GHC native code generators (X86, AArch64, PPC) after the main branch of work lands (!12860), to facilitate collaboration.
These work items broadly correspond to the remaining SIMD NCG TODO comments in the GHC codebase.
Separate tickets will be created for the individual issues. Contributors interested in helping out should declare their interest in one of the following tasks, and they will be assigned the corresponding ticket.
In roughly increasing order of difficulty:
- Add support for negation of integer vectors: change the `MO_VS_Neg` case in `GHC.CmmToAsm.X86.CodeGen.getRegister'`.
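  As a point of reference (my example, not from the ticket): baseline SSE2 has no single-instruction integer vector negation, so C compilers typically subtract from an all-zeros register, and the NCG could do the same:

  ```c
  #include <emmintrin.h>  /* SSE2 */

  /* 0 - x: on Godbolt this typically compiles to pxor + psubd. */
  __m128i negate_epi32(__m128i x)
  {
      return _mm_sub_epi32(_mm_setzero_si128(), x);
  }
  ```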
- Add support for extracting 32, 16 and 8 bit wide integer values from a vector: change the `MO_V_Extract` case in `GHC.CmmToAsm.X86.CodeGen.getRegister'`. Look at Godbolt output for Intel SIMD intrinsics such as `_mm_extract_epi{8,16,32}`.
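  A minimal C probe for Godbolt (my example; compile with `-msse4.1`). Note that the lane index must be an immediate, which matches the case where the Cmm index is a literal:

  ```c
  #include <smmintrin.h>  /* SSE4.1; _mm_extract_epi16 only needs SSE2 */

  int extract8 (__m128i v) { return _mm_extract_epi8 (v, 3); }  /* pextrb */
  int extract16(__m128i v) { return _mm_extract_epi16(v, 3); }  /* pextrw */
  int extract32(__m128i v) { return _mm_extract_epi32(v, 3); }  /* pextrd */
  ```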
- Add support for inserting 32, 16 and 8 bit wide integer values into a vector: change `vector_int_insert_sse` in `GHC.CmmToAsm.X86.CodeGen`. Look at Godbolt output for Intel SIMD intrinsics such as `_mm_insert_epi{8,16,32}`.
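  The corresponding insertion probe (my example; compile with `-msse4.1`):

  ```c
  #include <smmintrin.h>  /* SSE4.1; _mm_insert_epi16 only needs SSE2 */

  __m128i insert8 (__m128i v, int x) { return _mm_insert_epi8 (v, x, 3); }  /* pinsrb */
  __m128i insert16(__m128i v, int x) { return _mm_insert_epi16(v, x, 3); }  /* pinsrw */
  __m128i insert32(__m128i v, int x) { return _mm_insert_epi32(v, x, 3); }  /* pinsrd */
  ```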
- Add support for broadcasting 32, 16 and 8 bit wide integer values to vectors: change the `MO_V_Broadcast` case in `GHC.CmmToAsm.X86.CodeGen.getRegister'`. Look at Godbolt output for Intel SIMD intrinsics such as `_mm_set1_epi{8,16,32}`.
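  A probe for the broadcast case (my example); without AVX2's `vpbroadcast*` these typically lower to a `movd` followed by a shuffle sequence:

  ```c
  #include <emmintrin.h>  /* SSE2 */

  __m128i bcast8 (char  x) { return _mm_set1_epi8 (x); }
  __m128i bcast16(short x) { return _mm_set1_epi16(x); }
  __m128i bcast32(int   x) { return _mm_set1_epi32(x); }  /* typically movd + pshufd */
  ```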
- Add support for arithmetic operations on integer vectors such as `MO_V_Add`, `MO_V_Mul`, `MO_VU_Quot` etc. This would involve changing `GHC.CmmToAsm.X86.CodeGen.getRegister'` to generate code for these `MachOp`s. (I expect that the main difficulty will be in handling all the different word sizes.)
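  For orientation (my example): addition and multiplication map to single instructions per word size, but x86 has no integer vector division instruction, so `MO_VU_Quot` will likely need to be scalarized (extract, divide, re-insert):

  ```c
  #include <smmintrin.h>  /* SSE4.1 for _mm_mullo_epi32; the rest is SSE2 */

  __m128i add16(__m128i a, __m128i b) { return _mm_add_epi16  (a, b); }  /* paddw  */
  __m128i add32(__m128i a, __m128i b) { return _mm_add_epi32  (a, b); }  /* paddd  */
  __m128i mul16(__m128i a, __m128i b) { return _mm_mullo_epi16(a, b); }  /* pmullw */
  __m128i mul32(__m128i a, __m128i b) { return _mm_mullo_epi32(a, b); }  /* pmulld */
  ```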
- Improve efficiency of vector pack instructions: instead of emitting a bunch of vector insertions in a row, it should be possible to write more efficient code. This would involve only changing `GHC.StgToCmm.Prim.doVecPackOp`. I would recommend looking at Godbolt output for Intel SSE intrinsics such as `_mm_set_ps` and `_mm_set_pd`.
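  What compilers emit for these (my example); note that `_mm_set_ps` takes its arguments highest lane first:

  ```c
  #include <emmintrin.h>  /* SSE2 */

  /* Typically a short unpcklps (resp. unpcklpd) shuffle sequence,
     rather than a chain of insertions. */
  __m128  pack_floats (float a, float b, float c, float d) { return _mm_set_ps(d, c, b, a); }
  __m128d pack_doubles(double a, double b)                 { return _mm_set_pd(b, a); }
  ```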
- Improve the code we generate for shuffle primops. See `shuffleInstructions` in `GHC.CmmToAsm.X86.CodeGen`.
- Add support for integer vector shuffle operations, i.e. handling `MO_V_Shuffle` in `GHC.CmmToAsm.X86.CodeGen.getRegister'`. It will be a bit tricky to handle all of the cases, in particular shuffling 8 bit values.
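  For the 8-bit case, SSSE3's `pshufb` is the natural target (my example; an index byte with its high bit set zeroes that lane):

  ```c
  #include <tmmintrin.h>  /* SSSE3 */

  /* Reverse the sixteen bytes of v with a single pshufb. */
  __m128i reverse_bytes(__m128i v)
  {
      const __m128i idx = _mm_setr_epi8(15, 14, 13, 12, 11, 10, 9, 8,
                                         7,  6,  5,  4,  3,  2, 1, 0);
      return _mm_shuffle_epi8(v, idx);
  }
  ```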
- Constant folding for vectors. This would involve (at least):
  - adding cases for vector primops in `GHC.Core.Opt.ConstantFold`,
  - adding cases for vector `MachOp`s in `GHC.Cmm.Opt.cmmMachOpFoldM`,
  - updating `GHC.CmmToAsm.X86.CodeGen.getRegister' ... (CmmLit lit)`.
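  For comparison (my example, hedged): with optimizations enabled, GCC and Clang typically reduce an expression like the one below to a single 16-byte constant load from `.rodata`; the analogous folding at the Core and Cmm levels is what this task asks for:

  ```c
  #include <emmintrin.h>  /* SSE2 */

  __m128i folded(void)
  {
      /* Usually compiled to one load of the constant {5,5,5,5}. */
      return _mm_add_epi32(_mm_set1_epi32(2), _mm_set1_epi32(3));
  }
  ```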
- Add support for 128-bit wide vectors in the PowerPC backend using the VSX instruction set.
  - Requires adding the instructions to `GHC.CmmToAsm.PPC.Instr`, updating `regUsageOfInstr` to record the more precise format at which vector registers are used, and making use of these instructions to implement the vector `MachOp`s.
  - Will require updating the testsuite's `get_cpu_features` function to check for availability of these instructions.
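  A minimal VSX reference (my example; compile with e.g. `-mvsx` on powerpc64le and compare the Godbolt output):

  ```c
  #include <altivec.h>

  /* A 128-bit integer vector add; this compiles to a single
     vector add instruction. */
  vector signed int vsx_add(vector signed int a, vector signed int b)
  {
      return vec_add(a, b);
  }
  ```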
- Add support for 128-bit wide vectors in the AArch64 backend using the NEON instruction set.
  - Requires adding the instructions to `GHC.CmmToAsm.AArch64.Instr`, updating `regUsageOfInstr` to record the more precise format at which vector registers are used, and making use of these instructions to implement the vector `MachOp`s.
  - Will require updating the testsuite's `get_cpu_features` function to check for availability of these instructions.
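  A minimal NEON reference (my example; NEON is baseline on AArch64, so no extra flags are needed):

  ```c
  #include <arm_neon.h>

  int32x4_t neon_add(int32x4_t a, int32x4_t b) { return vaddq_s32(a, b); }  /* add v.4s */
  int32x4_t neon_neg(int32x4_t a)              { return vnegq_s32(a);     }  /* neg v.4s */
  ```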
- Add support for SIMD primops in test-primops. (This would first require adding floating-point primop support, which itself probably requires fixing the issues with floating-point support in GHC outlined in #8364.)
- Add support for SIMD registers in GHCi.
  - Requires careful consideration of functions such as `realArgRegsCover`, and probably finding a way of modifying `POP_ARG_REGS` so that we generate a jump annotated with exactly the right size of register: e.g. for passing YMM1 we want to annotate the jump with YMM1 (along with GP registers), not XMM1 or ZMM1. It will also be necessary to ensure that this Cmm is correctly compiled (e.g. passing the right flags to LLVM, as any compilation of e.g. YMM1 without `-mavx2` will lead to carnage). The bitset machinery in `mkNativeCallInfoSig` will also need to be updated.
- Add support for ymm/zmm registers in the X86 NCG. The basic task would simply be to add new constructors to `VirtualReg` for 256 and 512 bit wide vectors and use those in `GHC.CmmToAsm.X86.RegInfo.mkVirtualReg`, but implementing code generation for all of the `MachOp`s (and adding corresponding tests) would constitute the bulk of the work.
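  What 256-bit support would unlock (my example; compile with `-mavx2`):

  ```c
  #include <immintrin.h>

  /* An integer add on a ymm register: vpaddd ymm, ymm, ymm. */
  __m256i avx2_add(__m256i a, __m256i b)
  {
      return _mm256_add_epi32(a, b);
  }
  ```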