Integer division is likely to be extremely slow on ARM with integer-native backend
quotRemWord2
is a basic primitive operation for division of arbitrary-long Integer
. On x86_64 architecture it's just a single CPU instruction, but ARM architecture mandates full software emulation. However, the code emitted by GHC.StgToCmm.Prim
is very inefficient: AFAIU we unconditionally generate 64 blocks, each responsible to determine one bit of the output.
genericWordQuotRem2Op :: Platform -> GenericOp
genericWordQuotRem2Op platform [res_q, res_r] [arg_x_high, arg_x_low, arg_y]
= emit =<< f (widthInBits (wordWidth platform)) zero arg_x_high arg_x_low
where ty = cmmExprType platform arg_x_high
shl x i = CmmMachOp (MO_Shl (wordWidth platform)) [x, i]
shr x i = CmmMachOp (MO_U_Shr (wordWidth platform)) [x, i]
or x y = CmmMachOp (MO_Or (wordWidth platform)) [x, y]
ge x y = CmmMachOp (MO_U_Ge (wordWidth platform)) [x, y]
ne x y = CmmMachOp (MO_Ne (wordWidth platform)) [x, y]
minus x y = CmmMachOp (MO_Sub (wordWidth platform)) [x, y]
times x y = CmmMachOp (MO_Mul (wordWidth platform)) [x, y]
zero = lit 0
one = lit 1
negone = lit (fromIntegral (platformWordSizeInBits platform) - 1)
lit i = CmmLit (CmmInt i (wordWidth platform))
f :: Int -> CmmExpr -> CmmExpr -> CmmExpr -> FCode CmmAGraph
f 0 acc high _ = return (mkAssign (CmmLocal res_q) acc <*>
mkAssign (CmmLocal res_r) high)
f i acc high low =
do roverflowedBit <- newTemp ty
rhigh' <- newTemp ty
rhigh'' <- newTemp ty
rlow' <- newTemp ty
risge <- newTemp ty
racc' <- newTemp ty
let high' = CmmReg (CmmLocal rhigh')
isge = CmmReg (CmmLocal risge)
overflowedBit = CmmReg (CmmLocal roverflowedBit)
let this = catAGraphs
[mkAssign (CmmLocal roverflowedBit)
(shr high negone),
mkAssign (CmmLocal rhigh')
(or (shl high one) (shr low negone)),
mkAssign (CmmLocal rlow')
(shl low one),
mkAssign (CmmLocal risge)
(or (overflowedBit `ne` zero)
(high' `ge` arg_y)),
mkAssign (CmmLocal rhigh'')
(high' `minus` (arg_y `times` isge)),
mkAssign (CmmLocal racc')
(or (shl acc one) isge)]
rest <- f (i - 1) (CmmReg (CmmLocal racc'))
(CmmReg (CmmLocal rhigh''))
(CmmReg (CmmLocal rlow'))
return (this <*> rest)
genericWordQuotRem2Op _ _ _ = panic "genericWordQuotRem2Op"
There are far superior implementations of long division, e. g., udiv128by64to64default
in LLVM.