Skip to content

Integer division is likely to be extremely slow on ARM with integer-native backend

quotRemWord2 is a basic primitive operation for division of arbitrary-long Integer. On x86_64 architecture it's just a single CPU instruction, but ARM architecture mandates full software emulation. However, the code emitted by GHC.StgToCmm.Prim is very inefficient: AFAIU we unconditionally generate 64 blocks, each responsible to determine one bit of the output.

genericWordQuotRem2Op :: Platform -> GenericOp
genericWordQuotRem2Op platform [res_q, res_r] [arg_x_high, arg_x_low, arg_y]
    = emit =<< f (widthInBits (wordWidth platform)) zero arg_x_high arg_x_low
    where    ty = cmmExprType platform arg_x_high
             shl   x i = CmmMachOp (MO_Shl   (wordWidth platform)) [x, i]
             shr   x i = CmmMachOp (MO_U_Shr (wordWidth platform)) [x, i]
             or    x y = CmmMachOp (MO_Or    (wordWidth platform)) [x, y]
             ge    x y = CmmMachOp (MO_U_Ge  (wordWidth platform)) [x, y]
             ne    x y = CmmMachOp (MO_Ne    (wordWidth platform)) [x, y]
             minus x y = CmmMachOp (MO_Sub   (wordWidth platform)) [x, y]
             times x y = CmmMachOp (MO_Mul   (wordWidth platform)) [x, y]
             zero   = lit 0
             one    = lit 1
             negone = lit (fromIntegral (platformWordSizeInBits platform) - 1)
             lit i = CmmLit (CmmInt i (wordWidth platform))

             f :: Int -> CmmExpr -> CmmExpr -> CmmExpr -> FCode CmmAGraph
             f 0 acc high _ = return (mkAssign (CmmLocal res_q) acc <*>
                                      mkAssign (CmmLocal res_r) high)
             f i acc high low =
                 do roverflowedBit <- newTemp ty
                    rhigh'         <- newTemp ty
                    rhigh''        <- newTemp ty
                    rlow'          <- newTemp ty
                    risge          <- newTemp ty
                    racc'          <- newTemp ty
                    let high'         = CmmReg (CmmLocal rhigh')
                        isge          = CmmReg (CmmLocal risge)
                        overflowedBit = CmmReg (CmmLocal roverflowedBit)
                    let this = catAGraphs
                               [mkAssign (CmmLocal roverflowedBit)
                                          (shr high negone),
                                mkAssign (CmmLocal rhigh')
                                          (or (shl high one) (shr low negone)),
                                mkAssign (CmmLocal rlow')
                                          (shl low one),
                                mkAssign (CmmLocal risge)
                                          (or (overflowedBit `ne` zero)
                                              (high' `ge` arg_y)),
                                mkAssign (CmmLocal rhigh'')
                                          (high' `minus` (arg_y `times` isge)),
                                mkAssign (CmmLocal racc')
                                          (or (shl acc one) isge)]
                    rest <- f (i - 1) (CmmReg (CmmLocal racc'))
                                      (CmmReg (CmmLocal rhigh''))
                                      (CmmReg (CmmLocal rlow'))
                    return (this <*> rest)
genericWordQuotRem2Op _ _ _ = panic "genericWordQuotRem2Op"

There are far superior implementations of long division, e. g., udiv128by64to64default in LLVM.

To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information