timesWord2# is very slow on ARM

I've been investigating a rapid performance degradation of Text builder for decimal numbers from text-builder-linear package.

The stock Data.Text.Lazy.Builder.Int from text converts decimals to strings using quotRem n 100. Division is an extremely slow operation, but fortunately division by constants can be replaced by wide multiplication, see quote-quot, #9786, !8392. That's exactly what text-builder-linear does: instead of dividing by 100 it uses

quot100 (W# n) = case timesWord2# n 5165088340638674453## of
  (# hi, _ #) -> W# (uncheckedShiftRL#
    (plusWord# (uncheckedShiftRL# (minusWord# n hi) 1#) hi) 6#)

This code is ~3x faster than quotRem on x86_64. However, things change drastically on ARM, where it gets actually ~2x slower than quotRem, because GHC goes for full software emulation in GHC.StgToCmm.Prim:

genericWordMul2Op :: GenericOp
genericWordMul2Op [res_h, res_l] [arg_x, arg_y]
 = do platform <- getPlatform
      let t = cmmExprType platform arg_x
      xlyl <- liftM CmmLocal $ newTemp t
      xlyh <- liftM CmmLocal $ newTemp t
      xhyl <- liftM CmmLocal $ newTemp t
      r    <- liftM CmmLocal $ newTemp t
      -- This generic implementation is very simple and slow. We might
      -- well be able to do better, but for now this at least works.
      let topHalf x = CmmMachOp (MO_U_Shr (wordWidth platform)) [x, hww]
          toTopHalf x = CmmMachOp (MO_Shl (wordWidth platform)) [x, hww]
          bottomHalf x = CmmMachOp (MO_And (wordWidth platform)) [x, hwm]
          add x y = CmmMachOp (MO_Add (wordWidth platform)) [x, y]
          sum = foldl1 add
          mul x y = CmmMachOp (MO_Mul (wordWidth platform)) [x, y]
          or x y = CmmMachOp (MO_Or (wordWidth platform)) [x, y]
          hww = CmmLit (CmmInt (fromIntegral (widthInBits (halfWordWidth platform)))
                               (wordWidth platform))
          hwm = CmmLit (CmmInt (halfWordMask platform) (wordWidth platform))
      emit $ catAGraphs
             [mkAssign xlyl
                  (mul (bottomHalf arg_x) (bottomHalf arg_y)),
              mkAssign xlyh
                  (mul (bottomHalf arg_x) (topHalf arg_y)),
              mkAssign xhyl
                  (mul (topHalf arg_x) (bottomHalf arg_y)),
              mkAssign r
                  (sum [topHalf    (CmmReg xlyl),
                        bottomHalf (CmmReg xhyl),
                        bottomHalf (CmmReg xlyh)]),
              mkAssign (CmmLocal res_l)
                  (or (bottomHalf (CmmReg xlyl))
                      (toTopHalf (CmmReg r))),
              mkAssign (CmmLocal res_h)
                  (sum [mul (topHalf arg_x) (topHalf arg_y),
                        topHalf (CmmReg xhyl),
                        topHalf (CmmReg xlyh),
                        topHalf (CmmReg r)])]

However, ARM does provide UMULH instruction, returning the high 64 bits of the product of two 64-bit unsigned integers, which would be much faster.

The suggestion is

Provide an ARM-specific implementation for timesWord2# in GHC.StgToCmm.Prim, and/or
Add a new primop to return the high bits of the product only. The motivating case above actually does not use the lower half.

CC @AndreasK @1Jajen1

To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information