timesWord2# is very slow on ARM
I've been investigating a rapid performance degradation of Text
builder for decimal numbers from text-builder-linear
package.
The stock Data.Text.Lazy.Builder.Int
from text
converts decimals to strings using quotRem n 100
. Division is an extremely slow operation, but fortunately division by constants can be replaced by wide multiplication, see quote-quot
, #9786, !8392. That's exactly what text-builder-linear
does: instead of dividing by 100 it uses
quot100 (W# n) = case timesWord2# n 5165088340638674453## of
(# hi, _ #) -> W# (uncheckedShiftRL#
(plusWord# (uncheckedShiftRL# (minusWord# n hi) 1#) hi) 6#)
This code is ~3x faster than quotRem
on x86_64. However, things change drastically on ARM, where it gets actually ~2x slower than quotRem
, because GHC goes for full software emulation in GHC.StgToCmm.Prim
:
genericWordMul2Op :: GenericOp
genericWordMul2Op [res_h, res_l] [arg_x, arg_y]
= do platform <- getPlatform
let t = cmmExprType platform arg_x
xlyl <- liftM CmmLocal $ newTemp t
xlyh <- liftM CmmLocal $ newTemp t
xhyl <- liftM CmmLocal $ newTemp t
r <- liftM CmmLocal $ newTemp t
-- This generic implementation is very simple and slow. We might
-- well be able to do better, but for now this at least works.
let topHalf x = CmmMachOp (MO_U_Shr (wordWidth platform)) [x, hww]
toTopHalf x = CmmMachOp (MO_Shl (wordWidth platform)) [x, hww]
bottomHalf x = CmmMachOp (MO_And (wordWidth platform)) [x, hwm]
add x y = CmmMachOp (MO_Add (wordWidth platform)) [x, y]
sum = foldl1 add
mul x y = CmmMachOp (MO_Mul (wordWidth platform)) [x, y]
or x y = CmmMachOp (MO_Or (wordWidth platform)) [x, y]
hww = CmmLit (CmmInt (fromIntegral (widthInBits (halfWordWidth platform)))
(wordWidth platform))
hwm = CmmLit (CmmInt (halfWordMask platform) (wordWidth platform))
emit $ catAGraphs
[mkAssign xlyl
(mul (bottomHalf arg_x) (bottomHalf arg_y)),
mkAssign xlyh
(mul (bottomHalf arg_x) (topHalf arg_y)),
mkAssign xhyl
(mul (topHalf arg_x) (bottomHalf arg_y)),
mkAssign r
(sum [topHalf (CmmReg xlyl),
bottomHalf (CmmReg xhyl),
bottomHalf (CmmReg xlyh)]),
mkAssign (CmmLocal res_l)
(or (bottomHalf (CmmReg xlyl))
(toTopHalf (CmmReg r))),
mkAssign (CmmLocal res_h)
(sum [mul (topHalf arg_x) (topHalf arg_y),
topHalf (CmmReg xhyl),
topHalf (CmmReg xlyh),
topHalf (CmmReg r)])]
However, ARM does provide UMULH
instruction, returning the high 64 bits of the product of two 64-bit unsigned integers, which would be much faster.
The suggestion is
- Provide an ARM-specific implementation for
timesWord2#
inGHC.StgToCmm.Prim
, and/or - Add a new primop to return the high bits of the product only. The motivating case above actually does not use the lower half.