GHC 9.4.2 regresses being able to do math on aarch64

added needs triage label

I can confirm that this is a backend issue, apparently quite late into the pipeline—the generated Cmm for the miscompiled block is identical on x86_64. Here’s a somewhat nicer test case:

{-# OPTIONS_GHC -O1 #-}
module A where
import Data.Word

testF :: Word8 -> Word8 -> Word8
testF a b =
  if b == 0
    then a
    else (a `div` b) * b + (a `mod` b)
{-# NOINLINE testF #-}

module Main where
import A

main :: IO ()
main = putStrLn . show $ testF 217 161

This prints 217 on x86_64, but 122 on aarch64. Conveniently, the generated STG and Cmm for testF’s worker are both tiny:

A.$wtestF [InlPrag=NOINLINE]
  :: GHC.Prim.Word8# -> GHC.Prim.Word8# -> GHC.Prim.Word8#
[GblId, Arity=2, Str=<L><L>, Unf=OtherCon []] =
    {} \r [ww_s153 ww1_s154]
        case word8ToWord# [ww1_s154] of {
          __DEFAULT ->
              case quotWord8# [ww_s153 ww1_s154] of wild3_s156 [Occ=Once1] {
              __DEFAULT ->
              case remWord8# [ww_s153 ww1_s154] of wild1_s157 [Occ=Once1] {
              __DEFAULT ->
              case timesWord8# [wild3_s156 ww1_s154] of sat_s158 [Occ=Once1] {
              __DEFAULT -> plusWord8# [sat_s158 wild1_s157];
              };
              };
              };
          0## -> ww_s153;
        };

[A.$wtestF_entry() { //  [R3, R2]
         { info_tbls: [(c15p,
                        label: A.$wtestF_info
                        rep: HeapRep static { Fun {arity: 2 fun_type: ArgSpec 12} }
                        srt: Nothing)]
           stack_info: arg_space: 8
         }
     {offset
       c15p: // global
           _s154::I8 = %MO_XX_Conv_W64_W8(R3);
           _s153::I8 = %MO_XX_Conv_W64_W8(R2);
           if (%MO_UU_Conv_W8_W64(_s154::I8) != 0) goto c15n; else goto c15o;
       c15n: // global
           R1 = %MO_XX_Conv_W8_W64(_s153::I8 / _s154::I8 * _s154::I8 + _s153::I8 % _s154::I8);
           call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
       c15o: // global
           R1 = %MO_XX_Conv_W8_W64(_s153::I8);
           call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
     }
 },
 section ""data" . A.$wtestF_closure" {
     A.$wtestF_closure:
         const A.$wtestF_info;
 }]

This keeps the aarch64 assembly pretty small:

.section .text
	.balign 8
	.quad	8589934604
	.quad	0
	.long	14
	.long	0
	.globl A.$wtestF_info
.type A.$wtestF_info, @function
A.$wtestF_info:
_blk_c15p:
	mov w18, w24
	mov w17, w23
	ubfm x15, x18, #0, #7
	cbnz x15, _blk_c15n
_blk_c15o:
	mov x22, x17
	ldr x18, [ x20 ]
	br x18
_blk_c15n:
	uxtb w17, w17
	uxtb w18, w18
	udiv w15, w17, w18
	sxtb w15, w15
	sxtb w18, w18
	mul w15, w15, w18
	ubfm w15, w15, #0, #7
	udiv w14, w17, w18
	msub w18, w14, w18, w17
	ubfm w18, w18, #0, #7
	add w22, w15, w18
	ubfm w22, w22, #0, #7
	ldr x18, [ x20 ]
	br x18
	.size A.$wtestF_info, .-A.$wtestF_info

I haven’t bothered yet to analyze the assembly output to see how/why it’s wrong, but presumably the issue is in there somewhere.

I'm not too familiar with arm assembly, but I think this is the culprit

_blk_c15n:
	uxtb w17, w17
	uxtb w18, w18
	udiv w15, w17, w18
	sxtb w15, w15 ;; This signed extend is harmless but shouldn't be here
	sxtb w18, w18 ;; This signed extend here is outright wrong
	mul w15, w15, w18
	ubfm w15, w15, #0, #7
	udiv w14, w17, w18 ;; Causes this division to be 0 instead of 1, since it's dividing 217 by 4294967201 instead of 161
	msub w18, w14, w18, w17
	ubfm w18, w18, #0, #7
	add w22, w15, w18
	ubfm w22, w22, #0, #7
	ldr x18, [ x20 ]
	br x18
	.size A.$wtestF_info, .-A.$wtestF_info
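The arithmetic behind the two annotations can be checked in isolation. The following is a minimal sketch, not GHC code: `uxtb` and `sxtb` are hypothetical Haskell helpers that model the 32-bit forms of the instructions of the same name, applied to the divisor 161 from the test case.

```haskell
import Data.Int (Int8)
import Data.Word (Word8, Word32)

-- Models `uxtb wN, wN`: zero-extend an 8-bit value to 32 bits.
uxtb :: Word8 -> Word32
uxtb = fromIntegral

-- Models `sxtb wN, wN`: sign-extend an 8-bit value to 32 bits.
-- Going through Int8 reinterprets the byte as a signed value first.
sxtb :: Word8 -> Word32
sxtb w = fromIntegral (fromIntegral w :: Int8)

main :: IO ()
main = do
  print (uxtb 161)            -- 161
  print (sxtb 161)            -- 4294967201
  print (217 `div` uxtb 161)  -- 1, the quotient udiv should produce
  print (217 `div` sxtb 161)  -- 0, the quotient udiv actually produces
```

Any divisor with its top bit set (128..255) is affected in the same way, since sign extension turns it into a value larger than every possible 8-bit dividend.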

Now the interesting part would be: (a) does this happen with -fllvm as well, and what assembly does the LLVM codegen produce? (b) More importantly, as this is supposedly only in 9.4, what assembly do 9.6 (master) and 9.2.4 generate for this? The aarch64 NCG has been in GHC since 9.2.

I’ve just tested on 9.2.4, and indeed, the above program does not trigger the bug. However, GHC generates considerably different STG:

A.$wtestF [InlPrag=NOINLINE]
  :: GHC.Prim.Word8# -> GHC.Prim.Word8# -> GHC.Prim.Word8#
[GblId, Arity=2, Str=<L><L>, Unf=OtherCon []] =
    {} \r [ww_s12s ww1_s12t]
        case word8ToWord# [ww1_s12t] of wild_s12u {
          __DEFAULT ->
              case word8ToWord# [ww_s12s] of sat_s12v [Occ=Once1] {
              __DEFAULT ->
              case quotWord# [sat_s12v wild_s12u] of wild3_s12w [Occ=Once1] {
              __DEFAULT ->
              case word8ToWord# [ww_s12s] of sat_s12x [Occ=Once1] {
              __DEFAULT ->
              case remWord# [sat_s12x wild_s12u] of wild1_s12y [Occ=Once1] {
              __DEFAULT ->
              case and# [wild1_s12y 255##] of sat_s12C [Occ=Once1] {
              __DEFAULT ->
              case and# [wild3_s12w 255##] of sat_s12z [Occ=Once1] {
              __DEFAULT ->
              case timesWord# [sat_s12z wild_s12u] of sat_s12A [Occ=Once1] {
              __DEFAULT ->
              case and# [sat_s12A 255##] of sat_s12B [Occ=Once1] {
              __DEFAULT ->
              case plusWord# [sat_s12B sat_s12C] of sat_s12D [Occ=Once1] {
              __DEFAULT -> wordToWord8# [sat_s12D];
              };
              };
              };
              };
              };
              };
              };
              };
              };
          0## -> ww_s12s;
        };

We can write the version GHC 9.4.2 produces by hand:

{-# OPTIONS_GHC -O1 #-}
{-# LANGUAGE MagicHash #-}
module A where

import Data.Word
import GHC.Prim
import GHC.Word

wtestF :: GHC.Prim.Word8# -> GHC.Prim.Word8# -> GHC.Prim.Word8#
wtestF a b = case word8ToWord# b of
  0## -> a
  _   -> plusWord8# (timesWord8# (quotWord8# a b) b) (remWord8# a b)
{-# NOINLINE wtestF #-}

testF :: Word8 -> Word8 -> Word8
testF (W8# a) (W8# b) = W8# (wtestF a b)
{-# INLINE testF #-}

Now GHC 9.2.4 fails in precisely the same way that GHC 9.4.2 does. So it seems GHC 9.2 is still buggy, but in this case, it gets lucky.

@lexi.lambda thanks for this! This is quite concerning, indeed.

I've added some more comments to the assembly:

_blk_c15n:
    # _s153::I8 / _s154::I8
    # '- x17 -'   '- x18 -'
    # '------- x15 -------'
	uxtb w17, w17 ;; store x17[0:31] <- unsigned extend x17[0:7] -- _s153 extended to 32bit 
	uxtb w18, w18 ;; store x18[0:31] <- unsigned extend x18[0:7] -- _s154 extended to 32bit
	udiv w15, w17, w18  ;; x15[0:31] <- x17[0:31] / x18[0:31]    -- _s153 / _s154


    # _s153::I8 / _s154::I8 * _s154::I8
    # '-------- x15 ------'   '- x18 -'
    # '------------- x15 -------------'
	sxtb w15, w15 ;; store x15[0:31] <- sign extend x15[0:7] -- This signed extend is harmless but shouldn't be here
	sxtb w18, w18 ;; store x18[0:31] <- sign extend x18[0:7] -- This signed extend here is outright wrong    
	mul w15, w15, w18   ;; x15[0:31] <- x15[0:31] * x18[0:31]
	ubfm w15, w15, #0, #7 ;; x15[8:31] <- zero -- truncate byte. The unsigned bitfield move instruction breaks my brain

    # _s153::I8 % _s154::I8
    # '- x17 -'   '- x18 -'
    # '------ x18 --------'
    # remember x17 and x18 have been unsigned extended; x18 has been sign extended as well.
	udiv w14, w17, w18 ;; x14[0:31] <- x17[0:31] / x18[0:31] -- Causes this division to be 0 instead of 1, since it's dividing 217 by 4294967201 instead of 161
	msub w18, w14, w18, w17 ;; x18[0:31] <- x17[0:31] - x14[0:31] * x18[0:31]
	ubfm w18, w18, #0, #7 ;; x18[8:31] <- zero

    # _s153::I8 / _s154::I8 * _s154::I8 + _s153::I8 % _s154::I8
    # '------------- x15 -------------'   '------ x18 --------'
    # '--------------------------- x22 (R1) ------------------'
    add w22, w15, w18 ;; x22[0:31] <- x15[0:31] + x18[0:31]
	ubfm w22, w22, #0, #7 ;; x22[8:31] <- zero

    ;; load address at x20, and branch.
    ldr x18, [ x20 ] 
	br x18
	.size A.$wtestF_info, .-A.$wtestF_info

I’ve also just tested with -fllvm, and that does indeed mitigate the bug, so the issue is definitely in the NCG. However, LLVM just optimizes the entirety of wtestF away, so the assembly it produces is not particularly informative:

A_wtestF_info$def:
// %bb.0:                               // %nF6
        ldr     x8, [x20]
        and     x22, x23, #0xff
        blr     x8
        ret

If we don’t run opt first, then we instead get

A_wtestF_info$def:
// %bb.0:                               // %nF6
        str     x22, [sp]
        tst     w24, #0xff
        and     x22, x23, #0xff
        strb    w24, [sp, #12]
        strb    w23, [sp, #8]
        b.eq    .LBB0_2
// %bb.1:                               // %cEH
        ldrb    w8, [sp, #12]
        ldrb    w9, [sp, #8]
        udiv    w10, w22, w8
        udiv    w11, w9, w8
        msub    w9, w11, w8, w9
        ldr     x11, [x20]
        madd    w8, w10, w8, w9
        and     x22, x8, #0xff
        str     x22, [sp]
        blr     x11
        ret
.LBB0_2:                                // %cEI
        ldr     x8, [x20]
        str     x22, [sp]
        blr     x8
        ret

but this isn’t super meaningful, seeing as we’ve disabled the optimizer. Even just running opt -mem2reg is enough for llc -O1 to delete almost everything, and the output of llc -O0 isn’t very interesting.
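The reason the optimizer can delete everything is that the test function is an identity on its first argument: for any nonzero `b`, `(a div b) * b + (a mod b)` equals `a`, and the intermediate values never exceed `a`, so nothing wraps at 8 bits. A small sketch that checks this exhaustively over all `Word8` pairs (it restates `testF` from the test case; nothing else is assumed):

```haskell
import Data.Word (Word8)

-- Same definition as in the test case above.
testF :: Word8 -> Word8 -> Word8
testF a b
  | b == 0    = a
  | otherwise = (a `div` b) * b + (a `mod` b)

main :: IO ()
main = do
  let bytes = [minBound .. maxBound] :: [Word8]
  -- Check all 65536 input pairs; prints True on a correct compiler.
  print (and [ testF a b == a | a <- bytes, b <- bytes ])
```

This is consistent with the optimized LLVM output, which only masks the first argument to 8 bits and returns it.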

@lexi.lambda well, it gives us two insights:

1. the LLVM backend is unaffected.
2. we can aim to tune our NCG for this.

Here's the manually evaluated assembly

_blk_c15n:
    # _s153::I8 / _s154::I8
    # '- x17 -'   '- x18 -'
    # '------- x15 -------'
	uxtb w17, w17       ;; w17 <- 217 (b1101_1001)
	uxtb w18, w18       ;; w18 <- 161 (b1010_0001)
	udiv w15, w17, w18  ;; w15 <- 1   (b0000_0001)


    # _s153::I8 / _s154::I8 * _s154::I8
    # '-------- x15 ------'   '- x18 -'
    # '------------- x15 -------------'
	sxtb w15, w15       ;; w15 <- 1          (b0000_0001)
	sxtb w18, w18       ;; w18 <- 4294967201 (b1111_1111_1111_1111_1111_1111_1010_0001)
	mul w15, w15, w18   ;; w15 <- 4294967201 (b1111_1111_1111_1111_1111_1111_1010_0001)
	ubfm w15, w15, #0, #7 ;; w15 <- 161      (b1010_0001)

    # _s153::I8 % _s154::I8
    # '- x17 -'   '- x18 -'
    # '------ x18 --------'
    # remember x17 and x18 have been unsigned extended; x18 has been sign extended as well.
	udiv w14, w17, w18      ;; w14 <- 0          (b0000_0000)
	msub w18, w14, w18, w17 ;; w18 <- 217 - 0 * 4294967201 = 217 (b1101_1001)
	ubfm w18, w18, #0, #7   ;; w18 <- 217 (b1101_1001)

    # _s153::I8 / _s154::I8 * _s154::I8 + _s153::I8 % _s154::I8
    # '------------- x15 -------------'   '------ x18 --------'
    # '--------------------------- x22 (R1) ------------------'
    add w22, w15, w18     ;; w22 <- 161 + 217 = 378 (b0001_0111_1010)
	ubfm w22, w22, #0, #7 ;; w22 <- 122 (b0111_1010)

    ;; load address at x20, and branch.
    ldr x18, [ x20 ] 
	br x18
	.size A.$wtestF_info, .-A.$wtestF_info

and if we remove the erroneous line @LudvikGalois pointed out

_blk_c15n:
    # _s153::I8 / _s154::I8
    # '- x17 -'   '- x18 -'
    # '------- x15 -------'
	uxtb w17, w17       ;; w17 <- 217 (b1101_1001)
	uxtb w18, w18       ;; w18 <- 161 (b1010_0001)
	udiv w15, w17, w18  ;; w15 <- 1   (b0000_0001)


    # _s153::I8 / _s154::I8 * _s154::I8
    # '-------- x15 ------'   '- x18 -'
    # '------------- x15 -------------'
	sxtb w15, w15       ;; w15 <- 1          (b0000_0001)
	# sxtb w18, w18       ;; w18 <- 4294967201 (b1111_1111_1111_1111_1111_1111_1010_0001)
	mul w15, w15, w18     ;; w15 <- 161      (b1010_0001)
	ubfm w15, w15, #0, #7 ;; w15 <- 161      (b1010_0001)

    # _s153::I8 % _s154::I8
    # '- x17 -'   '- x18 -'
    # '------ x18 --------'
    # remember x17 and x18 have been unsigned extended; with the sxtb removed, x18 is no longer sign extended.
	udiv w14, w17, w18      ;; w14 <- 1      (b0000_0001)
	msub w18, w14, w18, w17 ;; w18 <- 217 - 1 * 161 = 56 (b0011_1000)
	ubfm w18, w18, #0, #7   ;; w18 <- 56 (b0011_1000)

    # _s153::I8 / _s154::I8 * _s154::I8 + _s153::I8 % _s154::I8
    # '------------- x15 -------------'   '------ x18 --------'
    # '--------------------------- x22 (R1) ------------------'
    add w22, w15, w18     ;; w22 <- 161 + 56 = 217 (b1101_1001)
	ubfm w22, w22, #0, #7 ;; w22 <- 217 (b1101_1001)

    ;; load address at x20, and branch.
    ldr x18, [ x20 ] 
	br x18
	.size A.$wtestF_info, .-A.$wtestF_info
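Both hand evaluations can be replayed mechanically. The sketch below is a hypothetical Haskell model of block `_blk_c15n`, not anything GHC emits: each binding mirrors one instruction on 32-bit registers, and the `Bool` flag decides whether the `sxtb w18, w18` line is executed. It assumes `b` is nonzero, as the real block is only reached in that case.

```haskell
import Data.Bits ((.&.))
import Data.Int (Int8)
import Data.Word (Word8, Word32)

-- 32-bit register models of `uxtb` / `ubfm #0, #7` and of `sxtb`.
uxtb, sxtb :: Word32 -> Word32
uxtb r = r .&. 0xff
sxtb r = fromIntegral (fromIntegral r :: Int8)

-- Model of _blk_c15n. The flag says whether `sxtb w18, w18` is executed.
blockC15n :: Bool -> Word8 -> Word8 -> Word8
blockC15n signExtendDivisor a b =
  let w17  = uxtb (fromIntegral a)                 -- uxtb w17, w17
      w18  = uxtb (fromIntegral b)                 -- uxtb w18, w18
      q    = w17 `div` w18                         -- udiv w15, w17, w18
      w15  = sxtb q                                -- sxtb w15, w15
      w18' = if signExtendDivisor then sxtb w18    -- sxtb w18, w18
                                  else w18
      prod = uxtb (w15 * w18')                     -- mul; ubfm
      w14  = w17 `div` w18'                        -- udiv w14, w17, w18
      r    = uxtb (w17 - w14 * w18')               -- msub; ubfm
  in fromIntegral (uxtb (prod + r))                -- add; ubfm

main :: IO ()
main = do
  print (blockC15n True  217 161)  -- 122, as observed on aarch64
  print (blockC15n False 217 161)  -- 217, the expected result
```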

So we have

        -- Signed multiply/divide
        MO_Mul w          -> intOp True w (\d x y -> unitOL $ MUL d x y)
        MO_S_MulMayOflo w -> do_mul_may_oflo w x y
        MO_S_Quot w       -> intOp True w (\d x y -> unitOL $ SDIV d x y)

and

          -- A (potentially signed) integer operation.
          -- In the case of 8- and 16-bit signed arithmetic we must first
          -- sign-extend both arguments to 32-bits.
          -- See Note [Signed arithmetic on AArch64].
          intOp is_signed w op = do
              -- compute x<m> <- x
              -- compute x<o> <- y
              -- <OP> x<n>, x<m>, x<o>
              (reg_x, format_x, code_x) <- getSomeReg x
              (reg_y, format_y, code_y) <- getSomeReg y
              massertPpr (isIntFormat format_x && isIntFormat format_y) $ text "intOp: non-int"
              -- This is the width of the registers on which the operation
              -- should be performed.
              let w' = opRegWidth w
                  signExt r
                    | not is_signed  = nilOL
                    | otherwise      = signExtendReg w w' r
              return $ Any (intFormat w) $ \dst ->
                  code_x `appOL`
                  code_y `appOL`
                  -- sign-extend both operands
                  signExt reg_x `appOL`
                  signExt reg_y `appOL`
                  op (OpReg w' dst) (OpReg w' reg_x) (OpReg w' reg_y) `appOL`
                  truncateReg w' w dst -- truncate back to the operand's original width

So, am I correct in assuming that what's going on is that the value in a register is being sign-extended because of the multiplication, and then re-used for the division after being mutated, with no corrective truncation? Could this be fixed by truncating reg_x and reg_y after the op if it is signed, or are they allowed to contain values of more than w bits?
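To make the question concrete, here is a toy register machine in Haskell. It is only a sketch under stated assumptions: the `Reg` and `Instr` types, the register names, and both instruction sequences are hypothetical and are not taken from GHC's NCG, and the "restoring" variant is just the idea raised above, not necessarily the fix GHC should adopt.

```haskell
import Data.Bits ((.&.))
import Data.Int (Int8)
import Data.Word (Word32)

-- Hypothetical registers: A = dividend, X and Y = operands, D = destination.
data Reg = A | X | Y | D deriving (Eq, Show)

data Instr
  = Sxtb Reg          -- sign-extend the low byte in place
  | Uxtb Reg          -- zero-extend the low byte in place
  | Mul Reg Reg Reg   -- dst, lhs, rhs
  | Udiv Reg Reg Reg  -- dst, lhs, rhs

type Regs = Reg -> Word32

set :: Reg -> Word32 -> Regs -> Regs
set r v regs r' = if r' == r then v else regs r'

step :: Regs -> Instr -> Regs
step rs (Sxtb r)     = set r (fromIntegral (fromIntegral (rs r) :: Int8)) rs
step rs (Uxtb r)     = set r (rs r .&. 0xff) rs
step rs (Mul d a b)  = set d (rs a * rs b) rs
step rs (Udiv d a b) = set d (rs a `div` rs b) rs

run :: Regs -> [Instr] -> Regs
run = foldl step

-- Signed 8-bit multiply that sign-extends its operands in place.
mulInPlace :: [Instr]
mulInPlace = [Sxtb X, Sxtb Y, Mul D X Y, Uxtb D]

-- Same, but zero-extends the operand registers again afterwards.
mulRestoring :: [Instr]
mulRestoring = mulInPlace ++ [Uxtb X, Uxtb Y]

main :: IO ()
main = do
  let start r = case r of { A -> 217; X -> 1; Y -> 161; D -> 0 }
      quotientAfter prog = run start (prog ++ [Udiv D A Y]) D
  print (quotientAfter mulInPlace)    -- 0: Y still holds the sign-extended value
  print (quotientAfter mulRestoring)  -- 1: Y is zero-extended again
```

Restoring in place costs two extra instructions per signed operation and goes wrong if the destination aliases an operand, so sign-extending into fresh temporaries may be the cleaner option.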

This is the note, for reference here:

-- Note [Signed arithmetic on AArch64]
-- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-- Handling signed arithmetic on sub-word-size values on AArch64 is a bit
-- tricky as Cmm's type system does not capture signedness. While 32-bit values
-- are fairly easy to handle due to AArch64's 32-bit instruction variants
-- (denoted by use of %wN registers), 16- and 8-bit values require quite some
-- care.
--
-- We handle 16-and 8-bit values by using the 32-bit operations and
-- sign-/zero-extending operands and truncate results as necessary. For
-- simplicity we maintain the invariant that a register containing a
-- sub-word-size value always contains the zero-extended form of that value
-- in between operations.
--
-- For instance, consider the program,
--
--    test(bits64 buffer)
--      bits8 a = bits8[buffer];
--      bits8 b = %mul(a, 42);
--      bits8 c = %not(b);
--      bits8 d = %shrl(c, 4::bits8);
--      return (d);
--    }
--
-- This program begins by loading `a` from memory, for which we use a
-- zero-extended byte-size load.  We next sign-extend `a` to 32-bits, and use a
-- 32-bit multiplication to compute `b`, and truncate the result back down to
-- 8-bits.
--
-- Next we compute `c`: The `%not` requires no extension of its operands, but
-- we must still truncate the result back down to 8-bits. Finally the `%shrl`
-- requires no extension and no truncate since we can assume that
-- `c` is zero-extended.
--
-- TODO:
--   Don't use Width in Operands
--   Instructions should rather carry a RegWidth
--
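One consequence of the invariant in the note is worth spelling out: for multiplication, the low 8 bits of the 32-bit product are the same whether the operands were sign-extended or zero-extended, because both extensions agree modulo 256. That is why the sign extensions before `mul` are unnecessary for the truncated result. A minimal sketch that checks this exhaustively (`zext`, `sext` and `low8` are hypothetical helper names, not GHC functions):

```haskell
import Data.Int (Int8)
import Data.Word (Word8, Word32)

-- Zero- and sign-extension of an 8-bit value into a 32-bit register.
zext, sext :: Word8 -> Word32
zext = fromIntegral
sext w = fromIntegral (fromIntegral w :: Int8)

-- Truncate a 32-bit register back to its low 8 bits.
low8 :: Word32 -> Word8
low8 = fromIntegral

main :: IO ()
main = do
  let bytes = [minBound .. maxBound] :: [Word8]
  -- Compare the truncated products for every pair of 8-bit operands.
  print (and [ low8 (sext a * sext b) == low8 (zext a * zext b)
             | a <- bytes, b <- bytes ])  -- True
```

The same does not hold for division, which depends on the full register contents. That is why a sign-extended operand left behind by the multiplication corrupts the later `udiv`.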

Annotated llvm assembly for completeness:

A_wtestF_info$def:
// %bb.0:                               // %nF6
        str     x22, [sp]
        tst     w24, #0xff              ;; set testbit for w24 & 0xff
        and     x22, x23, #0xff         ;; x22 <- x23 & 0xff
        strb    w24, [sp, #12]          ;; *sp+12 <- w24[0:7] -- s154::I8
        strb    w23, [sp, #8]           ;; *sp+8  <- w23[0:7] -- s153::I8
        b.eq    .LBB0_2                 ;; if w24 & 0xff == 0 then short
// %bb.1:                               // %cEH
        ldrb    w8, [sp, #12]           ;; w8[0:7] <- *sp+12  -- s154::I8
        ldrb    w9, [sp, #8]            ;; w9[0:7] <- *sp+8   -- s153::I8
        udiv    w10, w22, w8            ;; w10 = w22 / w8 = (w9 & 0xff) / w8 = (s153&0xff) / s154
        udiv    w11, w9, w8             ;; w11 = w9 / w8                     =  s153       / s154
        msub    w9, w11, w8, w9         ;; w9 <- w9 - w11 * w8               =  s153 - (s153 / s154) * s154
        ldr     x11, [x20]              ;; x11 <- *x20
        madd    w8, w10, w8, w9         ;; w8 <- w9 + w10 * w8               =  (s153 - (s153 / s154) * s154) + ((s153&0xff) / s154) * s154
        and     x22, x8, #0xff          ;; x22 <- x8 & 0xff     -- truncate to byte
        str     x22, [sp]               ;; save x22 on the stack
        blr     x11                     ;; branch to continuation
        ret
.LBB0_2:                                // %cEI
        ldr     x8, [x20]
        str     x22, [sp]
        blr     x8
        ret

way fewer extensions :-)

added incorrect runtime result label

added Phighest label and removed incorrect runtime result label

removed needs triage label

added NCG backend Tbug aarch64 incorrect runtime result labels

In 9.4 we started using sized primitives for arithmetic; in 9.2 we used word-sized primitives (Int#/Word#) and truncated the result/argument, which is likely why this doesn't happen in 9.2.

@angerman Are you looking into fixing this?

changed milestone to %9.4.3

@AndreasK I'll try to come up with a fix. I’ve asked @hsyl20 for some ideas.

If someone else has some ideas how to handle this properly in the generic case, I’m happy for a discussion!
