Remove target dependent CPP for Word64/Int64 (#11470)

@Ericson2314 This is a stripped down version of your !3658 (closed). I don't change the type of Word64/Int64 to avoid having to add new primops (e.g. quotRemWord64# which I had trouble adding in the past iirc).

added 21 commits

22c7ce58...d73131b9 - 20 commits from branch ghc:master
fb2b6de2 - Remove target dependent CPP for Word64/Int64 (#11470)

Compare with previous version

added 1 commit

aff2fa2b - Remove target dependent CPP for Word64/Int64 (#11470)

Compare with previous version

added 1 commit

9df2be23 - Remove target dependent CPP for Word64/Int64 (#11470)

Compare with previous version

added 1 commit

3e0c6085 - Remove target dependent CPP for Word64/Int64 (#11470)

Compare with previous version

@hsyl20 Wow! Thank you!! Really great to finally be done with this and, have two parallel cross and primop tracks henceforth.

You're welcome :)

There was one regression: it turned out that I forgot to add some constant-folding rules for Word64/Int64! Now fixed too.

mentioned in merge request !5965 (merged)

added 13 commits

3e0c6085...f6f24515 - 12 commits from branch ghc:master
a008dcc0 - Remove target dependent CPP for Word64/Int64 (#11470)

Compare with previous version

added 2 commits

ad30dc1e - Remove target dependent CPP for Word64/Int64 (#11470)
635629b9 - Add missing Int64/Word64 constant-folding rules

Compare with previous version

added 1 commit

e1d3eb50 - Add missing Int64/Word64 constant-folding rules

Compare with previous version

Sadly it seems there are some (spurious?) performance shifts in T12545. Otherwise looks great!

Probably spurious as usual with T12545 when it's below 5% ;)

-3.1% except with unregisterised build (+4.1%)

However the 32-bit build is broken. I'll look into it.

approved this merge request

assigned to @hsyl20

mentioned in commit 8d43878e

mentioned in merge request !6819 (closed)

mentioned in commit 312c2177

mentioned in merge request !6833 (closed)

Small update: I have been trying to debug this on 32-bit arch. The following simple test fails with the stage1 compiler:

main = do
  print (succ (minBound :: Int64))     -- prints 0!
  print (succ (I64# (intToInt64# 13#)) -- prints minBound for Int64

(compiled without -O, otherwise constant folding kicks in and it returns the right result).

Interestingly, hs_add64 gets called with completely bogus arguments. I'll try to find what is causing this.

The following STG:

GHC.Int.$fEnumInt64_$csucc :: GHC.Int.Int64 -> GHC.Int.Int64
[GblId, Arity=1, Str=<1P(1L)>, Cpr=1, Unf=OtherCon []] =
    {} \r [x_s4IZ]
        case x_s4IZ of {
        GHC.Int.I64# x1_s4J1 [Occ=Once1!] ->
        case x1_s4J1 of wild1_s4J2 [Occ=Once1] {
          __DEFAULT ->
              case plusInt64# [wild1_s4J2 1#64] of sat_s4J3 [Occ=Once1] {
              __DEFAULT -> GHC.Int.I64# [sat_s4J3];
              };
          9223372036854775807#64 -> GHC.Int.$fEnumInt9;
        };
        };

turns into the following assembly:

GHC.Int.$fEnumInt64_$csucc_info:
_blk_c8Rz:
        movl (%ebp),%eax
        movl $block_c8Rw_info,(%ebp)
        movl %eax,%esi
        testw $3,%si
        jne _blk_c8Rw
_blk_c8Rx:
        jmp *(%esi)
.align 4,0x90
        .long   GHC.Int.$fEnumInt9_closure-(block_c8Rw_info)+0
        .long   0
        .word   30
        .word   1
block_c8Rw_info:
_blk_c8Rw:
        addl $12,%edi
        cmpl 804(%ebx),%edi
        ja _blk_c8RG
_blk_c8RF:
        movl 3(%esi),%eax             ; EAX = 13
        movl 7(%esi),%ecx             ; ECX = 0
        movl $4294967295,%edx         ; EDX = 0xffffffff
        movl %ecx,64(%esp)            ; stack40 = 0
        movl $2147483647,%ecx         ; ECX = 0x7fffffff
        movl %edx,76(%esp)            ; stack4C = 0xffffffff
        movl 64(%esp),%edx            ; EDX = stack40 = 0
        xorl %ecx,%edx                ; EDX = 0 xor 0x7fffffff = 0x7fffffff
        movl 76(%esp),%ecx            ; ECX = stack4C = 0xffffffff
        xorl %ecx,%eax                ; EAX = 13 xor 0xffffffff = 0xfffffff2
        orl %edx,%eax                 ; EAX = 0xfffffff2 or 0x7fffffff = 0xffffffff
        jne _blk_c8RQ                 ; not zero, jump
_blk_c8RR:
        addl $-12,%edi
        movl $GHC.Int.$fEnumInt9_closure,%esi
        addl $4,%ebp
        jmp *(%esi)
_blk_c8RG:
        movl $12,828(%ebx)
        jmp stg_gc_unpt_r1
_blk_c8RQ:
        subl $12,%esp                ; sub 0xc to stack offset
        movl $1,%ecx                 ; ECX = 1
        movl %edx,76(%esp)           ; stack40 = 0x7fffffff
        movl $0,%edx                 ; EDX = 0
        pushl %edx                   ; push (1 :: Int64) arg
        pushl %ecx
        movl 84(%esp),%edx           ; EDX = stack40 = 0x7fffffff
        pushl %edx                   ; EAX = 0xffffffff, push garbage 0x7fffffffffffffff!
        pushl %eax
        call hs_add64
        addl $28,%esp
        movl $GHC.Int.I64#_con_info,-8(%edi)
        movl %eax,-4(%edi)
        movl %edx,(%edi)
        leal -7(%edi),%esi
        addl $4,%ebp
        jmp *(%ebp)
        .size GHC.Int.$fEnumInt64_$csucc_info, .-GHC.Int.$fEnumInt64_$csucc_info

Comments describe the execution of print (succ (I64# (intToInt64# 13#)). As we can see, registers containing 13 :: Int64# are clobbered before the call to hs_add64.

What happens:

an added rule transforms: eqInt64# x constant into case x of { DEFAULT -> False# ; constant -> True# }
switch implementation in GHC.Cmm.Switch.Implement is unaware that on 32-bit architectures Int64# eq/ne primops are lowered into MO_x64_Eq/MO_x64_Ne instead of MO_Eq/MO_Ne (cf GHC.StgToCmm.Prim.emitPrimOp). It directly generates the following code:

let scrut = CmmMachOp (MO_Ne width) [expr, CmmLit $ CmmInt i width]

       c8RF: // global
           _s4J2::I64 = I64[R1 + 3];
           if (_s4J2::I64 != 9223372036854775807 :: W64) goto c8RQ; else goto c8RR;
       c8RQ: // global
           (_c8RJ::I64) = call MO_x64_Add(_s4J2::I64, 1 :: W64);

it turns out that the NCG knows how to generate comparisons for 64-bit integers on 32-bit architectures (see Note [64-bit integer comparisons on 32-bit] in GHC.CmmToAsm.X86.CodeGen (why aren't we always using them?)). But it seems to compute intermediate results in the original registers instead of in fresh ones, clobbering them for future use:

        cmpExact =
          toOL
            [ XOR II32 (OpReg r2_hi) (OpReg r1_hi)
            , XOR II32 (OpReg r2_lo) (OpReg r1_lo)
            , OR  II32 (OpReg r1_hi)  (OpReg r1_lo)
            , JXX cond true
            , JXX ALWAYS false
            ]

Register liveness confirms this:

       c8RF:
                movl 3(%esi),%vI_n8RT
                    # born:    %vI_n8RT

                movl 7(%esi),%vHi_H8RT
                    # born:    %vHi_H8RT

                movl %vI_n8RT,%vI_s4J2
                    # born:    %vI_s4J2
                    # r_dying: %vI_n8RT

                movl %vHi_H8RT,%vHi_H4J2
                    # born:    %vHi_H4J2
                    # r_dying: %vHi_H8RT

                movl $4294967295,%vI_n8RU
                    # born:    %vI_n8RU

                movl $2147483647,%vHi_H8RU
                    # born:    %vHi_H8RU

                xorl %vHi_H8RU,%vHi_H4J2
                    # r_dying: %vHi_H8RU

                xorl %vI_n8RU,%vI_s4J2
                    # r_dying: %vI_n8RU

                orl %vHi_H4J2,%vI_s4J2

                jne _blk_c8RQ
                    # r_dying: %vHi_H4J2 %vI_s4J2

                jmp _blk_c8RR

     NONREC
        c8RG:
                movl $12,828(%ebx)

                jmp stg_gc_unpt_r1
                     ,
     NONREC
        c8RQ:
                subl $12,%esp

                movl $1,%vI_n8RV
                    # born:    %vI_n8RV

                movl $0,%vHi_H8RV
                    # born:    %vHi_H8RV

                pushl %vHi_H8RV
                    # r_dying: %vHi_H8RV

                pushl %vI_n8RV
                    # r_dying: %vI_n8RV

                pushl %vHi_H4J2
                    # r_dying: %vHi_H4J2

                pushl %vI_s4J2
                    # r_dying: %vI_s4J2

                call hs_add64
                    # born:    %r0 %r2 %r3 %r16 %r17 %r18 %r19 %r20 %r21 %r22 %r23
                    # w_dying: %r2 %r16 %r17 %r18 %r19 %r20 %r21 %r22 %r23

I'll push a fix.

added 19 commits

e1d3eb50...0f7541dc - 16 commits from branch ghc:master
20726dce - Remove target dependent CPP for Word64/Int64 (#11470)
eb1224c6 - Add missing Int64/Word64 constant-folding rules
14ea1d5a - i386: fix codegen of 64-bit comparisons

Compare with previous version

Remove target dependent CPP for Word64/Int64 (#11470)

Merge request reports

Activity