NCG generates slow loop code

In http://stackoverflow.com/a/23322255/263061 we compile the code

main :: IO ()
main = do
  loop (maxBound :: Word32) $ \i -> do
    when (i `rem` 100000000 == 0) $
      putStrLn "foo"

and compare the assembly generated by the NCG and -fllvm.

The LLVM code is 10 times faster.

While the LLVM code looks great, NCG one generated looks quite unfortunate and messy, with a lot of unnecessary moving around.

I am by no means an expert in this, but some of it looks like low hanging potential improvements.

This is the LLVM code:

 Main_zdwa_info:
.LBB1_1:
    movl    $4294967295, %esi           /* load the loop bound */
    movabsq $-6067343680855748867, %rdi /* load a magic number for the modulus */
    jmp .LBB1_2
.LBB1_4:              
    incl    %ecx                        /* loop core start */
.LBB1_2:  
    cmpq    %rsi, %rcx
    je  .LBB1_6                         /* check loop bound */

    /* do the modulus with two multiplications, a shift and a magic number */
    /* note : gcc does the same reduction */ 
    movq    %rcx, %rax
    mulq    %rdi
    shrq    $26, %rdx
    imulq   $100000000, %rdx, %rax  
    cmpq    %rax, %rcx
    jne .LBB1_4                         /* loop core end */
    /* Code omitted: print, then return to loop beginning */
.LBB1_6:

So the core of the loop is a nice and short:

incl
cmpq
je

movq [division replaced by magic multiplication]
mulq
shrq
imulq

cmpq
jne

Now what NCG produces:

Main_zdwa_info:           /* loop core start */
.Lc3JD:
    leaq -16(%rbp),%rax   /* stack check */
    cmpq %r15,%rax
    jb .Lc3JO             /* jump not taken in the normal case */
.Lc3JP:
    movl $4294967295,%eax /* issue: loading the bound on every iteration */
    cmpq %rax,%r14
    jne .Lc3JB
.Lc3JC:
   /* Return from main. Code omitted */
.Lc3JB: /* test the index for modulus */
    movl $100000000,%eax  /* issue: unnecessary moves */
    movq %rax,%rbx        /* and loading the modulo arg on every iteration */
    movq %r14,%rax
    xorq %rdx,%rdx        /* shorter alternative to mov $0,%rdx */
    divq %rbx             /* issue: doing the division (llvm and gcc avoid this) */
    testq %rdx,%rdx
    jne .Lc3JU            /* This jump is usually taken. */
.Lc3JV: 
   /* do the printing. Code omitted. Not relevant */
.Lc3JN:
   /* increment index and (I guess) restore registers messed up by the printing */
    movq 8(%rbp),%rax     /* This code is not very relevant because we don't almost never */
    incq %rax  
    movl %eax,%r14d
    addq $16,%rbp
    jmp Main_zdwa_info
.Lc3JU:
    leaq 1(%r14),%rax   /*issue: why not just increment r14? */
    movl %eax,%r14d     
    jmp Main_zdwa_info

So the core of this loop is

movl [stack check]
cmpq
jne

movl
movq
movq
xorq
divq
testq
jne

leaq
movl
jmp

Some issues here:

the stack check is inside the loop
the loop limit is loaded every time inside the loop
the modulo arg is loaded every time inside the loop
we are shuffling around registers
no strength reduction (div is not replaced by cheaper magic number imul)
weird increment using leaq

It would be great if somebody with codegen insight could comment on whether these are easy targets for improvement!

Trac metadata

Trac field	Value
Version	7.6.3
Type	Bug
TypeOfFailure	CompileTimePerformance
Priority	normal
Resolution	Unresolved
Component	Compiler (NCG)
Test case
Differential revisions
BlockedBy
Related
Blocking
CC	mail@nh2.me, simonmar
Operating system
Architecture

To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information

Admin message

NCG generates slow loop code