NCG generates slow loop code
In http://stackoverflow.com/a/23322255/263061 we compile the code
main :: IO ()
main = do
loop (maxBound :: Word32) $ \i -> do
when (i `rem` 100000000 == 0) $
putStrLn "foo"
and compare the assembly generated by the NCG and -fllvm
.
The LLVM code is 10 times faster.
While the LLVM code looks great, NCG one generated looks quite unfortunate and messy, with a lot of unnecessary mov
ing around.
I am by no means an expert in this, but some of it looks like low hanging potential improvements.
This is the LLVM code:
Main_zdwa_info:
.LBB1_1:
movl $4294967295, %esi /* load the loop bound */
movabsq $-6067343680855748867, %rdi /* load a magic number for the modulus */
jmp .LBB1_2
.LBB1_4:
incl %ecx /* loop core start */
.LBB1_2:
cmpq %rsi, %rcx
je .LBB1_6 /* check loop bound */
/* do the modulus with two multiplications, a shift and a magic number */
/* note : gcc does the same reduction */
movq %rcx, %rax
mulq %rdi
shrq $26, %rdx
imulq $100000000, %rdx, %rax
cmpq %rax, %rcx
jne .LBB1_4 /* loop core end */
/* Code omitted: print, then return to loop beginning */
.LBB1_6:
So the core of the loop is a nice and short:
incl
cmpq
je
movq [division replaced by magic multiplication]
mulq
shrq
imulq
cmpq
jne
Now what NCG produces:
Main_zdwa_info: /* loop core start */
.Lc3JD:
leaq -16(%rbp),%rax /* stack check */
cmpq %r15,%rax
jb .Lc3JO /* jump not taken in the normal case */
.Lc3JP:
movl $4294967295,%eax /* issue: loading the bound on every iteration */
cmpq %rax,%r14
jne .Lc3JB
.Lc3JC:
/* Return from main. Code omitted */
.Lc3JB: /* test the index for modulus */
movl $100000000,%eax /* issue: unnecessary moves */
movq %rax,%rbx /* and loading the modulo arg on every iteration */
movq %r14,%rax
xorq %rdx,%rdx /* shorter alternative to mov $0,%rdx */
divq %rbx /* issue: doing the division (llvm and gcc avoid this) */
testq %rdx,%rdx
jne .Lc3JU /* This jump is usually taken. */
.Lc3JV:
/* do the printing. Code omitted. Not relevant */
.Lc3JN:
/* increment index and (I guess) restore registers messed up by the printing */
movq 8(%rbp),%rax /* This code is not very relevant because we don't almost never */
incq %rax
movl %eax,%r14d
addq $16,%rbp
jmp Main_zdwa_info
.Lc3JU:
leaq 1(%r14),%rax /*issue: why not just increment r14? */
movl %eax,%r14d
jmp Main_zdwa_info
So the core of this loop is
movl [stack check]
cmpq
jne
movl
movq
movq
xorq
divq
testq
jne
leaq
movl
jmp
Some issues here:
- the stack check is inside the loop
- the loop limit is loaded every time inside the loop
- the modulo arg is loaded every time inside the loop
- we are shuffling around registers
- no strength reduction (div is not replaced by cheaper magic number imul)
- weird increment using leaq
It would be great if somebody with codegen insight could comment on whether these are easy targets for improvement!
Trac metadata
Trac field | Value |
---|---|
Version | 7.6.3 |
Type | Bug |
TypeOfFailure | CompileTimePerformance |
Priority | normal |
Resolution | Unresolved |
Component | Compiler (NCG) |
Test case | |
Differential revisions | |
BlockedBy | |
Related | |
Blocking | |
CC | mail@nh2.me, simonmar |
Operating system | |
Architecture |