CmmToAsm/x86: Poor unrolling for unaligned memcpy
Compile the following module, and look at the generated assembly:
```haskell
module Memcpy where

import Foreign
import Data.Word

fun :: Ptr Word8 -> Ptr Word8 -> IO ()
fun p q = copyBytes p q 16
```
When I compile this with `ghc-9.8.2 -O -dno-typeable-binds Memcpy.hs -S` on my x86-64 machine, this is what I see:
```asm
.section .text
.align 8
.align 8
.quad 12884901903
.quad 0
.long 14
.long 0
.globl Memcpy_fun1_info
.type Memcpy_fun1_info, @function
Memcpy_fun1_info:
.LcRu:
leaq -16(%rbp),%rax
cmpq %r15,%rax
jb .LcRy
.LcRz:
movq $.LcRr_info,-16(%rbp)
movq %r14,%rbx
movq %rsi,-8(%rbp)
addq $-16,%rbp
testb $7,%bl
jne .LcRr
.LcRs:
jmp *(%rbx)
.align 8
.quad 1
.long 30
.long 0
.LcRr_info:
.LcRr:
movq $.LcRx_info,(%rbp)
movq 7(%rbx),%rax
movq 8(%rbp),%rbx
movq %rax,8(%rbp)
testb $7,%bl
jne .LcRx
.LcRB:
jmp *(%rbx)
.align 8
.quad 65
.long 30
.long 0
.LcRx_info:
.LcRx:
movq 8(%rbp),%rax
movq 7(%rbx),%rbx
;; !!!
movb 0(%rbx),%cl
movb %cl,0(%rax)
movb 1(%rbx),%cl
movb %cl,1(%rax)
movb 2(%rbx),%cl
movb %cl,2(%rax)
movb 3(%rbx),%cl
movb %cl,3(%rax)
movb 4(%rbx),%cl
movb %cl,4(%rax)
movb 5(%rbx),%cl
movb %cl,5(%rax)
movb 6(%rbx),%cl
movb %cl,6(%rax)
movb 7(%rbx),%cl
movb %cl,7(%rax)
movb 8(%rbx),%cl
movb %cl,8(%rax)
movb 9(%rbx),%cl
movb %cl,9(%rax)
movb 10(%rbx),%cl
movb %cl,10(%rax)
movb 11(%rbx),%cl
movb %cl,11(%rax)
movb 12(%rbx),%cl
movb %cl,12(%rax)
movb 13(%rbx),%cl
movb %cl,13(%rax)
movb 14(%rbx),%cl
movb %cl,14(%rax)
movb 15(%rbx),%bl
movb %bl,15(%rax)
leaq ghczmprim_GHCziTupleziPrim_Z0T_closure+1(%rip),%rbx
addq $16,%rbp
jmp *(%rbp)
.LcRy:
leaq Memcpy_fun1_closure(%rip),%rbx
jmp *-8(%r13)
.size Memcpy_fun1_info, .-Memcpy_fun1_info
.section .data
.align 8
.align 1
.globl Memcpy_fun1_closure
.type Memcpy_fun1_closure, @object
Memcpy_fun1_closure:
.quad Memcpy_fun1_info
.section .text
.align 8
.align 8
.quad 12884901903
.quad 0
.long 14
.long 0
.globl Memcpy_fun_info
.type Memcpy_fun_info, @function
Memcpy_fun_info:
.LcRU:
jmp Memcpy_fun1_info
.size Memcpy_fun_info, .-Memcpy_fun_info
.section .data
.align 8
.align 1
.globl Memcpy_fun_closure
.type Memcpy_fun_closure, @object
Memcpy_fun_closure:
.quad Memcpy_fun_info
.section .note.GNU-stack,"",@progbits
.ident "GHC 9.8.2"
```
Notice the 32 single-byte `movb` instructions used to implement the copy. (I have marked this with `!!!`.)
We are probably using single-byte move instructions because the given `Ptr`s can have arbitrary alignment. But on x86-64, the penalty for an unaligned `movq` is typically at most a few cycles. It would be much better for both speed and code size to emit four eight-byte move instructions, or even two unaligned 16-byte move instructions. (SSE and SSE2, which we assume are available on every x86-64 machine, provide a few of the latter, e.g. `movups` and `movdqu`.) And indeed I see the latter if I also add `-fllvm`.
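For concreteness, here is a hand-written sketch of what the marked copy could look like (not actual GHC or LLVM output; register choices merely mirror the listing above, where `%rbx` holds the source and `%rax` the destination):

```asm
        ;; four eight-byte moves (two load/store pairs);
        ;; unaligned movq is fine on x86-64:
        movq 0(%rbx),%rcx
        movq %rcx,0(%rax)
        movq 8(%rbx),%rcx
        movq %rcx,8(%rax)

        ;; or two unaligned 16-byte SSE moves:
        movups 0(%rbx),%xmm0
        movups %xmm0,0(%rax)
```

Either variant replaces the 32 `movb` instructions with at most four, at the cost of tolerating a possible unaligned-access penalty of a few cycles.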
I haven't checked, but I would not be surprised if similar opportunities for improvement exist on other platforms.