CmmToAsm/x86: Poor unrolling for unaligned memcpy

Compile the following module, and look at the generated assembly:

module Memcpy where

import Foreign
import Data.Word

fun :: Ptr Word8 -> Ptr Word8 -> IO ()
fun p q = copyBytes p q 16

When I compile with ghc-9.8.2 -O -dno-typeable-binds Memcpy.hs -S on my x86-64 machine, this is what I see:

.section .text
.align 8
.align 8
        .quad   12884901903
        .quad   0
        .long   14
        .long   0
.globl Memcpy_fun1_info
.type Memcpy_fun1_info, @function
Memcpy_fun1_info:
.LcRu:
        leaq -16(%rbp),%rax
        cmpq %r15,%rax
        jb .LcRy
.LcRz:
        movq $.LcRr_info,-16(%rbp)
        movq %r14,%rbx
        movq %rsi,-8(%rbp)
        addq $-16,%rbp
        testb $7,%bl
        jne .LcRr
.LcRs:
        jmp *(%rbx)
.align 8
        .quad   1
        .long   30
        .long   0
.LcRr_info:
.LcRr:
        movq $.LcRx_info,(%rbp)
        movq 7(%rbx),%rax
        movq 8(%rbp),%rbx
        movq %rax,8(%rbp)
        testb $7,%bl
        jne .LcRx
.LcRB:
        jmp *(%rbx)
.align 8
        .quad   65
        .long   30
        .long   0
.LcRx_info:
.LcRx:
        movq 8(%rbp),%rax
        movq 7(%rbx),%rbx
        ;; !!!
        movb 0(%rbx),%cl
        movb %cl,0(%rax)
        movb 1(%rbx),%cl
        movb %cl,1(%rax)
        movb 2(%rbx),%cl
        movb %cl,2(%rax)
        movb 3(%rbx),%cl
        movb %cl,3(%rax)
        movb 4(%rbx),%cl
        movb %cl,4(%rax)
        movb 5(%rbx),%cl
        movb %cl,5(%rax)
        movb 6(%rbx),%cl
        movb %cl,6(%rax)
        movb 7(%rbx),%cl
        movb %cl,7(%rax)
        movb 8(%rbx),%cl
        movb %cl,8(%rax)
        movb 9(%rbx),%cl
        movb %cl,9(%rax)
        movb 10(%rbx),%cl
        movb %cl,10(%rax)
        movb 11(%rbx),%cl
        movb %cl,11(%rax)
        movb 12(%rbx),%cl
        movb %cl,12(%rax)
        movb 13(%rbx),%cl
        movb %cl,13(%rax)
        movb 14(%rbx),%cl
        movb %cl,14(%rax)
        movb 15(%rbx),%bl
        movb %bl,15(%rax)
        leaq ghczmprim_GHCziTupleziPrim_Z0T_closure+1(%rip),%rbx
        addq $16,%rbp
        jmp *(%rbp)
.LcRy:
        leaq Memcpy_fun1_closure(%rip),%rbx
        jmp *-8(%r13)
        .size Memcpy_fun1_info, .-Memcpy_fun1_info
.section .data
.align 8
.align 1
.globl Memcpy_fun1_closure
.type Memcpy_fun1_closure, @object
Memcpy_fun1_closure:
        .quad   Memcpy_fun1_info
.section .text
.align 8
.align 8
        .quad   12884901903
        .quad   0
        .long   14
        .long   0
.globl Memcpy_fun_info
.type Memcpy_fun_info, @function
Memcpy_fun_info:
.LcRU:
        jmp Memcpy_fun1_info
        .size Memcpy_fun_info, .-Memcpy_fun_info
.section .data
.align 8
.align 1
.globl Memcpy_fun_closure
.type Memcpy_fun_closure, @object
Memcpy_fun_closure:
        .quad   Memcpy_fun_info
.section .note.GNU-stack,"",@progbits
.ident "GHC 9.8.2"

Notice the 32 single-byte movb instructions used to implement the copy. (I have marked this with !!!.)

We are probably using single-byte move instructions because the given Ptrs can have arbitrary alignment. But on x86-64, the penalty for an unaligned movq is typically at most a few cycles, so it would be much better, for both speed and code size, to emit four eight-byte moves or even two unaligned 16-byte moves (SSE's movups or SSE2's movdqu, both of which we assume are available on every x86-64 machine). And indeed I see the latter if I also pass -fllvm.
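Until the NCG does this lowering itself, a user can sidestep the byte-by-byte copy by expressing the copy as word-sized loads and stores, which the native code generator already lowers to plain movq. This is only a hedged workaround sketch, not the proposed compiler fix; copy16 is a hypothetical helper name, and note that Storable's peek/poke nominally assume suitably aligned pointers, although unaligned movq is fine on x86-64:

```haskell
import Foreign
import Data.Word

-- Copy 16 bytes as two 64-bit loads and stores instead of calling
-- copyBytes. The NCG compiles each peek/poke of a Word64 to a single
-- movq, so this avoids the 32 movb instructions shown above.
copy16 :: Ptr Word8 -> Ptr Word8 -> IO ()
copy16 p q = do
  lo <- peekByteOff q 0 :: IO Word64
  hi <- peekByteOff q 8 :: IO Word64
  pokeByteOff p 0 lo
  pokeByteOff p 8 hi
```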

I haven't checked, but I would not be surprised if similar opportunities for improvement exist on other platforms.
