Codegen on sequential FFI calls is not very good

I'm writing a library for efficiently building up a byte buffer. The fastest approach I've found is to call into C via the FFI, with effects restricted in the style of ST. It's over twice as fast as ByteString Builder.

Consider this example API usage: https://github.com/chadaustin/buffer-builder/blob/6bd0a39c56f63ab751faf29f9784ac87d52638be/bench/Bench.hs#L46

It compiles into an instruction sequence containing direct, sequenced FFI calls. For example, the last three calls work out to:

addq $8,%rsp
movq %rbx,%rdi
movq 72(%rsp),%rax
movq %rax,%rsi
subq $8,%rsp
movl $0,%eax
call bw_append_bsz

addq $8,%rsp
movq %rbx,%rdi
movl $35,%esi
subq $8,%rsp
movl $0,%eax
call bw_append_byte

addq $8,%rsp
movq %rbx,%rdi
movq 64(%rsp),%rax
movq %rax,%rsi
subq $8,%rsp
movl $0,%eax
call bw_append_bsz

I don't know why %rsp is adjusted before and after every call. I also can't explain the assignment to %eax before each call. (If it is needed at all, I would expect xorl %eax,%eax rather than movl $0,%eax.)

To my reading, the above instruction sequence could be reduced to:

movq %rbx,%rdi
movq 64(%rsp),%rsi
call bw_append_bsz

movq %rbx,%rdi
movl $35,%esi
call bw_append_byte

movq %rbx,%rdi
movq 56(%rsp),%rsi
call bw_append_bsz
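
For comparison, here is a minimal C sketch of the same three-call shape, which can be compiled to assembly and read next to the GHC output. The function signatures are inferred from the register usage above (writer pointer in %rdi, byte or string pointer in %rsi). The BufferWriter layout, the bw_reserve helper, and the append_three wrapper are hypothetical stand-ins for illustration, not the library's actual code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the library's writer; the real layout may differ. */
typedef struct {
    uint8_t *data;
    size_t size;
    size_t capacity;
} BufferWriter;

/* Hypothetical helper: grow the buffer so that `extra` more bytes fit. */
static void bw_reserve(BufferWriter *bw, size_t extra) {
    if (bw->size + extra <= bw->capacity) return;
    size_t cap = bw->capacity ? bw->capacity : 16;
    while (cap < bw->size + extra) cap *= 2;
    uint8_t *p = realloc(bw->data, cap);
    if (!p) abort();
    bw->data = p;
    bw->capacity = cap;
}

/* Append a single byte. */
void bw_append_byte(BufferWriter *bw, uint8_t b) {
    bw_reserve(bw, 1);
    bw->data[bw->size++] = b;
}

/* Append a NUL-terminated string, without its terminator. */
void bw_append_bsz(BufferWriter *bw, const char *s) {
    assert(s != NULL);
    size_t n = strlen(s);
    bw_reserve(bw, n);
    memcpy(bw->data + bw->size, s, n);
    bw->size += n;
}

/* Same shape as the three calls above: string, byte 35 ('#'), string. */
void append_three(BufferWriter *bw, const char *a, const char *b) {
    bw_append_bsz(bw, a);
    bw_append_byte(bw, 35);
    bw_append_bsz(bw, b);
}
```

Compiling this with something like cc -O2 -S gives a C compiler's version of the call sequence to compare against the dump above.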

To reproduce, check out git@github.com:chadaustin/buffer-builder.git at revision 6bd0a39c56f63ab751faf29f9784ac87d52638be

cabal configure --enable-benchmarks
cabal bench

Then look at the generated assembly in ./dist/build/bench/bench-tmp/bench/Bench.dump-asm.

This is specifically on OS X 64-bit with GHC 7.8.3, but I saw similar code generation on GHC 7.6 on Linux 64-bit.

Edited Mar 10, 2019 by Ben Gamari
