Codegen on sequential FFI calls is not very good
I'm writing a library for efficiently building up a byte buffer. The fastest approach I've found is via FFI, with restricted effects like ST. It's over twice as fast as ByteString Builder.
Consider this example API usage: https://github.com/chadaustin/buffer-builder/blob/6bd0a39c56f63ab751faf29f9784ac87d52638be/bench/Bench.hs#L46
It compiles into an instruction sequence containing direct, sequenced FFI calls. For example, the last three calls work out to:
    addq $8,%rsp
    movq %rbx,%rdi
    movq 72(%rsp),%rax
    movq %rax,%rsi
    subq $8,%rsp
    movl $0,%eax
    call bw_append_bsz
    addq $8,%rsp
    movq %rbx,%rdi
    movl $35,%esi
    subq $8,%rsp
    movl $0,%eax
    call bw_append_byte
    addq $8,%rsp
    movq %rbx,%rdi
    movq 64(%rsp),%rax
    movq %rax,%rsi
    subq $8,%rsp
    movl $0,%eax
    call bw_append_bsz
I don't know why %rsp is being changed so much. I also can't explain the assignment to %eax before the call. (It should also be xorl %eax,%eax, I would think.)
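One possible explanation for the %eax write, offered as a guess rather than a confirmed diagnosis: the System V x86-64 calling convention requires the caller of a variadic function to put the number of vector registers used for arguments in %al. A code generator that does not know whether a foreign function is variadic would have to set it on every call. The C sketch below shows the same pattern from a C compiler. `sum_ints` and `demo_call` are hypothetical names for illustration and have nothing to do with buffer-builder.

```c
#include <stdarg.h>

/* Hypothetical variadic helper (not part of buffer-builder):
   returns the sum of `count` int arguments. */
int sum_ints(int count, ...)
{
    va_list args;
    int total = 0;

    va_start(args, count);
    for (int i = 0; i < count; i++) {
        total += va_arg(args, int);
    }
    va_end(args);
    return total;
}

/* Call site of interest: no floating-point arguments are passed, so under
   the System V x86-64 ABI the caller sets %al to 0 before the call. */
int demo_call(void)
{
    return sum_ints(3, 1, 2, 35);
}
```

Compiling this with `cc -O0 -S` on x86-64 Linux or OS X and reading the body of `demo_call` should show %eax being zeroed immediately before `call sum_ints`. With optimization enabled the compiler may fold the call away entirely.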
To my reading, the above instruction sequence could be reduced to:
    movq %rbx,%rdi
    movq 64(%rsp),%rsi
    call bw_append_bsz
    movq %rbx,%rdi
    movl $35,%esi
    call bw_append_byte
    movq %rbx,%rdi
    movq 56(%rsp),%rsi
    call bw_append_bsz
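As a baseline, the same three-call shape can be written directly in C and run through a C compiler to see what it emits. This is only a sketch: the `BufferWriter` layout, both function signatures, and the `append_three` wrapper are assumptions made for illustration, not the library's actual definitions.

```c
#include <stddef.h>

/* Hypothetical stand-in for the library's writer state. */
typedef struct {
    unsigned char data[256];
    size_t size;
} BufferWriter;

/* Assumed signature: append one byte, dropping it if the buffer is full. */
void bw_append_byte(BufferWriter *bw, unsigned char byte)
{
    if (bw->size < sizeof bw->data) {
        bw->data[bw->size++] = byte;
    }
}

/* Assumed signature: append a NUL-terminated byte string, excluding the NUL. */
void bw_append_bsz(BufferWriter *bw, const unsigned char *bytes)
{
    while (*bytes != '\0') {
        bw_append_byte(bw, *bytes++);
    }
}

/* Same call shape as the tail of the benchmark: bsz, byte 35, bsz. */
void append_three(BufferWriter *bw,
                  const unsigned char *first,
                  const unsigned char *second)
{
    bw_append_bsz(bw, first);
    bw_append_byte(bw, 35);
    bw_append_bsz(bw, second);
}
```

Compiling with `cc -O2 -fno-inline -S` and reading `append_three` gives a reference point for how many instructions a C compiler spends per call. Because these functions are prototyped and not variadic, I would not expect the %eax write to appear there.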
To reproduce, check out https://github.com/chadaustin/buffer-builder.git at revision 6bd0a39c56f63ab751faf29f9784ac87d52638be and run:
    cabal configure --enable-benchmarks
    cabal bench
And then look at the
This is specifically on OS X 64-bit with GHC 7.8.3, but I saw similar code generation on GHC 7.6 on Linux 64-bit.