Make code generated by `dataToTag#` more compact (and efficient, hopefully)
I was looking into the code generated for `dataToTag#`. I defined `blue = dataToTag#` and was surprised by how convoluted the Cmm I got is:
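For reference, the module I compiled was essentially the following sketch (the module name `Lib` is assumed to match the `Lib.blue_closure` symbol in the Cmm below; on older GHCs `dataToTag#` is exported from `GHC.Exts` without a class constraint):

```haskell
{-# LANGUAGE MagicHash #-}
module Lib (blue) where

import GHC.Exts (Int#, dataToTag#)

-- Eta-contracted use of the primop; compile with -O and
-- -ddump-cmm / -ddump-asm to see the output discussed below.
blue :: a -> Int#
blue = dataToTag#
```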
```
 c19O: // global
     if ((Sp + -8) < SpLim) (likely: False) goto c19P; else goto c19Q;
 c19P: // global
     R2 = R2;
     R1 = Lib.blue_closure;
     call (stg_gc_fun)(R2, R1) args: 8, res: 0, upd: 8;
 c19Q: // global
     _c19D::P64 = R2 & 7;
     if (_c19D::P64 == 0) (likely: False) goto c19N; else goto c19M;
 c19N: // global
     I64[Sp - 8] = c19G;
     R1 = R2;
     Sp = Sp - 8;
     if (R1 & 7 != 0) goto c19G; else goto c19H;
 c19H: // global
     call (I64[R1])(R1) returns to c19G, args: 8, res: 8, upd: 8;
 c19G: // global
     _c19E::I64 = %MO_UU_Conv_W32_W64(I32[I64[R1 & (-8)] - 4]);
     goto c19L;
 c19M: // global
     if (_c19D::P64 == 7) (likely: False) goto c19K; else goto c19J;
 c19K: // global
     Sp = Sp - 8;
     _c19E::I64 = %MO_UU_Conv_W32_W64(I32[I64[R2 & (-8)] - 4]);
     goto c19L;
 c19J: // global
     Sp = Sp - 8;
     _c19E::I64 = _c19D::P64 - 1;
     goto c19L;
 c19L: // global
     R1 = _c19E::I64;
     Sp = Sp + 8;
     call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
```
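To spell out what this Cmm computes: with 3 pointer-tag bits on a 64-bit target, tag 0 means the closure is untagged (possibly a thunk that must be evaluated first), tags 1–6 encode the constructor index plus one, and tag 7 means the constructor family is too large, so the tag must be read from the info table. A minimal Haskell sketch of that decoding logic (`enterAndRetag` and `readInfoTableTag` are hypothetical stand-ins for the two slow paths):

```haskell
import Data.Bits ((.&.))
import Data.Word (Word64)

-- Sketch only: decodes a tagged pointer the way the Cmm above does.
dataToTagSketch :: Word64 -> Word64
dataToTagSketch p = case p .&. 7 of
  0 -> enterAndRetag p     -- untagged: may be a thunk, evaluate first (block c19N)
  7 -> readInfoTableTag p  -- tag saturated: >6 constructors, consult info table (c19K)
  t -> t - 1               -- hot path: tag bits are constructor index + 1 (c19J)
  where
    enterAndRetag, readInfoTableTag :: Word64 -> Word64
    enterAndRetag _    = error "slow path: evaluate closure, then retry"
    readInfoTableTag _ = error "slow path: read tag from the info table"
```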
That's an awful lot of code for such an (allegedly) simple function. I get that it cares about large constructor families, but note that this is the critical path of the efficient case, where the argument carries a tag in the range 1–6 (down to the generated ASM):
```
_blk_c19O:
    leaq -8(%rbp),%rax
    cmpq %r15,%rax
    jb _blk_c19P
_blk_c19Q:
    movq %r14,%rax
    andl $7,%eax
    testq %rax,%rax
    je _blk_c19N
_blk_c19M:
    cmpq $7,%rax
    je _blk_c19K
_blk_c19J:
    addq $-8,%rbp
    decq %rax
_blk_c19L:
    movq %rax,%rbx
    addq $8,%rbp
    jmp *(%rbp)
```
1. The stack check is unnecessary, I think. The hot path shouldn't need any stack space.
2. The growing and the shrinking of the stack are unnecessary and cancel each other out.
3. There are two other conditional jumps. Hard to improve here.
4. There is some shuffling of local variables going on that the register allocator doesn't eliminate completely. It appears to fail to propagate the register constraint `%rbx` onto the value representing the tag.
(1) and (2) are easily solved by extracting the slow "have to do evaluation" path into a shared, separate definition. (3) could be improved by encoding the "tag too large" case as tag == 1, so that a single `cmpq $2,%rax; jb` (or `ja`? AT&T syntax is confusing) suffices. I imagine (4) is a side effect of the register allocator getting thrown off by the unlikely code paths. In summary, I'd really like to get the following code inlined unconditionally at call sites:
```
_blk_c19Q:
    movq %r14,%rbx
    andl $7,%ebx          // btw.: is relying on andl not clobbering the upper bits of %rbx a good idea?
    cmpq $2,%rbx
    jb shared_eval_case   // or ja, not sure
    subq $2,%rbx          // 2, because of (3) above
    jmp *(%rbp)           // will not actually be present if inlined at call site, I suppose...
```
Sooo much better.