Make code generated by `dataToTag#` more compact (and efficient, hopefully)
I was looking into the code generated for `dataToTag#`. I defined `blue = dataToTag#` and was surprised by how convoluted the Cmm I got is:
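For reference, the module I compiled was essentially the following sketch (the module name `Lib` is assumed to match the `Lib.blue_closure` symbol in the Cmm below; on older GHCs `dataToTag#` is exported from `GHC.Exts` without a class constraint):

```haskell
{-# LANGUAGE MagicHash #-}
module Lib (blue) where

import GHC.Exts (Int#, dataToTag#)

-- Eta-contracted use of the primop; compile with -O and
-- -ddump-cmm / -ddump-asm to see the output discussed below.
blue :: a -> Int#
blue = dataToTag#
```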
```
 c19O: // global
     if ((Sp + -8) < SpLim) (likely: False) goto c19P; else goto c19Q;
 c19P: // global
     R2 = R2;
     R1 = Lib.blue_closure;
     call (stg_gc_fun)(R2, R1) args: 8, res: 0, upd: 8;
 c19Q: // global
     _c19D::P64 = R2 & 7;
     if (_c19D::P64 == 0) (likely: False) goto c19N; else goto c19M;
 c19N: // global
     I64[Sp - 8] = c19G;
     R1 = R2;
     Sp = Sp - 8;
     if (R1 & 7 != 0) goto c19G; else goto c19H;
 c19H: // global
     call (I64[R1])(R1) returns to c19G, args: 8, res: 8, upd: 8;
 c19G: // global
     _c19E::I64 = %MO_UU_Conv_W32_W64(I32[I64[R1 & (-8)] - 4]);
     goto c19L;
 c19M: // global
     if (_c19D::P64 == 7) (likely: False) goto c19K; else goto c19J;
 c19K: // global
     Sp = Sp - 8;
     _c19E::I64 = %MO_UU_Conv_W32_W64(I32[I64[R2 & (-8)] - 4]);
     goto c19L;
 c19J: // global
     Sp = Sp - 8;
     _c19E::I64 = _c19D::P64 - 1;
     goto c19L;
 c19L: // global
     R1 = _c19E::I64;
     Sp = Sp + 8;
     call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
```
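To spell out what this Cmm computes: with 3 pointer-tag bits on a 64-bit target, tag 0 means the closure is untagged (possibly a thunk that must be evaluated first), tags 1–6 encode the constructor index plus one, and tag 7 means the constructor family is too large, so the tag must be read from the info table. A minimal Haskell sketch of that decoding logic (`enterAndRetag` and `readInfoTableTag` are hypothetical stand-ins for the two slow paths):

```haskell
import Data.Bits ((.&.))
import Data.Word (Word64)

-- Sketch only: decodes a tagged pointer the way the Cmm above does.
dataToTagSketch :: Word64 -> Word64
dataToTagSketch p = case p .&. 7 of
  0 -> enterAndRetag p     -- untagged: may be a thunk, evaluate first (block c19N)
  7 -> readInfoTableTag p  -- tag saturated: >6 constructors, consult info table (c19K)
  t -> t - 1               -- hot path: tag bits are constructor index + 1 (c19J)
  where
    enterAndRetag, readInfoTableTag :: Word64 -> Word64
    enterAndRetag _    = error "slow path: evaluate closure, then retry"
    readInfoTableTag _ = error "slow path: read tag from the info table"
```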
That's an awful lot of code for such an (allegedly) simple function. I get that it cares about large constructor families, but note that this is the critical path of the efficient case, where the argument carries a tag in the range 1–6 (down to the generated ASM):
```
_blk_c19O:
    leaq -8(%rbp),%rax
    cmpq %r15,%rax
    jb _blk_c19P
_blk_c19Q:
    movq %r14,%rax
    andl $7,%eax
    testq %rax,%rax
    je _blk_c19N
_blk_c19M:
    cmpq $7,%rax
    je _blk_c19K
_blk_c19J:
    addq $-8,%rbp
    decq %rax
_blk_c19L:
    movq %rax,%rbx
    addq $8,%rbp
    jmp *(%rbp)
```
1. The stack check is unnecessary, I think. The hot path shouldn't need any stack space.
2. The growing and the shrinking of the stack are unnecessary and cancel each other out.
3. There are two other conditional jumps. Hard to improve here.
4. There is some shuffling of local variables going on that the register allocator doesn't eliminate completely. It appears to fail to propagate the register constraint `%rbx` onto the value representing the tag.
(1) and (2) are easily solved by extracting the slow "have to do evaluation" path into a shared, separate definition. (3) could be improved by encoding the "tag too large" case as tag == 1, so that a single `cmpq $2,%rax; jb` (or `ja`? AT&T syntax is confusing) suffices. I imagine (4) is a side effect of the register allocator getting thrown off by the unlikely code paths. In summary, I'd really like to get the following code inlined unconditionally at call sites:
```
_blk_c19Q:
    movq %r14,%rbx
    andl $7,%ebx          // btw.: is relying on andl not clobbering the upper bits of %rbx a good idea?
    cmpq $2,%rbx
    jb shared_eval_case   // or ja, not sure
    subq $2,%rbx          // 2, because of (3) above
    jmp *(%rbp)           // will not actually be present if inlined at call site, I suppose...
```
Sooo much better.