Make code generated by `dataToTag#` more compact (and efficient, hopefully)
I was looking into the code generated for `dataToTag#`. I defined `blue = dataToTag#` and was surprised by how convoluted the Cmm I got is:
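For reference, the module I compiled was essentially the following sketch (the module name `Lib` is assumed to match the `Lib.blue_closure` symbol in the Cmm below; on older GHCs `dataToTag#` is exported from `GHC.Exts` without a class constraint):

```haskell
{-# LANGUAGE MagicHash #-}
module Lib (blue) where

import GHC.Exts (Int#, dataToTag#)

-- Eta-contracted use of the primop; compile with -O and
-- -ddump-cmm / -ddump-asm to see the output discussed below.
blue :: a -> Int#
blue = dataToTag#
```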
```
 c19O: // global
     if ((Sp + -8) < SpLim) (likely: False) goto c19P; else goto c19Q;
 c19P: // global
     R2 = R2;
     R1 = Lib.blue_closure;
     call (stg_gc_fun)(R2, R1) args: 8, res: 0, upd: 8;
 c19Q: // global
     _c19D::P64 = R2 & 7;
     if (_c19D::P64 == 0) (likely: False) goto c19N; else goto c19M;
 c19N: // global
     I64[Sp - 8] = c19G;
     R1 = R2;
     Sp = Sp - 8;
     if (R1 & 7 != 0) goto c19G; else goto c19H;
 c19H: // global
     call (I64[R1])(R1) returns to c19G, args: 8, res: 8, upd: 8;
 c19G: // global
     _c19E::I64 = %MO_UU_Conv_W32_W64(I32[I64[R1 & (-8)] - 4]);
     goto c19L;
 c19M: // global
     if (_c19D::P64 == 7) (likely: False) goto c19K; else goto c19J;
 c19K: // global
     Sp = Sp - 8;
     _c19E::I64 = %MO_UU_Conv_W32_W64(I32[I64[R2 & (-8)] - 4]);
     goto c19L;
 c19J: // global
     Sp = Sp - 8;
     _c19E::I64 = _c19D::P64 - 1;
     goto c19L;
 c19L: // global
     R1 = _c19E::I64;
     Sp = Sp + 8;
     call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
```
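To spell out what this Cmm computes: with 3 pointer-tag bits on a 64-bit target, tag 0 means the closure is untagged (possibly a thunk that must be evaluated first), tags 1–6 encode the constructor index plus one, and tag 7 means the constructor family is too large, so the tag must be read from the info table. A minimal Haskell sketch of that decoding logic (`enterAndRetag` and `readInfoTableTag` are hypothetical stand-ins for the two slow paths):

```haskell
import Data.Bits ((.&.))
import Data.Word (Word64)

-- Sketch only: decodes a tagged pointer the way the Cmm above does.
dataToTagSketch :: Word64 -> Word64
dataToTagSketch p = case p .&. 7 of
  0 -> enterAndRetag p     -- untagged: may be a thunk, evaluate first (block c19N)
  7 -> readInfoTableTag p  -- tag saturated: >6 constructors, consult info table (c19K)
  t -> t - 1               -- hot path: tag bits are constructor index + 1 (c19J)
  where
    enterAndRetag, readInfoTableTag :: Word64 -> Word64
    enterAndRetag _    = error "slow path: evaluate closure, then retry"
    readInfoTableTag _ = error "slow path: read tag from the info table"
```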
That's an awful lot of code for such an (allegedly) simple function. I get that it cares about large constructor families, but note that this is the critical path of the efficient case, where the argument carries a tag in the range 1–6 (down to the generated ASM):
```
_blk_c19O:
    leaq -8(%rbp),%rax
    cmpq %r15,%rax
    jb _blk_c19P
_blk_c19Q:
    movq %r14,%rax
    andl $7,%eax
    testq %rax,%rax
    je _blk_c19N
_blk_c19M:
    cmpq $7,%rax
    je _blk_c19K
_blk_c19J:
    addq $-8,%rbp
    decq %rax
_blk_c19L:
    movq %rax,%rbx
    addq $8,%rbp
    jmp *(%rbp)
```
1. The stack check is unnecessary, I think. The hot path shouldn't need any stack space.
2. The growing and the shrinking of the stack are unnecessary and cancel each other out.
3. There are two other conditional jumps. Hard to improve here.
4. There is some shuffling of local variables going on that the register allocator doesn't eliminate completely. It appears to fail to propagate the register constraint `%rbx` onto the value representing the tag.
(1) and (2) are easily solved by extracting the slow "have to do evaluation" path into a shared, separate definition. (3) could be improved by encoding the "tag too large" case as tag == 1, so that a single `cmpq $2,%rax; jb` (or `ja`? AT&T syntax is confusing) suffices. I imagine (4) is a side effect of the register allocator getting thrown off by the unlikely code paths. In summary, I'd really like to get the following code inlined unconditionally at call sites:
```
_blk_c19Q:
    movq %r14,%rbx
    andl $7,%ebx          // btw.: is relying on andl not clobbering the upper bits of %rbx a good idea?
    cmpq $2,%rbx
    jb shared_eval_case   // or ja, not sure
    subq $2,%rbx          // 2, because of (3) above
    jmp *(%rbp)           // will not actually be present if inlined at call site, I suppose...
```
Sooo much better.