Suboptimal code produced for simple Cmm expression
The following Cmm code, produced while experimenting with a new optimization for case expressions, is compiled into the assembly shown further below:
Cmm code:
==================== Post switch plan ====================
{offset
c1Ho: // global
_s1FE::P64 = R2;
if ((old + 0) - <highSp> < SpLim) (likely: False) goto c1Hp; else goto c1Hq;
c1Hp: // global
R2 = _s1FE::P64;
R1 = Main.$wgggg2_closure;
call (stg_gc_fun)(R2, R1) args: 8, res: 0, upd: 8;
c1Hq: // global
I64[(young<c1Hg> + 8)] = c1Hg;
R1 = _s1FE::P64;
if (R1 & 7 != 0) goto c1Hg; else goto c1Hh;
c1Hh: // global
call (I64[R1])(R1) returns to c1Hg, args: 8, res: 8, upd: 8;
c1Hg: // global
_s1FF::P64 = R1;
_c1Hn::P64 = _s1FF::P64 & 7;
_u1HA::I64 = ((528 >> _c1Hn::P64) >> _c1Hn::P64) & 3;
if (_u1HA::I64 == 1) goto c1Hl; else goto u1HB;
c1Hl: // global
R1 = 33;
call (P64[(old + 8)])(R1) args: 8, res: 0, upd: 8;
u1HB: // global
if (_u1HA::I64 == 2) goto c1Hm; else goto c1Hk;
c1Hm: // global
R1 = 440;
call (P64[(old + 8)])(R1) args: 8, res: 0, upd: 8;
c1Hk: // global
R1 = 44;
call (P64[(old + 8)])(R1) args: 8, res: 0, upd: 8;
}
Of interest is how the block starting at label c1Hg is compiled. We get the following code:
===========================================
movl $528,%eax
movq %rbx,%rcx <-- (***)
shrq %cl,%rax
movq %rbx,%rcx <-- (***)
shrq %cl,%rax
andl $3,%eax
cmpq $1,%rax
je .Lc1Hl
===========================================
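At issue is the duplicated instruction marked with (***): %rcx is loaded from %rbx twice, even though nothing between the two shifts modifies %rcx, so the second movq looks unnecessary. Why is it repeated? For comparison, if we look at the assembly produced by gcc for the following C program: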
unsigned long __attribute__((noinline)) h(unsigned long x) {
return ((528ul >> x) >> x) & 3ul;
}
we get:
movl $528, %eax
shrq %cl, %rax
shrq %cl, %rax
andl $3, %eax
which looks optimal.
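
For readers unfamiliar with the expression in block c1Hg, here is a small C sketch of what it computes. Shifting right by the tag twice is a shift by 2*tag bits in total, so the constant 528 acts as a packed table of 2-bit selectors indexed by the pointer tag. The function names `selector` and `result_for_tag` are hypothetical and appear nowhere in GHC; the branch results 33, 440 and 44 are taken from the Cmm dump above.

```c
#include <assert.h>
#include <stdio.h>

/* Same expression as in Cmm block c1Hg: shift 528 right by the tag twice
   (2*tag bits in total) and keep the low two bits.
   The tag is the low three bits of the pointer (R1 & 7), so it is < 8. */
unsigned long selector(unsigned long tag) {
    assert(tag < 8);
    return ((528ul >> tag) >> tag) & 3ul;
}

/* Hypothetical helper: the value each Cmm branch stores in R1. */
unsigned long result_for_tag(unsigned long tag) {
    switch (selector(tag)) {
    case 1:  return 33;   /* block c1Hl */
    case 2:  return 440;  /* block c1Hm */
    default: return 44;   /* block c1Hk */
    }
}

/* Print the selector and the resulting R1 value for every 3-bit tag. */
void print_table(void) {
    for (unsigned long tag = 0; tag < 8; tag++)
        printf("tag %lu -> selector %lu -> R1 = %lu\n",
               tag, selector(tag), result_for_tag(tag));
}
```

Calling `print_table()` from a `main` function lists all eight cases. Only tags 2 and 4 select a non-default branch, because 528 has just bits 4 and 9 set.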