Bad interaction between CmmSink limitations and unlifted data types.
NB: I ran into this with the tag inference patch. But case statements on unlifted data types will behave the same way as inferred tagged bindings.
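For context, here is a minimal sketch of the kind of code involved (hypothetical names, assuming GHC 9.2+ with UnliftedDatatypes): scrutinising an unlifted value needs no eval call, because unlifted values are always already in WHNF, so only the tag check remains.

```haskell
{-# LANGUAGE UnliftedDatatypes #-}
{-# LANGUAGE StandaloneKindSignatures #-}
module Main where

import GHC.Exts (UnliftedType)

-- Hypothetical stand-in for the ItemType-style enumeration in the dumps below.
type Item :: UnliftedType
data Item = A | B | C

-- Case on an unlifted scrutinee: no "call (I64[R1])(R1)" eval is emitted,
-- since the value is known to be evaluated; only the tag branch remains.
classify :: Item -> Int
classify it = case it of
  A -> 0
  B -> 1
  C -> 2

main :: IO ()
main = print (classify B)
```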
We are looking at this Cmm after stack pointer layout:
{offset
c5BD:
_<stackCheck>
c5zT:
_s3H6::P64 = P64[_s3SB::P64 + 7];
<more loads>
...
<more loads>
_s3Hb::I64 = I64[_s3SB::P64 + 151];
_c5zW::I64 = %MO_S_Ge_W64(_s3SC::I64, 0);
_s3SD::I64 = _c5zW::I64;
....
c5G0: // global
_s3SF::P64 = I64[Main.ItemType_closure_tbl + (_s3SC::I64 << 3)];
I64[Sp - 184] = c5BI;
R1 = _s3SF::P64;
P64[Sp - 176] = _s3GY::P64;
<moreStores>
...
<moreStores>
P64[Sp - 8] = _s3SF::P64;
Sp = Sp - 184;
// perform tag check and evaluate the closure if required
if (R1 & 7 != 0) goto c5BI; else goto c5G1;
c5G1: // global
call (I64[R1])(R1) returns to c5BI, args: 8, res: 8, upd: 8;
c5BI: // global
_s3GZ::P64 = P64[Sp + 16];
_s3Hb::I64 = I64[Sp + 64];
_s3Hd::I64 = I64[Sp + 72];
_s3Hi::P64 = P64[Sp + 96];
_s3Hk::P64 = P64[Sp + 104];
_s3Hm::P64 = P64[Sp + 112];
_s3Ho::P64 = P64[Sp + 120];
_s3SB::P64 = P64[Sp + 160];
_s3SC::I64 = I64[Sp + 168];
_s3Tz::P64 = R1;
_c5GG::P64 = _s3Tz::P64 & 7;
// Use the now evaluated value
if (_c5GG::P64 < 3) goto u5GK; else goto u5GL;
...
...
// later on we have another call but all live regs are already spilled so nothing to do.
call (I64[R1])(R1)
We run CmmSink and all the loads get sunk into the stores! Great! We get something like this:
c5G0: // global
...
P64[Sp - 176] = P64[_s3SB::P64 + 63];
P64[Sp - 168] = P64[_s3SB::P64 + 47];
P64[Sp - 160] = P64[_s3SB::P64 + 71];
P64[Sp - 152] = P64[_s3SB::P64 + 7];
P64[Sp - 144] = P64[_s3SB::P64 + 39];
P64[Sp - 136] = P64[_s3SB::P64 + 23];
...
Issues arise when we replace the lifted type that was evaluated in c5G1 with an unlifted one. Why? The first call (which forced the early spills) disappears. So we get this Cmm:
c4Z0: // global
_s3Gp::P64 = P64[_s3Kj::P64 + 7];
_s3GX::P64 = P64[_s3Kj::P64 + 15];
// lots of loads
_s3Gw::P64 = P64[_s3Kj::P64 + 87];
_s3GD::I64 = I64[_s3Kj::P64 + 119];
_s3Kn::P64 = I64[(_s3Kk::I64 << 3) + Main.ItemType_closure_tbl];
_c4Zf::P64 = _s3Kn::P64 & 7;
if (_c4Zf::P64 < 3) goto u4Zm; else goto u4Zn;
u4Zm: // global
if (_c4Zf::P64 < 2) goto c4YA; else goto c4YE;
c4YA: // global
_s3Kp::I64 = I64[_s3Kj::P64 + 127];
goto s3Ko;
c4YE: // global
I64[Sp - 128] = c4YD;
R1 = P64[_s3Kj::P64 + 95];
P64[Sp - 120] = _s3Gp::P64;
P64[Sp - 112] = _s3Gq::P64;
// lots of spills
I64[Sp - 16] = _s3Kk::I64;
P64[Sp - 8] = _s3Kn::P64;
Sp = Sp - 128;
if (R1 & 7 != 0) goto c4YD; else goto c4YF;
c4YF: // global
call (I64[R1])(R1) returns to c4YD, args: 8, res: 8, upd: 8;
This actually is better in some ways: we have eliminated the first call (no need to evaluate an unlifted thing!).
However it means all loaded variables are live for a long stretch. In the case I was looking at we had 19 live variables loaded from the closure initially (see #20333).
This means we shuffle most of them:
- From the Haskell heap into a reg
- From the reg onto the C stack
- From the C stack back into a reg
- From the reg onto the Haskell stack
- And eventually we end up loading them again when they are used.
This is in contrast to the code with a lifted type, which only went heap -> reg -> Haskell stack.
For tag inference this caused one benchmark in nofib to be slower by about 20%. As (hopefully) more people pick up unlifted types, they are sadly also bound to run into this.
Why does this happen?
Our CFG looks something like this with lifted types:
digraph G {
load_vars->call1;
call1->alt1;
call1->alt2;
alt2->whatever[label="..."]
alt1->call2;
}
With lifted types we force a spill before the branch point and reload the values after it. This allows us to sink the initial loads into the spills, because the loaded variables are dead after that use site.
With unlifted types we eliminate call1. So spilling is only required (and only happens) after control flow has branched off. CmmSink then refuses to sink the loads into the branches, as they are non-trivial and duplicating them could duplicate work.
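For contrast, the CFG with the unlifted type looks roughly like this (a sketch in the same notation as the graph above; call1 is gone, so the branch happens before any spill):

```dot
digraph G {
load_vars->branch;
branch->alt1;
branch->alt2;
alt2->whatever[label="..."]
alt1->call2;
}
```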
We still eventually spill them to the Haskell stack anyway, at least on one branch. This means we end up first spilling them to the C stack, and later reloading them from the C stack just to spill them onto the Haskell stack.
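The work-avoidance rule at play can be sketched as a toy predicate (a hypothetical simplification for illustration, not GHC's actual CmmSink code): only trivial right-hand sides, literals and register reads, are safe to sink into or past a branch; a memory load counts as real work and stays put once its use sites sit in different branches.

```haskell
module Main where

-- Toy model of a Cmm right-hand side (hypothetical, simplified).
data Expr
  = Lit Int          -- literal
  | Reg String       -- register read
  | Load Expr        -- memory load, e.g. P64[_s3SB + 7]
  | BinOp Expr Expr  -- arithmetic on subexpressions

-- A trivial expression is free to duplicate or sink past a branch point.
isTrivial :: Expr -> Bool
isTrivial (Lit _) = True
isTrivial (Reg _) = True
isTrivial _       = False

-- Sink an assignment into a branch only when that cannot duplicate work.
canSinkPastBranch :: Expr -> Bool
canSinkPastBranch = isTrivial

main :: IO ()
main = do
  print (canSinkPastBranch (Reg "_s3SC"))        -- register read: sinkable
  print (canSinkPastBranch (Load (Reg "_s3SB"))) -- memory load: stays put
```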
I'm not yet sure how to resolve this.