Bad interaction between CmmSink limitations and unlifted data types.
NB: I ran into this with the tag inference patch. But case statements on unlifted data types will behave the same way as inferred tagged bindings.
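For context, here is a minimal sketch of the kind of code involved (hypothetical names, assuming GHC 9.2+ with UnliftedDatatypes): scrutinising an unlifted value needs no eval call, because unlifted values are always already in WHNF, so only the tag check remains.

```haskell
{-# LANGUAGE UnliftedDatatypes #-}
{-# LANGUAGE StandaloneKindSignatures #-}
module Main where

import GHC.Exts (UnliftedType)

-- Hypothetical stand-in for the ItemType-style enumeration in the dumps below.
type Item :: UnliftedType
data Item = A | B | C

-- Case on an unlifted scrutinee: no "call (I64[R1])(R1)" eval is emitted,
-- since the value is known to be evaluated; only the tag branch remains.
classify :: Item -> Int
classify it = case it of
  A -> 0
  B -> 1
  C -> 2

main :: IO ()
main = print (classify B)
```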
We are looking at this Cmm after stack pointer layout:
{offset
c5BD:
_<stackCheck>
c5zT:
_s3H6::P64 = P64[_s3SB::P64 + 7];
<more loads>
...
<more loads>
_s3Hb::I64 = I64[_s3SB::P64 + 151];
_c5zW::I64 = %MO_S_Ge_W64(_s3SC::I64, 0);
_s3SD::I64 = _c5zW::I64;
....
c5G0: // global
_s3SF::P64 = I64[Main.ItemType_closure_tbl + (_s3SC::I64 << 3)];
I64[Sp - 184] = c5BI;
R1 = _s3SF::P64;
P64[Sp - 176] = _s3GY::P64;
<moreStores>
...
<moreStores>
P64[Sp - 8] = _s3SF::P64;
Sp = Sp - 184;
// perform tag check and evaluate the closure if required
if (R1 & 7 != 0) goto c5BI; else goto c5G1;
c5G1: // global
call (I64[R1])(R1) returns to c5BI, args: 8, res: 8, upd: 8;
c5BI: // global
_s3GZ::P64 = P64[Sp + 16];
_s3Hb::I64 = I64[Sp + 64];
_s3Hd::I64 = I64[Sp + 72];
_s3Hi::P64 = P64[Sp + 96];
_s3Hk::P64 = P64[Sp + 104];
_s3Hm::P64 = P64[Sp + 112];
_s3Ho::P64 = P64[Sp + 120];
_s3SB::P64 = P64[Sp + 160];
_s3SC::I64 = I64[Sp + 168];
_s3Tz::P64 = R1;
_c5GG::P64 = _s3Tz::P64 & 7;
// Use the now evaluated value
if (_c5GG::P64 < 3) goto u5GK; else goto u5GL;
...
...
// later on we have another call but all live regs are already spilled so nothing to do.
call (I64[R1])(R1)
We run CmmSink and all the loads get sunk into the stores! Great! We get something like this:
c5G0: // global
...
P64[Sp - 176] = P64[_s3SB::P64 + 63];
P64[Sp - 168] = P64[_s3SB::P64 + 47];
P64[Sp - 160] = P64[_s3SB::P64 + 71];
P64[Sp - 152] = P64[_s3SB::P64 + 7];
P64[Sp - 144] = P64[_s3SB::P64 + 39];
P64[Sp - 136] = P64[_s3SB::P64 + 23];
...
Issues arise when we replace the lifted type that was evaluated in c5G1 with an unlifted one. Why? The first call (which forced the early spills) disappears. So we get this Cmm:
c4Z0: // global
_s3Gp::P64 = P64[_s3Kj::P64 + 7];
_s3GX::P64 = P64[_s3Kj::P64 + 15];
// lots of loads
_s3Gw::P64 = P64[_s3Kj::P64 + 87];
_s3GD::I64 = I64[_s3Kj::P64 + 119];
_s3Kn::P64 = I64[(_s3Kk::I64 << 3) + Main.ItemType_closure_tbl];
_c4Zf::P64 = _s3Kn::P64 & 7;
if (_c4Zf::P64 < 3) goto u4Zm; else goto u4Zn;
u4Zm: // global
if (_c4Zf::P64 < 2) goto c4YA; else goto c4YE;
c4YA: // global
_s3Kp::I64 = I64[_s3Kj::P64 + 127];
goto s3Ko;
c4YE: // global
I64[Sp - 128] = c4YD;
R1 = P64[_s3Kj::P64 + 95];
P64[Sp - 120] = _s3Gp::P64;
P64[Sp - 112] = _s3Gq::P64;
// lots of spills
I64[Sp - 16] = _s3Kk::I64;
P64[Sp - 8] = _s3Kn::P64;
Sp = Sp - 128;
if (R1 & 7 != 0) goto c4YD; else goto c4YF;
c4YF: // global
call (I64[R1])(R1) returns to c4YD, args: 8, res: 8, upd: 8;
This actually is better in some ways: we have eliminated the first call (no need to evaluate an unlifted thing!).
However it means all loaded variables are live for a long stretch. In the case I was looking at we had 19 live variables loaded from the closure initially (see #20333).
This means we shuffle most of them:
- From the Haskell heap into a reg
- From the reg onto the C stack
- From the C stack back into a reg
- From the reg onto the Haskell stack
- And eventually we end up loading them again when they are used.
This is in contrast to the code with a lifted type, which only went heap -> reg -> Haskell stack.
For tag inference this caused one benchmark in nofib to be slower by about 20%. As (hopefully) more people pick up unlifted types, they are sadly also bound to run into this.
Why does this happen?
Our CFG looks something like this with lifted types:
digraph G {
load_vars->call1;
call1->alt1;
call1->alt2;
alt2->whatever[label="..."]
alt1->call2;
}
With lifted types we force a spill before the branch point and reload the values after it. This allows us to sink the initial loads into the spills, because the loaded variables are dead after that use site.
With unlifted types we eliminate call1. So spilling is only required (and only happens) after control flow has branched off. CmmSink then refuses to sink the loads into the branches, as they are non-trivial and duplicating them could duplicate work.
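For contrast, the CFG with the unlifted type looks roughly like this (a sketch in the same notation as the graph above; call1 is gone, so the branch happens before any spill):

```dot
digraph G {
load_vars->branch;
branch->alt1;
branch->alt2;
alt2->whatever[label="..."]
alt1->call2;
}
```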
We still eventually spill them to the Haskell stack anyway, at least on one branch. This means we end up first spilling them to the C stack, and later reloading them from the C stack just to spill them onto the Haskell stack.
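The work-avoidance rule at play can be sketched as a toy predicate (a hypothetical simplification for illustration, not GHC's actual CmmSink code): only trivial right-hand sides, literals and register reads, are safe to sink into or past a branch; a memory load counts as real work and stays put once its use sites sit in different branches.

```haskell
module Main where

-- Toy model of a Cmm right-hand side (hypothetical, simplified).
data Expr
  = Lit Int          -- literal
  | Reg String       -- register read
  | Load Expr        -- memory load, e.g. P64[_s3SB + 7]
  | BinOp Expr Expr  -- arithmetic on subexpressions

-- A trivial expression is free to duplicate or sink past a branch point.
isTrivial :: Expr -> Bool
isTrivial (Lit _) = True
isTrivial (Reg _) = True
isTrivial _       = False

-- Sink an assignment into a branch only when that cannot duplicate work.
canSinkPastBranch :: Expr -> Bool
canSinkPastBranch = isTrivial

main :: IO ()
main = do
  print (canSinkPastBranch (Reg "_s3SC"))        -- register read: sinkable
  print (canSinkPastBranch (Load (Reg "_s3SB"))) -- memory load: stays put
```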
I'm not yet sure how to resolve this.