# Optimize large constructor updates
## Motivation
Consider code like this:

```haskell
module M where

data Large
  = L
  { x0  :: Int
    -- x1 .. x18
  , x19 :: Int
  }

updateLarge :: Large -> Int -> Large
updateLarge l n
  = l { x9 = n }
```
This compiles to code that first loads all fields from the heap into registers and stack slots, and then allocates the new constructor by storing all the fields into the new heap object:
```cmm
[M.updateLarge_entry() // [R3, R2]
        { info_tbls: [(c1Cd,
                       label: block_c1Cd_info
                       rep: StackRep [False]
                       srt: Nothing),
                      (c1Cg,
                       label: M.updateLarge_info
                       rep: HeapRep static { Fun {arity: 2 fun_type: ArgSpec 15} }
                       srt: Nothing)]
          stack_info: arg_space: 8 updfr_space: Just 8
        }
    {offset
      ...
      c1Cn: // global
          _s1w1::P64 = P64[R1 + 7];
          _s1w2::P64 = P64[R1 + 15];
          _s1w3::P64 = P64[R1 + 23];
          _s1w4::P64 = P64[R1 + 31];
          _s1w5::P64 = P64[R1 + 39];
          _s1w6::P64 = P64[R1 + 47];
          _s1w7::P64 = P64[R1 + 55];
          _s1w8::P64 = P64[R1 + 63];
          _s1w9::P64 = P64[R1 + 71];
          _s1wb::P64 = P64[R1 + 87];
          _s1wc::P64 = P64[R1 + 95];
          _s1wd::P64 = P64[R1 + 103];
          _s1we::P64 = P64[R1 + 111];
          _s1wf::P64 = P64[R1 + 119];
          _s1wg::P64 = P64[R1 + 127];
          _s1wh::P64 = P64[R1 + 135];
          _s1wi::P64 = P64[R1 + 143];
          _s1wj::P64 = P64[R1 + 151];
          _s1wk::P64 = P64[R1 + 159];
          I64[Hp - 160] = M.L_con_info;
          P64[Hp - 152] = _s1w1::P64;
          P64[Hp - 144] = _s1w2::P64;
          P64[Hp - 136] = _s1w3::P64;
          P64[Hp - 128] = _s1w4::P64;
          P64[Hp - 120] = _s1w5::P64;
          P64[Hp - 112] = _s1w6::P64;
          P64[Hp - 104] = _s1w7::P64;
          P64[Hp - 96] = _s1w8::P64;
          P64[Hp - 88] = _s1w9::P64;
          P64[Hp - 80] = P64[Sp + 8];
          P64[Hp - 72] = _s1wb::P64;
          P64[Hp - 64] = _s1wc::P64;
          P64[Hp - 56] = _s1wd::P64;
          P64[Hp - 48] = _s1we::P64;
          P64[Hp - 40] = _s1wf::P64;
          P64[Hp - 32] = _s1wg::P64;
          P64[Hp - 24] = _s1wh::P64;
          P64[Hp - 16] = _s1wi::P64;
          P64[Hp - 8] = _s1wj::P64;
          P64[Hp] = _s1wk::P64;
          R1 = Hp - 159;
          Sp = Sp + 16;
          call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
    }
```
The problem is that we end up with n live variables, where n is the number of fields. This is terrible once we run out of registers, because we then start to emit code of this pattern:

```asm
mov large, reg
mov reg, stack
# (n - free regs) times
...
mov stack, reg
mov reg, heap
# (n - free regs) times
```
But we really only need memcpy-like code:

```asm
mov large, reg
mov reg, heap
# (n times)
```
## Possible solutions
### Optimize the general pattern
The problem, at the Cmm level, is that we end up with many live variables with long live ranges. This could be solved by interleaving the loading of the fields with the stores that allocate the constructor.
We want to go from:

```cmm
_s1w1::P64 = P64[R1 + 7];
_s1w2::P64 = P64[R1 + 15];
_s1w3::P64 = P64[R1 + 23];
...
I64[Hp - 160] = M.L_con_info;
P64[Hp - 152] = _s1w1::P64;
P64[Hp - 144] = _s1w2::P64;
P64[Hp - 136] = _s1w3::P64;
...
```

to:

```cmm
_s1w1::P64 = P64[R1 + 7];
P64[Hp - 152] = _s1w1::P64;
_s1w2::P64 = P64[R1 + 15];
P64[Hp - 144] = _s1w2::P64;
...
```
In fact we already have a pass that does something like this: CmmSink. However, it currently leaves code sequences like this untouched, presumably to avoid issues with aliasing.

So in order to fix this we would likely have to propagate the fact that addresses accessed via R1 and Hp never alias. To my knowledge, in any function generated from STG/Core, Hp-relative stores never alias with non-Hp-relative loads while we are allocating an immutable object. So this seems feasible.
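As a rough illustration of the refinement (this is not GHC's actual CmmSink code; the `Addr` type and `mayAlias` function are hypothetical stand-ins for the real Cmm expression types), a conservative alias check could be extended with the Hp-vs-register fact like this:

```haskell
-- A simplified model of Cmm addressing modes. All names here are
-- hypothetical; the real pass works on CmmExpr.
data Addr
  = HpRel Int   -- access relative to the heap allocation pointer Hp
  | RegRel Int  -- access relative to a pointer register such as R1
  | Unknown     -- anything we cannot classify
  deriving Show

-- Conservative check: may a store to the first address alias a load
-- from the second? The key refinement: while allocating an immutable
-- object, an Hp-relative store cannot alias a non-Hp-relative load,
-- so sinking such a load past the store is safe.
mayAlias :: Addr -> Addr -> Bool
mayAlias (HpRel i) (HpRel j)  = i == j  -- same heap slot only
mayAlias (HpRel _) (RegRel _) = False   -- the new non-aliasing fact
mayAlias (RegRel _) (HpRel _) = False
mayAlias _         _          = True    -- stay conservative otherwise

main :: IO ()
main = do
  print (mayAlias (HpRel (-152)) (RegRel 7))  -- False: safe to interleave
  print (mayAlias Unknown (RegRel 7))         -- True: must not sink
```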
### Special case record updates
We could special-case record updates of large records: first copy the existing record wholesale, then overwrite the field(s) that changed.

I would much prefer the other option of improving alias information, as it has the potential to improve cases where similar patterns occur for different reasons. But this route might be easier.
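A minimal model of the copy-then-overwrite strategy, treating a constructor's payload as a flat list of fields (the list representation and the name `copyAndSet` are illustrative only; real codegen would emit a block copy followed by a single store):

```haskell
-- Model the "memcpy then overwrite one slot" strategy: the whole
-- payload is copied verbatim, then exactly one field is replaced.
-- Precondition (unchecked in this sketch): 0 <= i < length fields.
copyAndSet :: Int -> a -> [a] -> [a]
copyAndSet i new fields =
  let (before, _ : after) = splitAt i fields  -- copy, dropping slot i
  in  before ++ new : after                   -- single "store" of the new value

main :: IO ()
main = print (copyAndSet 2 99 [0, 1, 2, 3, 4 :: Int])  -- [0,1,99,3,4]
```

The point of the model is that the number of live variables no longer scales with the record size: only the source pointer, destination pointer, and the one updated field need to stay live.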