During STG->Cmm we should not always be so eager to load free variables
What's the problem?
GHC potentially generates up to n
redundant reads/loads from memory for a closure with n
free variables.
How so?
Consider this test case:
foo !x1 !x2 !x3 !x4 !x5 list = map f list
where
f y = y + x1 + x2 + x3 + x4 :: Int
Here the binder f
turns into the following STG binder:
let {
sat_s297 [Occ=Once1] :: GHC.Types.Int -> GHC.Types.Int
[LclId] =
{ipv3_s28Y, ipv2_s28W, ipv1_s28U, ipv_s28S} \r [y_s290]
case y_s290 of {
GHC.Types.I# x_s292 [Occ=Once1] ->
case +# [x_s292 ipv_s28S] of sat_s293 [Occ=Once1] {
__DEFAULT ->
case +# [sat_s293 ipv1_s28U] of sat_s294 [Occ=Once1] {
__DEFAULT ->
case +# [sat_s294 ipv2_s28W] of sat_s295 [Occ=Once1] {
__DEFAULT ->
case +# [sat_s295 ipv3_s28Y] of sat_s296 [Occ=Once1] {
__DEFAULT -> GHC.Types.I# [sat_s296];
};
};
};
};
};
} in
So far so good. What do we then expect here? The code should:
- evaluate the argument
- load each free variable
- add them all up.
But in all its grace, GHC also loads, spills, and then reloads each free variable.
Here is the final Cmm:
[sat_s297_entry() { // [R2, R1]
{ info_tbls: [(c29D,
label: block_c29D_info
rep: StackRep [True, True, True, True]
srt: Nothing),
(c29G,
label: sat_s297_info
rep: HeapRep 4 nonptrs { Fun {arity: 1 fun_type: ArgSpec 5} }
srt: Nothing)]
stack_info: arg_space: 8
}
{offset
c29G: // global
if ((Sp + -40) < SpLim) (likely: False) goto c29W; else goto c29X;
c29W: // global
R2 = R2;
R1 = R1;
call (stg_gc_fun)(R2, R1) args: 8, res: 0, upd: 8;
c29X: // global
I64[Sp - 40] = c29D;
_s28Y::I64 = I64[R1 + 7];
_s28W::I64 = I64[R1 + 15];
_s28U::I64 = I64[R1 + 23];
_s28S::I64 = I64[R1 + 31];
R1 = R2;
I64[Sp - 32] = _s28S::I64;
I64[Sp - 24] = _s28U::I64;
I64[Sp - 16] = _s28W::I64;
I64[Sp - 8] = _s28Y::I64;
Sp = Sp - 40;
if (R1 & 7 != 0) goto c29D; else goto c29E;
c29E: // global
... evaluate x, goto c29Z
c29Z: // global
_s296::I64 = I64[R1 + 7] + (I64[Sp + 8] + (I64[Sp + 16] + (I64[Sp + 24] + I64[Sp + 32])));
I64[Hp - 8] = GHC.Types.I#_con_info;
I64[Hp] = _s296::I64;
R1 = Hp - 7;
Sp = Sp + 40;
call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
}
Analysis
The problem is that when generating code for a closure, the first thing we emit after the heap/stack checks is code that dumps all of the closure's free variables into local variables.
However, there are benefits to the current approach.
Benefits of eager loading.
Potential reduction in heap usage. We potentially turn the closure into garbage sooner if it's not retained by anything but itself. However, I expect this doesn't translate to less space usage: we just shift our memory use to the stack (at a cost!), though I suppose we do save one word.
In branching code, duplicating the loading code for free variables can also cause an increase in code size, with all the usual associated costs. The redundant loading/spilling also increases code size. But once the branching factor grows past some point (3?), loading free variables early should be a clear win.
We don't have to keep the pointer to the closure around, freeing up a variable. I guess that will always be a win, but it only matters for fairly small closures.
Downsides.
For recursive lets the heap-usage reduction is moot. They generally keep themselves alive, so we rarely realize a space saving. In fact, space use will at times be worse, as we do both: retain the original closure and copy the free variables onto the stack for use in the current execution.
Solutions?
Our goals are:
Minimize heap usage by freeing closures as soon as possible.
- The current approach is best-in-class here.
- But if we look at combined heap and stack use, then storing all free variables on the stack is at best the same as retaining the closure. Maybe worse, because currently we sometimes retain the closure anyway.
Minimize code size by avoiding the duplication of loads for free variables.
- The current approach might seem best here, but it really isn't always.
- We often have calls along the execution path which means we have to spill the free variables loaded from the closure, and then have to reload them from the stack again for a later use. This is at best as good as keeping them in the closure.
Minimize executed code by avoiding redundant loads/spills.
- The current code can be pretty horrid here (see the example above). It shouldn't be hard to beat the current approach in this regard.
Implementation considerations
For a "simple" incremental improvement we could simply defer loading of free variables until we either use any of them or we hit a branching point where both branches potentially use one of the free variables. (So basically when we come across a multi-alternative case).
However, this means carrying around the current location of each free variable as we traverse the STG AST and generate code. Not hard, but it does add complexity.
Loading all free variables as soon as any one of them is used is essential to make sure we never keep the closure alive for too long, without having to do more work in the compiler. Otherwise we might regress in memory usage compared to the current approach.
This should be strictly better than the current approach already!
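As a rough illustration of what carrying the current location of free variables around could look like, here is a minimal sketch in Haskell. All names and types here are invented for illustration; they are not GHC's actual code-generator API.

```haskell
-- Sketch only: illustrative names, not GHC internals.
import qualified Data.Map.Strict as M

-- Where a free variable currently lives.
data FVLocation
  = InClosure Int     -- still in the closure, at this byte offset
  | InLocal String    -- already loaded into a named local variable
  deriving (Eq, Show)

-- Map from free-variable name to its current location, carried in the
-- code-generator state as we traverse the STG AST.
type FVEnv = M.Map String FVLocation

-- When any free variable is first used (or at a multi-alternative case),
-- load *all* remaining closure-resident free variables into locals, so
-- the closure itself can become garbage.
loadAllFreeVars :: FVEnv -> FVEnv
loadAllFreeVars = M.mapWithKey promote
  where
    promote name (InClosure _) = InLocal ("local_" ++ name)
    promote _    loc           = loc
</imports>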
If we want to get more involved we should track if the closure is reused. As long as a closure is reused along a code path we would like to keep using when it's beneficial. Basically we would as we come across variables check:
- If the free variable is available in a local variable that isn't at risk of having been spilled (by crossing a call). If so reuse it from that location.
- If the closure as a whole or all it's (pointer?) components aren't used past the current point "spill" all free variables into local variables. Reference the variables via these variables from here on out.
- Otherwise reload the variable from the closure.
I don't think the occurence info of "this free variable is still used later by this code path" is currently available. But in principle it should be cheap to compute. (Put an extension point on StgAlt perhaps?). And the current location mapping of free variables can be carried around in the codeGen state.
Sadly I don't expect to get around to addressing any of these points anytime soon.